如何使用clusterProfiler包利用eggnog-mapper软件注释结果做GO和KEGG富集分析

发布时间：2021-09-19 19:29:59 作者：小新
来源：亿速云阅读：1367

# 如何使用clusterProfiler包利用eggnog-mapper软件注释结果做GO和KEGG富集分析

## 摘要
本文详细介绍了如何利用R语言中的clusterProfiler包对eggnog-mapper软件的注释结果进行基因本体论(GO)和京都基因与基因组百科全书(KEGG)通路富集分析。文章包含完整的分析流程、代码示例、结果解读以及常见问题解决方案，适用于生物信息学研究人员进行功能基因组学研究。

## 引言

功能富集分析是解读高通量组学数据的重要方法，能够帮助研究者理解差异表达基因或目标基因集合的生物学功能。eggnog-mapper是一款高效的直系同源蛋白注释工具，而clusterProfiler是R语言中最流行的富集分析工具之一。

本文将分步指导如何：

1. 准备eggnog-mapper注释结果
2. 使用clusterProfiler进行GO富集分析
3. 进行KEGG通路富集分析
4. 可视化富集结果

## 1. 软件安装与环境准备

### 1.1 安装R和必要包

```r
# 安装clusterProfiler及相关包
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "enrichplot", "DOSE"))

1.2 加载所需库

library(clusterProfiler)
library(org.Hs.eg.db) # 根据物种选择对应数据库
library(enrichplot)
library(ggplot2)

注意：根据研究对象选择适当的OrgDb，如小鼠用org.Mm.eg.db，大鼠用org.Rn.eg.db等。

2. eggnog-mapper结果预处理

2.1 eggnog-mapper输出文件格式

典型输出文件(annotations.emapper.annotations)包含多列信息，关键列包括：

query_name: 基因/蛋白ID
GOs: GO术语(如GO:0008150,GO:0003674)
KEGG_ko: KEGG通路标识(如ko00010,ko00230)

2.2 数据导入与清洗

# 读取eggnog注释结果
eggnog <- read.delim("annotations.emapper.annotations", 
                    header = TRUE, 
                    comment.char = "#",
                    stringsAsFactors = FALSE)

# 提取基因ID和GO注释
gene_go <- eggnog[, c("query_name", "GOs")]
colnames(gene_go) <- c("GeneID", "GO")

# 去除无GO注释的基因
gene_go <- gene_go[gene_go$GO != "", ]

2.3 构建GO注释列表

# 分割多个GO项
go_list <- strsplit(gene_go$GO, ",")

# 创建数据框
go_annot <- data.frame(
  GeneID = rep(gene_go$GeneID, sapply(go_list, length)),
  GO = unlist(go_list)
)

# 去除重复项
go_annot <- unique(go_annot)

3. GO富集分析

3.1 准备基因列表

假设我们有一个感兴趣的基因集：

interest_genes <- c("gene1", "gene2", "gene3", ...) # 替换为实际基因ID

3.2 GO富集分析

# 使用enricher函数进行自定义GO富集
ego <- enricher(
  gene = interest_genes,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  universe = unique(go_annot$GeneID), # 背景基因集
  TERM2GENE = go_annot[, c("GO", "GeneID")]
)

# 查看简要结果
head(ego)

3.3 使用OrgDb进行GO富集（替代方法）

如果基因ID能被OrgDb识别：

ego <- enrichGO(
  gene = interest_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL", # 根据ID类型调整
  ont = "ALL", # 可指定"BP","MF"或"CC"
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  qvalueCutoff = 0.2
)

4. KEGG富集分析

4.1 提取KEGG注释

# 从eggnog结果提取KEGG信息
gene_kegg <- eggnog[, c("query_name", "KEGG_ko")]
gene_kegg <- gene_kegg[gene_kegg$KEGG_ko != "", ]

# 分割多个KEGG条目
kegg_list <- strsplit(gene_kegg$KEGG_ko, ",")
kegg_annot <- data.frame(
  GeneID = rep(gene_kegg$query_name, sapply(kegg_list, length)),
  KEGG = unlist(kegg_list)
)

4.2 KEGG富集分析

ekegg <- enricher(
  gene = interest_genes,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  universe = unique(kegg_annot$GeneID),
  TERM2GENE = kegg_annot[, c("KEGG", "GeneID")]
)

# 或使用clusterProfiler内置KEGG数据库
ekegg <- enrichKEGG(
  gene = interest_genes,
  organism = "hsa", # 人类，其他物种需更改
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH"
)

5. 结果可视化

5.1 条形图

barplot(ego, showCategory = 20)

5.2 点图

dotplot(ego, showCategory = 15)

5.3 网络图

cnetplot(ego, categorySize = "pvalue", foldChange = geneList)

5.4 GO有向无环图

goplot(ego)

5.5 KEGG通路图

# 需要pathview包
library(pathview)
pathview(gene.data = geneList, pathway.id = "hsa04110")

6. 高级分析与结果解读

6.1 GO语义相似性分析

# 简化冗余GO项
ego2 <- simplify(ego, cutoff = 0.7, by = "p.adjust")

6.2 多组比较

# 比较不同基因集的富集结果
ck <- compareCluster(geneCluster = list(Group1 = genes1, Group2 = genes2),
                   fun = "enrichGO",
                   OrgDb = org.Hs.eg.db)
dotplot(ck)

6.3 结果导出

# 保存结果表格
write.csv(ego@result, "GO_enrichment_results.csv")
write.csv(ekegg@result, "KEGG_enrichment_results.csv")

# 保存图形
ggsave("GO_dotplot.png", dotplot(ego), width = 10, height = 8)

7. 常见问题解答

Q1: 如何解决”No gene can be mapped”错误？

A: 检查基因ID类型是否正确，必要时使用bitr()函数转换ID：

gene_df <- bitr(interest_genes, 
               fromType = "SYMBOL", 
               toType = "ENTREZID",
               OrgDb = org.Hs.eg.db)

Q2: 如何选择合适的背景基因集？

A: 背景集应包含所有可能被检测到的基因，通常使用: - 整个转录组测序检测到的基因 - 芯片上所有的探针对应基因 - 整个基因组基因集

Q3: 富集结果不显著怎么办？

A: 尝试: 1. 放宽p值阈值(如0.1) 2. 使用更宽松的校正方法(“none”或”fdr”) 3. 检查基因ID转换是否正确 4. 考虑使用GSEA方法

8. 结论

本文介绍了基于eggnog-mapper注释结果进行GO和KEGG富集分析的完整流程。通过clusterProfiler包，研究人员可以高效地解读基因集的生物学功能和通路特征。该方法可广泛应用于各种组学数据的生物功能分析。

参考文献

Wu T, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021.
Huerta-Cepas J, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019.
Kanehisa M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017.

附录：完整代码示例

# 完整示例代码
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

# 1. 读取数据
eggnog <- read.delim("annotations.emapper.annotations", header = TRUE, comment.char = "#")

# 2. 准备GO注释
gene_go <- eggnog[, c("query_name", "GOs")]
gene_go <- gene_go[gene_go$GOs != "", ]
go_list <- strsplit(gene_go$GOs, ",")
go_annot <- data.frame(
  GeneID = rep(gene_go$query_name, sapply(go_list, length)),
  GO = unlist(go_list)
)

# 3. 富集分析
interest_genes <- c("gene1", "gene2", "gene3") # 替换为实际基因
ego <- enricher(gene = interest_genes,
               pvalueCutoff = 0.05,
               universe = unique(go_annot$GeneID),
               TERM2GENE = go_annot)

# 4. 可视化
dotplot(ego, showCategory = 15)

提示：实际分析时请根据具体数据调整参数和步骤。 “`