clusterProfile
clusterProfiler初测
clusterProfiler是一个基于基因、基因集的集统计分析和可视化的R包,支持多种over-representation检验和基因本体富集分析,可进行分析如:Gene Ontology、EGG Pathway、MeSH 、Reactome Pathway、DAVID 、Disease Ontology等。
R包说明文档参考:
【功能概要】 https://guangchuangyu.github.io/software/clusterProfiler/
【使用说明】 https://guangchuangyu.github.io/2016/01/go-analysis-using-clusterprofiler/
【使用手册】https://yulab-smu.github.io/clusterProfiler-book/chapter1.html,,很详细
【参考文献】 Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
Gene Ontology富集分析:
clusterProfiler提供使用物种的OrgDb 对象来进行富集分析,Bioconductor官网已经提供了多达20个物种的OrgDb对象文件;当然用户也可以使用AnnotationHub
(该基于R 3.6构建)包来自定义构建OrgDb对象。
Bioconductor OrgDb:OrgDb
1. 如何使用
1.1 基于已有的OrgDb
对象进行GO富集分析
使用clusterProfiler需要提前做的是安装相应物种的OrgDb包,这一点比较麻烦,可使用 BiocManager::install("包") 进行安装。 clusterProfiler所使用的OrgDb支持多种基因ID类型,不推荐使用SYMBOL类型。
需要安装的包有:
clusterProfiler、AnnotationHub(自建库)、topGO(DAG分析)
clusterProfiler版本:3.12.0
富集分析由enrichGO模块功能来实现:
enrichGO GO Enrichment Analysis of a gene set. Given a vector of genes, this
function will return the enrichment GO categories after FDR control.
Usage:
enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10,
maxGSSize = 500, readable = FALSE, pool = FALSE)
Arguments:
gene a vector of entrez gene id.
OrgDb OrgDb
keyType keytype of input gene
ont One of "MF", "BP", and "CC" subontologies or 'ALL'.
pvalueCutoff Cutoff value of pvalue.
pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
universe background genes
qvalueCutoff qvalue cutoff
minGSSize minimal size of genes annotated by Ontology term for testing.
maxGSSize maximal size of genes annotated for testing
readable whether mapping gene ID to gene Name
pool If ont=’ALL’, whether pool 3 GO sub-ontologies
操作实例:
#载入Hs包:Genome wide annotation for Human
> library(org.Hs.eg.db)
# 看看Db库中包含哪些内容
> keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MAP" "OMIM"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
[26] "UNIPROT"
# 造一个测试数据列表,内容为ENSEMBL基因ID
> DATA <- read.delim("geneList.txt", sep="\n", header=F)
> head(DATA)
V1
1 ENSG00000196611
2 ENSG00000093009
3 ENSG00000109255
4 ENSG00000134690
5 ENSG00000065328
6 ENSG00000117399
> gene1 <- c(t(DATA))
> gene1
[1] "ENSG00000196611" "ENSG00000093009" "ENSG00000109255" "ENSG00000134690"
[5] "ENSG00000065328" "ENSG00000117399"
#通过ensembl的基因ID号可以找到Db文件中其他信息的对应关系,例如:
> cols <- c("SYMBOL", "GENENAME")
# 使用select函数读取gene1在Db文件中的对应信息
> select(org.Hs.eg.db, keys=gene1, columns=cols, keytype="ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENSEMBL SYMBOL
1 ENSG00000196611 MMP1
2 ENSG00000093009 CDC45
3 ENSG00000109255 NMU
4 ENSG00000134690 CDCA8
5 ENSG00000065328 MCM10
6 ENSG00000117399 CDC20
GENENAME
1 matrix metallopeptidase 1
2 cell division cycle 45
3 neuromedin U
4 cell division cycle associated 8
5 minichromosome maintenance 10 replication initiation factor
6 cell division cycle 20
#再例如检索SOX1基因,以SYMBOL基因为属性
> select(org.Hs.eg.db, keys="SOX1", columns=c("ENSEMBL","ENTREZID","GO","GENENAME"), keytype="SYMBOL")
'select()' returned 1:many mapping between keys and columns
SYMBOL ENSEMBL ENTREZID GO EVIDENCE ONTOLOGY GENENAME
1 SOX1 ENSG00000182968 6656 GO:0000978 IDA MF SRY-box 1
2 SOX1 ENSG00000182968 6656 GO:0000981 ISA MF SRY-box 1
3 SOX1 ENSG00000182968 6656 GO:0000981 ISM MF SRY-box 1
4 SOX1 ENSG00000182968 6656 GO:0000981 NAS MF SRY-box 1
5 SOX1 ENSG00000182968 6656 GO:0001228 IEA MF SRY-box 1
... ...
# 此外,提供bitr方法进行数据ID的转换
# 例如,从ENSEMBL编号转换为ENTREZID编号,但都是基于数据库记录的信息
bitr(gene1, fromType="ENSEMBL", toType=c("ENTREZID","SYMBOL"), OrgDb="org.Hs.eg.db")
1 ENSG00000120784 22835 ZFP30
2 ENSG00000120798 7181 NR2C1
... ...
Warning message:
In bitr(gene1, fromType = "ENSEMBL", toType = c("ENTREZID", "SYMBOL"), :
0.62% of input gene IDs are fail to map...
# 所以,clusterProfiler基于上述数据结构可以根据任意的获取基因“ID”和多种注释信息
# keytype支持ENSEMBL、SYMBOL、ENTREZID
# 基于gene1列表调取Db文件中的信息
> gene.df <- select(org.Hs.eg.db, keys=gene1, columns=c("ENTREZID", "ENSEMBL", "SYMBOL"), keytype="ENSEMBL")
> head(gene.df)
ENSEMBL ENTREZID SYMBOL
1 ENSG00000196611 4312 MMP1
2 ENSG00000093009 8318 CDC45
3 ENSG00000109255 10874 NMU
4 ENSG00000134690 55143 CDCA8
5 ENSG00000065328 55388 MCM10
6 ENSG00000117399 991 CDC20
# 载入包
> library(clusterProfiler)
# 对细胞组分大类CC进行富集分析
> go <- enrichGO(gene=gene.df$ENSEMBL, OrgDb=org.Hs.eg.db, keyType='ENSEMBL', ont="CC", pAdjustMethod="BH", pvalueCutoff=0.01, qvalueCutoff=0.05)
# ont也可设置'ALL',即对BP,CC和MF进行整体分析,而不是单独进行富集分析
# 如果基因数目少,无富集会产生以下结果
> go
#
# over-representation test
#
#...@organism Homo sapiens
#...@ontology CC
#...@keytype ENSEMBL
#...@gene chr [1:6] "ENSG00000196611" "ENSG00000093009" "ENSG00000109255" ...
#...pvalues adjusted by 'BH' with cutoff <0.01
#...0 enriched terms found
# 若有富集则显示以下结果:
> go
#
# over-representation test
#
#...@organism Homo sapiens
#...@ontology CC
#...@keytype ENSEMBL
#...@gene chr [1:5000] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
#...pvalues adjusted by 'BH' with cutoff <0.01
#...83 enriched terms found
'data.frame': 83 obs. of 9 variables:
$ ID : chr "GO:0015629" "GO:0005819" "GO:0099568" "GO:0098793" ...
$ Description: chr "actin cytoskeleton" "spindle" "cytoplasmic region" "presynapse" ...
$ GeneRatio : chr "164/4726" "126/4726" "146/4726" "157/4726" ...
$ BgRatio : chr "464/20630" "348/20630" "423/20630" "467/20630" ...
$ pvalue : num 5.63e-10 1.09e-08 2.83e-08 6.22e-08 1.88e-07 ...
$ p.adjust : num 4.19e-07 4.05e-06 7.02e-06 1.16e-05 2.80e-05 ...
$ qvalue : num 2.82e-07 2.73e-06 4.72e-06 7.79e-06 1.88e-05 ...
$ geneID : chr "ENSG00000000938/ENSG00000002834/ENSG00000005206/ENSG00000005471/ENSG00000006453/ENSG00000006747/ENSG00000006788"| __truncated__ "ENSG00000002822/ENSG00000004897/ENSG00000005022/ENSG00000007168/ENSG00000008323/ENSG00000010244/ENSG00000013573"| __truncated__ "ENSG00000002834/ENSG00000004939/ENSG00000006747/ENSG00000007168/ENSG00000007171/ENSG00000007174/ENSG00000007908"| __truncated__ "ENSG00000003147/ENSG00000005379/ENSG00000007516/ENSG00000008056/ENSG00000008282/ENSG00000011347/ENSG00000018236"| __truncated__ ...
$ Count : int 164 126 146 157 127 26 142 113 126 131 ...
# GO富集统计如下:
> summary(go)
# 数据输出
data <- as.data.frame(go)
write.table(data, file="go_enrich.xls", row.names=F, sep="\t")
1.2 自己构建Db文件
通过使用AnnotationHub包进行构建
What is AnnotationHub ?
AnnotationHub is a new approach to providing annotation resources to the Bioconductor community. The initial plan is to make available fasta, GFF / GTF, BED, VCF, and similar files from entities such as Ensembl, UCSC, ENCODE, and 1000 Genomes projects as objects ready for work in Bioconductor, e.g., GRanges representations of BED files.
# 依照作者测试
library(AnnotationHub)
hub <- AnnotationHub()
# 该步骤测试失败,可能是网络问题:
#Error in .util_download(x, rid[i], proxy, config, "bfcadd()", ...) :
# bfcadd() failed; see warnings()
#In addition: Warning messages:
#1: download failed
# web resource path: #‘https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3’
# local file path: ‘/root/.cache/AnnotationHub/4272463205a1_annotationhub.sqlite3’
# reason: An unknown option was passed in to libcurl
#2: bfcadd() failed; resource removed
# rid: BFC3
# fpath: ‘https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3’
# reason: download failed
》 确实有遇到过该问题: 手动下载,https://links.jianshu.com/go?to=https%3A%2F%2Fannotationhub.bioconductor.org%2Fmetadata%2Fannotationhub.sqlite3
# 选择一个物种
query(hub, "Cricetulus")
## AnnotationHub with 4 records
## # snapshotDate(): 2015-12-29
## # $dataprovider: UCSC, Inparanoid8, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Cricetulus griseus
## # $rdataclass: ChainFile, Inparanoid8Db, OrgDb, TwoBitFile
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH10393"]]'
##
## title
## AH10393 | hom.Cricetulus_griseus.inp8.sqlite
## AH13980 | criGri1.2bit
## AH14346 | criGri1ToHg19.over.chain.gz
## AH48061 | org.Cricetulus_griseus.eg.sqlite
# 基于列出的文件进行选择
Cgriseus <- hub[["AH48061"]]
关于AnnotationHub使用可参考: https://www.cnblogs.com/yatouhetademao/p/8018252.html
1.3 富集结果可视化
# 条形图
barplot(go,showCategory=20,drop=T)
# 散点图
dotplot(go, showCategory=20)
# 这步测试报错: wrong orderBy parameter; set to default `orderBy = "x"`
# DAG有向无环图绘制
# 只能单独对BP、CC或MF进行DAG绘制
plotGOgraph(go)
# 若go为整体富集,如仅对BP组分绘制DAG
plotGOgraph(go.BP)
散点图示意图:
1.4 KEGG富集分析
KEGG Enrichment Analysis of a gene set. Given a vector of genes,
this function will return the enrichment KEGG categories with FDR control.
Usage:
enrichKEGG(gene, organism = "hsa", keyType = "kegg",
pvalueCutoff = 0.05, pAdjustMethod = "BH", universe,
minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
use_internal_data = FALSE)
Arguments:
gene a vector of entrez gene id.
organism supported organism listed in ’http://www.genome.jp/kegg/catalog/org_list.html’
keyType one of "kegg", ’ncbi-geneid’, ’ncib-proteinid’ and ’uniprot’
pvalueCutoff Cutoff value of pvalue.
pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
universe background genes
minGSSize minimal size of genes annotated by Ontology term for testing.
maxGSSize maximal size of genes annotated for testing
qvalueCutoff qvalue cutoff
use_internal_data logical, use KEGG.db or latest online KEGG data
keyType支持kegg、 ncbi-geneid、ncib-proteinid和uniprot
enrichKEGG只支持entrez gene ID,这一点比较麻烦
# 数据ID转换为entrez 编号
gene2 <- bitr(gene1, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
data <- gene2$ENTREZID
#
kegg <- enrichKEGG(data,
organism = 'hsa',
keyType = 'kegg',
pvalueCutoff = 0.05,
pAdjustMethod = 'BH',
minGSSize = 10,
maxGSSize = 500,
qvalueCutoff = 0.2,
use_internal_data = FALSE)
# 查看数据分析结果
head(kegg)
ID Description GeneRatio
hsa05165 hsa05165 Human papillomavirus infection 139/2395
hsa04010 hsa04010 MAPK signaling pathway 123/2395
hsa04120 hsa04120 Ubiquitin mediated proteolysis 64/2395
hsa04510 hsa04510 Focal adhesion 87/2395
hsa04141 hsa04141 Protein processing in endoplasmic reticulum 73/2395
hsa04668 hsa04668 TNF signaling pathway 53/2395
~
BgRatio pvalue p.adjust qvalue geneID Count
330/7878 2.726058e-06 0.0008696126 0.0006628626 572/51384/... 139
295/7878 1.829656e-05 0.0029183006 0.0022244760 020/10368/... 123
137/7878 3.777734e-05 0.0032083090 0.0024455350 90293/996/... 64
199/7878 4.022958e-05 0.0032083090 0.0024455350 572/3675/... 87
165/7878 1.038498e-04 0.0059955161 0.0045700849 7095/51360/... 73
112/7878 1.127683e-04 0.0059955161 0.0045700849 843/8837/... 53
kegg富集简单可视化:
# 散点图
dotplot(kegg, showCategory=30)
1.5 总结
- clusterProfile使用简单,功能丰富,GO、KEGG、DAVID、GSEA等,非常适合对基因/基因集进行功能研究;
- 使用该包需要克服的问题是需要建库,最好将20种物种和生僻物种数据文件本地化,所以关键是自己构建OrgDb;
- 作者一直在维护和更新,不时加入新功能和改进。
补充:非模式物种clusterProfiler进行富集分析 :
# 载入包
> library(clusterProfiler)
# 设置我的工作路径
> setwd("/home/shaohuaihan/bin/Iso-seq/image/R")
# 读取GO注释文件
> go <- read.table("test_go_annot.txt", sep="\t", quote="")
# 查看GO库数据格式
> head(go)
V1 V2 V3 V4
1 GO:0000015 phosphopyruvate_hydratase_complex cellular_component gene1
2 GO:0000015 phosphopyruvate_hydratase_complex cellular_component gene2
3 GO:0000015 phosphopyruvate_hydratase_complex cellular_component gene2
# 各列分别是GO term,GO sub term,GO top term和基因ID
# 读取差异基因列表
> gene <- read.table("test_gene.txt")
# 查看基因数据
> head(gene)
V1
1 gene100
2 gene200
3 gene300
# 处理差异基因列表,"data.frame" —> "character"
> gene <- as.vector(gene$V1)
# GO to gene数据框
> go_term2gene <- data.frame(go$V1, go$V4)
> head(go_term2gene)
go.V1 go.V4
1 GO:0000015 gene10
2 GO:0000015 gene23
3 GO:0000015 gene32
# 表头命名
> names(go_term2gene) <- c("go_term", "gene")
# GO term to GO name数据框
> go_term2name <- data.frame(go$V1, go$V2)
> head(go_term2name)
go.V1 go.V2
1 GO:0000015 phosphopyruvate_hydratase_complex
2 GO:0000015 phosphopyruvate_hydratase_complex
3 GO:0000015 phosphopyruvate_hydratase_complex
# 表头命名
> names(go_term2name) <- c("go_term", "name")
# 至此,建立了GO Term与基因ID和功能的关系
# 使用enrichr函数进行GO富集分析
> GO <- enricher(gene=gene, pvalueCutoff = 0.05, qvalueCutoff = 0.2, pAdjustMethod = "BH", TERM2GENE = go_term2gene, TERM2NAME = go_term2name)
# 查 看 结 果
> GO
# over-representation test
#
#...@organism UNKNOWN
#...@ontology UNKNOWN
#...@gene chr [1:774] "Lal000100" "Lal001330" "Lal001430" "Lal001690" "Lal001990" ...
#...pvalues adjusted by 'BH' with cutoff <0.05
#...6 enriched terms found
# 省略,与上文富集展示一样
# 文件输出
> write.table(as.data.frame(GO), file="GO_enrichment.txt", row.names = F, sep="\t")
# 表头如下:
# ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
# 可视化
# 条形图
> barplot(GO, showCategory=10)
# 散点图
> dotplot(GO) #还是报错: wrong orderBy parameter;set to default `orderBy = "x"` ^_^!
# ontology追加细胞组分(CC)
> GO@ontology = "CC"
# 注意:由于不依赖于软甲的OrgDb数据库,CC,MF,BP基因需要单独去做富集分析
### KEGG套路一样 ###
# 读取KEGG注释文件
> kegg <- read.table("test_kegg_annot.txt", sep="\t", quote="")
# 查看数据
> head(kegg)
V1 V2 V3
1 gene1 K15397 KCS
2 gene2 K01051 E3.1.1.11
3 gene3 K14508 NPR1
# 同理
> kegg_term2gene <- data.frame(kegg$V2, kegg$V1)
> kegg_term2name <- data.frame(kegg$V2, kegg$V3)
> names(kegg_term2gene) <- c("ko_term", "gene")
> names(kegg_term2name) <- c("ko_term", "name")
# 查看数据
> head(kegg_term2gene)
ko_term gene
1 K15397 gene8
2 K01051 gen11
3 K14508 gene92
> head(kegg_term2name)
ko_term name
1 K15397 KCS
2 K01051 E3.1.1.11
3 K14508 NPR1
# 使用enrichr函数进行KEGG的富集分析
> KEGG <- enricher(gene=gene, pvalueCutoff = 0.05, qvalueCutoff = 0.2, AdjustMethod = "BH", TERM2GENE = kegg_term2gene, TERM2NAME = kegg_term2name)
# 输出到文件
> write.table(as.data.frame(KEGG), file="KEGG_enrichment.txt", row.names = F, sep="\t")
# 表头信息如下
# ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
# 可视化
> barplot(KEGG, showCategory=10)
> dotplot(KEGG)
# dev.off()
关于enricher方法:
enricher package:clusterProfiler R Documentation
enricher
Description:
A universal enrichment analyzer
Usage:
enricher(gene, pvalueCutoff = 0.05, pAdjustMethod = "BH", universe,
minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2, TERM2GENE,
TERM2NAME = NA)
Arguments:
gene: a vector of gene id
pvalueCutoff: pvalue cutoff
pAdjustMethod: one of "holm", "hochberg", "hommel", "bonferroni", "BH",
"BY", "fdr", "none"
universe: background genes
minGSSize: minimal size of genes annotated for testing
maxGSSize: maximal size of genes annotated for testing
qvalueCutoff: qvalue cutoff
TERM2GENE: user input annotation of TERM TO GENE mapping, a data.frame
of 2 column with term and gene
TERM2NAME: user input of TERM TO NAME mapping, a data.frame of 2 column
with term and name
Value:
A ‘enrichResult’ instance
Author(s):
Guangchuang Yu
一个通用的超几何检统计检验方法
2019/07/23-24