关注微信:生信小博士
5 Overview of enrichment analysis
Chapter 5 Overview of enrichment analysis | Biomedical Knowledge Mining using GOSemSim and clusterProfiler
5.1.2 Gene Ontology (GO)
Gene Ontology defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:
- MF: Molecular Function
- molecular activities of gene products
- CC: Cellular Component
- where gene products are active
- BP: Biological Process
- pathways and larger processes made up of the activities of multiple gene products
GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationship.
5.1.3 Kyoto Encyclopedia of Genes and Genomes (KEGG)
KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided into 7 broad categories:
- Metabolism
- Genetic information processing
- Environmental information processing
- Cellular processes
- Organismal systems
- Human diseases
- Drug development.
5.1.4 Other gene sets
GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.
Other gene sets include but are not limited to Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb).
富集分析的原理
Over Representation Analysis (ORA)和Gene Set Enrichment Analysis (GSEA)是用于分析基因表达数据的两种常见方法。ORA基于预定义的基因集,用超几何分布计算差异表达基因与该基因集中的基因数之间的显著性,适用于已知基因集和差异表达较大情况。GSEA考虑所有基因,根据基因在表型排序上的变化来寻找整个基因集是否显著富集,比ORA更适用于寻找微小但协调变化的基因集。GSEA还包括较新的leading edge analysis和core enriched genes等功能,能够帮助确定影响基因富集的关键基因。两种方法各有优缺点,在不同的研究问题中可以灵活选用。
5.2 原理1 Over Representation Analysis
Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set1.
d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
## gene.not.interest gene.in.interest
## In_category 2613 28
## not_in_category 15310 29
Whether the overlap(s) of 25 genes are significantly over represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test.
fisher.test(d, alternative = "greater")
5.3 原理2 Gene Set Enrichment Analysis
ORA的方法,对于差异明显的对比结果较好,但是对于差异小的结果不好
A common approach to analyzing gene expression profiles is identifying differentially expressed genes that are deemed interesting. The ORA enrichment analysis is based on these differentially expressed genes. This approach will find genes where the difference is large and will fail where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This is important since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
Genes are ranked based on their phenotypes. Given apriori defined set of gene S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
There are three key elements of the GSEA method:
- Calculation of an Enrichment Score.
- The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic (Subramanian et al. 2005).
- Esimation of Significance Level of ES.
- The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generate a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
- Adjustment for Multiple Hypothesis Testing.
- When the entire gene sets are evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing and also q-values are calculated for FDR control.
We implemented the GSEA algorithm proposed by Subramanian (Subramanian et al. 2005). Alexey Sergushichev implemented an algorithm for fast GSEA calculation in the fgsea (Korotkevich, Sukhov, and Sergushichev 2019) package. In our packages (clusterProfiler, DOSE, meshes and ReactomePA), users can use the GSEA algorithm implemented in DOSE
or fgsea
by specifying the parameter by="DOSE"
or by="fgsea"
. By default, the fgsea
method will be used since it is much more faster.
5.4 原理3 Leading edge analysis and core enriched genes
Leading edge analysis reports Tags
to indicate the percentage of genes contributing to the enrichment score, List
to indicate where in the list the enrichment score is attained and Signal
for enrichment signal strength.
It would also be very interesting to get the core enriched genes that contribute to the enrichment. Our packages (clusterProfiler, DOSE, meshes and ReactomePA) support leading edge analysis and report core enriched genes in GSEA analysis.
详细的富集分析使用范围及其原理
1 GO 富集
GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).
6.1 Supported organisms
GO analyses (groupGO()
, enrichGO()
and gseGO()
) support organisms that have an OrgDb
object available (see also session 2.2).
If a user has GO annotation data (in a data.frame
format with the first column as gene ID and the second column as GO ID), they can use the enricher()
and GSEA()
functions to perform an over-representation test and gene set enrichment analysis.
If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap
function, which will infer indirect annotation and generate a data.frame
that is suitable for both enricher()
and GSEA()
.
6.2 groupGo是 对基因的分类
In clusterProfiler, the groupGO()
function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList
provided by DOSE.
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
# Entrez gene ID
head(gene)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
ggo <- groupGO(gene = gene,
OrgDb = org.Hs.eg.db,
ont = "CC",
level = 3,
readable = TRUE)
head(ggo)
6.3 enrichGO()是通过ORA对基因进行富集分析
The clusterProfiler package implements enrichGO()
for gene ontology over-representation test.
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(ego2, 3)
6.4 gseGO()
是对基因集进行gsea分析,而GSEA是对基因列表基因gsea分析??
gseGO()
The clusterProfiler package provides the gseGO()
function for gene set enrichment analysis using gene ontology.
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
The format of input data, geneList
, was documented in the FAQ. Beware that only gene Set size in [minGSSize, maxGSSize]
will be tested.
A.1 How to prepare your own geneList
GSEA analysis requires a ranked gene list, which contains three features:
- numeric vector: fold change or other type of numerical variable
- named vector: every number has a name, the corresponding gene ID
- sorted vector: number should be sorted in decreasing order
If you import your data from a csv
file, the file should contains two columns, one for gene ID (no duplicated ID allowed) and another one for fold change. You can prepare your own geneList
via the following command:
d = read.csv(your_csv_file)
## assume 1st column is ID
## 2nd column is FC
## feature 1: numeric vector
geneList = d[,2]
## feature 2: named vector
names(geneList) = as.character(d[,1])
## feature 3: decreasing orde
geneList = sort(geneList, decreasing = TRUE)
7.2 KEGG pathway ORA
Input ID type can be kegg
, ncbi-geneid
, ncbi-proteinid
or uniprot
(see also session 16.1.2). Unlike enrichGO()
, there is no readable
parameter for enrichKEGG()
. However, users can use the setReadable() function if there is an OrgDb
available for the species.
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
kk <- enrichKEGG(gene = gene,
organism = 'hsa',
pvalueCutoff = 0.05)
head(kk)
## ID Description ## hsa04110 hsa04110 Cell cycle ## hsa04114 hsa04114 Oocyte meiosis ## hsa04218 hsa04218 Cellular senescence ## hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor ## hsa03320 hsa03320 PPAR signaling pathway ## hsa04914 hsa04914 Progesterone-mediated oocyte maturation ## GeneRatio BgRatio pvalue p.adjust qvalue ## hsa04110 11/94 126/8142 1.829412e-07 3.841764e-05 3.774365e-05 ## hsa04114 10/94 131/8142 2.368439e-06 2.486861e-04 2.443231e-04 ## hsa04218 10/94 156/8142 1.135672e-05 7.949704e-04 7.810235e-04 ## hsa04061 8/94 100/8142 1.821466e-05 9.562698e-04 9.394931e-04 ## hsa03320 7/94 75/8142 2.285993e-05 9.601169e-04 9.432728e-04 ## hsa04914 7/94 102/8142 1.651911e-04 5.781690e-03 5.680256e-03 ## geneID Count ## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232 11 ## hsa04114 991/9133/983/4085/51806/6790/891/9232/3708/5241 10 ## hsa04218 2305/4605/9133/890/983/51806/1111/891/776/3708 10 ## hsa04061 3627/10563/6373/4283/6362/6355/9547/1524 8 ## hsa03320 4312/9415/9370/5105/2167/3158/5346 7 ## hsa04914 9133/890/983/4085/6790/891/5241 7
7.3 KEGG pathway GSEA
kk2 <- gseKEGG(geneList = geneList,
organism = 'hsa',
minGSSize = 120,
pvalueCutoff = 0.05,
verbose = FALSE)
head(kk2)
## ID Description setSize
## hsa05169 hsa05169 Epstein-Barr virus infection 193
## hsa04613 hsa04613 Neutrophil extracellular trap formation 130
## hsa05166 hsa05166 Human T-cell leukemia virus 1 infection 202
## hsa04510 hsa04510 Focal adhesion 190
## hsa04218 hsa04218 Cellular senescence 141
## hsa05170 hsa05170 Human immunodeficiency virus 1 infection 189
## enrichmentScore NES pvalue p.adjust qvalues rank
## hsa05169 0.4335010 1.962039 5.505487e-08 4.239225e-06 2.839672e-06 2820
## hsa04613 0.4496569 1.934454 6.345081e-06 2.442856e-04 1.636363e-04 2575
## hsa05166 0.3893613 1.755469 1.357530e-05 3.484327e-04 2.333999e-04 1955
## hsa04510 -0.4169068 -1.691234 2.771385e-05 5.334917e-04 3.573629e-04 2183
## hsa04218 0.4115945 1.783044 5.168200e-05 7.959028e-04 5.331406e-04 1155
## hsa05170 0.3711150 1.677643 1.118012e-04 1.229814e-03 8.237986e-04 3178
## leading_edge
## hsa05169 tags=39%, list=23%, signal=31%
## hsa04613 tags=37%, list=21%, signal=30%
## hsa05166 tags=26%, list=16%, signal=22%
## hsa04510 tags=27%, list=17%, signal=22%
## hsa04218 tags=17%, list=9%, signal=16%
## hsa05170 tags=38%, list=25%, signal=29%
## core_enrichment
## hsa05169 3627/890/6890/9636/898/9134/6502/6772/3126/3112/4609/917/5709/1869/3654/919/915/4067/4938/864/4940/5713/5336/11047/3066/54205/1871/578/1019/637/916/3383/4939/10213/23586/4793/5603/7979/7128/6891/930/5714/3452/6850/5702/4794/7124/3569/7097/5708/2208/8772/3119/5704/7186/5971/3135/1380/958/5610/4792/10018/8819/3134/10379/9641/1147/5718/6300/3109/811/5606/2923/3108/5707/1432
## hsa04613 820/366/51311/64581/3015/85236/55506/8970/8357/1535/2359/5336/4688/92815/3066/8336/292/1991/3689/8345/5603/4689/5880/10105/1184/6404/3018/6850/5604/3014/7097/1378/8290/1536/834/5605/1183/728/2215/8335/5594/9734/3674/5578/5582/7417/8331/6300
## hsa05166 991/9133/890/4085/7850/1111/9232/8061/701/9700/898/4316/9134/3932/3559/3126/3112/4609/3561/917/1869/1029/915/114/2005/5902/55697/1871/1031/2224/292/1019/3689/916/3383/11200/706/3600/6513/3601/468/5604/7124/1030/3569/4049/4055/10393/3119/5901/5971/1959/3135
## hsa04510 5595/5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa04218 2305/4605/9133/890/983/51806/1111/891/993/3576/1978/898/9134/4609/1869/1029/22808/1871/5499/91860/292/1019/11200/1875
## hsa05170 9133/9582/983/51806/1111/891/6890/200315/917/3654/919/915/5336/54205/91860/578/995/25939/637/1234/916/5603/2792/5880/6891/6921/3452/5604/7124/9978/7097/7852/8772/1174/7186/3135/164/60489/2787/356/7133/5605/27350/6199/1642/5594/4792/3134/5578/4893/8454/5582/2786/1147/3984/6300/200316/811/5606/2923/4775/162/1432/2784/836/5747/5058/3106/2770/5534/5579/4615
7.4 KEGG module ORA. 这里的module指的是msigdb种收集的其他kegg内容,共有
KEGG Module is a collection of manually defined function units. In some situation, KEGG Modules have a more straightforward interpretation.
mkk <- enrichMKEGG(gene = gene,
organism = 'hsa',
pvalueCutoff = 1,
qvalueCutoff = 1)
head(mkk)
3.KEGG 分类备注
KEGG Pathway可以进行分类汇总。
在KEGG官网(KEGG PATHWAY Database)中,将通路分成了7类:
1. Metabolism (代谢)
2. Genetic Information Processing (遗传信息处理)
3. Environmental Information Processing (环境信息处理)
4. Cellular Processes (细胞过程)
5. Organismal Systems (生物系统)
6. Human Diseases (人类疾病)
7. Drug Development (药物开发)
做出一个分类汇总图,这样可以在更上一层级研究pathway。主要应用于特殊物种的denovo测序注释,也可以应用于我们常规的pathway富集分析。
7.5 KEGG module gene set enrichment analysis
mkk2 <- gseMKEGG(geneList = geneList,
organism = 'hsa',
pvalueCutoff = 1)
head(mkk2)
通路可视化
library("pathview")
hsa04110 <- pathview(gene.data = geneList,
pathway.id = "hsa04110",
species = "hsa",
limit = list(gene=max(abs(geneList)), cpd=1))
browseKEGG(kk, 'hsa04110')