# Impact of tuberculosis

Worldwide 9.6 million new cases and 1.4 million deaths annualy; almost 1/3 of the worlds population may be infected.

# We needed a package for enrichment analysis

• Custom variable sets (expression modules, metabolic profiling)
• Statistical test(s):
• not relying on arbitrary thresholds
• preferably no bootstrapping
• detached from differential analysis
• pluggable into ML and MDS (e.g. PCA)
• Integrating in our R pipelines
• Highly flexible – diverse projects with different problem settings
• “Bird’s eye” visualizations for multiple analyses

(xkcd.com/927)

# Introducing tmod

“Although there is no particular novelty in the methods, the package addresses the right questions and appears to do a good job on real biological analysis.” (Anonymous reviewer)

– Perfect!

# tmod features

• CERNO: a new(-ish) statistical test for continuous enrichments (Yamaguchi et al. 2008, not implemented elsewhere)
• MSD: a new method for ordering genes
• “panel plots” – visualisation of (serial) gene set enrichment results
• prepackaged gene sets from Chaussabel et al. (2008) and Li et al. (2014) and metabolic profiling clustering from Weiner et al. (2012)
• facilities for combining multivariate techniques / dimensionality reduction techniques / correlation analysis / machine learning with enrichment analysis

# CERNO test: a variant of Fisher’s exact test

Some enrichment tests (such as the hypergeometric test) rely on arbirtrary tresholds to divide the genes into “differentially expressed” and “background” (or equivalent sets). It is easy to run a statistical test on such a setup, however it is problematic: the number of significantly regulated genes depends on the statistical power, i.e. for example on the number of samples.

Better tests yet are independent of arbitrary thresholds. Examples include

• Randomization approaches (such as GSEA)
• ANOVA-like approaches
• Mann-Whitney U statistic

However, all methods that we have tested were problematic (eg. too conservative; too lax). Most importantly, while results were robust for large sample sets, they were not “down-scalable”: results for small sample sizes rarely reflected the results for large sample sizes.

# CERNO: Ranks can be treated as probabilities

$P(rank(g_j) < rank(g_i)) = \frac{rank(g_i)}{N}$

Where $$N$$ is the total number of genes.

# We apply Fisher’s method to ranks

$\mathbf{CERNO} = -2\cdot \sum_{i=1}^{N}\ln(\frac{rank(g_i)}{N})$

The statistics has a $$\chi^2$$ distribution with $$2\cdot N$$ degrees of freedom.

First, second and third quartiles of number of modules recovered by the different statistical tests in dependence of the sample size in 100 random sample replicates.

# How to order genes?

• Order by p-values (common approach).
• Genes with strong expression tend to have lower p-values even if log-fold changes are small
• Order by (absolute) log fold change
• Genes with weak expression (near background) can have huge log fold changes despite lack of significance

# MSD – Minimal Significant Difference

$\text{MSD} = \begin{cases} CI.L & \text{if logFC} > 0\\ -CI.R & \text{if logFC} < 0\\ \end{cases}$

First, second and third quartiles of number of modules recovered by the different approaches in dependence of the sample size in 100 random sample replicates.

In comparisons with other ordering techniques, MSD has an exceptional specificity, while maintaining good sensitivity (Joanna Żyła, personal communication).

# Basic visualization of enrichment

In an U-test, the U statistic is (almost) the same as the Area Under Curve:

$r=1-\frac{2\cdot U}{n_1\cdot n_2} = 1 - 2\cdot\text{AUC}$

(r is the effect size for an U-test)

evidencePlot(l=tt$GENE_SYMBOL, m="LI.M11.0") # Panel plots showing effect sizes, p-values and direction of change # A more complex example # Functional multivariate analysis # Multivariate analysis + enrichment = Functional multivariate analysis (FMA) Combination of multivariate techniques such as PCA and functional enrichment analysis can circumvent the need for analysis of differential expression. A primer on FMA will be presented here. Question in FMA: What do these components mean? # Enrichment in component 4 This approach works well also with other multivariate analyses such as independent component analysis (ICA), partial least squares (PLS) or correspondance analysis (CA). Directly combining multivariate analyses with gene set enrichment allows us to achieve the same results without involving a direct group - to - group comparison. This makes it especially suitable for exploratory analyses. # tmod Web Interface # Concluding remarks • Gene set enrichment analysis is a versatile tool for functional annotation • Functional multivariate analysis can replace differential expression analysis • tmod: R package for BTM and GS enrichment analysis, available from http://bioinfo.mpiib-berlin.mpg.de/tmod/ and CRAN • tmod allows functional multivariate analysis and serial enrichment analyiss • features several visualization tools • Where to find me: MPIIB january@mpiib-berlin.mpg.de # Conributors • Teresa Domaszewska (MPIIB) – co-author (see our poster) We would like to thank Emilio Siena (GSK Vaccines) for new ideas, as well Joanna Żyła, Michał Marczyk and Joanna Polańska (Politechnika Śląska) for helpful discussions and the BMBF / InfectControl2020 for supporting Teresa Domaszewska. # Appendix 1: About this presentation # About this presentation This is an Rmarkdown document; you can view the code on http://bioinfo.mpiib-berlin.mpg.de/tmod/ You can download the source code of this presentation on the tmod web page, http://bioinfo.mpiib-berlin.mpg.de/tmod/. To recreate this presentation, download the full presentation package and unzip it. Install the required packages (knitr for R and pandoc). Run the following command from inside the package archive. Commands: Rscript -e 'knitr::knit("weiner_bioinfo_2015_06_23.Rmd")' pandoc -s -S -t revealjs weiner_bioinfo_2015_06_23.md -o weiner_bioinfo_2015_06_23.html \ --mathjax='http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' \ --css css/mytheme.css \ --slide-level 2 -V theme=blood (Note: for an offline version, download MathJax and modify the –mathjax option) To extract the code from this presentation, save it as “test.Rmd” and run Rscript -e 'knitr::purl("tmod.Rmd")' # Printing: This only works properly in Google Chromium; see reveal.js documentation To print, follow the link below and press Ctrl-P; don’t worry if the slides appear to overlap – they will look fine on the print preview. Print # Appendix 2: Serial analysis of enrichment with tmod tmod has been designed as a package for testing the enrichment of blood transcriptional modules. Therefore, tmod contains two sets of blood transcriptional module definitions; however, it can be used with any arbitrary gene set definition (e.g. GSEA/MSigDB) or high throughput data type (e.g. metabolomics) tmod implements HG / U / CERNO tests, functional multivariate analyses, serial analysis / visualization and more. Availability: http://bioinfo.mpiib-berlin.mpg.de/tmod/ # Example: MFA with R and tmod Data set Egambia: GEO GSE28623. library(tmod) data(Egambia) head(Egambia) ## GENE_SYMBOL GENE_NAME EG NID NID NID NID NID NID NID NID ## 34 C19orf15 chromosome 19 open reading frame 15 57828 3.2618218 1.339813 2.462880 3.9795714 4.626591 4.253879 2.6972224 3.804575 ## 36 UNQ9368 RTFV9368 643036 1.5671748 4.952589 4.065492 4.6410409 3.207639 5.552805 4.1901659 3.257324 ## 41 ADORA3 adenosine A3 receptor 140 6.2246027 5.078945 3.862936 6.1976618 5.410801 5.794503 5.0215809 8.017599 ## 44 CDH6 cadherin 6, type 2, K-cadherin (fetal kidney)" 1004 0.8328559 3.585615 2.975072 0.6475284 1.222587 0.143559 -0.7720002 0.615626 ## 52 VASH1 vasohibin 1 22846 11.3952226 12.098981 13.057666 13.0795387 11.182162 10.747176 12.9036129 12.261216 ## 62 MAB21L2 mab-21-like 2 (C. elegans) 10586 5.7530317 7.517327 6.777747 7.0907261 6.614020 10.917492 8.9160131 5.323171 ## NID NID NID NID NID NID NID TB TB TB TB TB TB TB TB ## 34 4.617986 3.033595 3.1866326 3.6506719 3.787375 3.019342 2.795293 3.020139 2.269359 3.325500 4.0871459 4.723224 3.599277 1.571850 3.6313736 ## 36 4.786995 3.091925 2.2736422 4.1327518 3.934754 3.077131 6.428547 4.655419 6.460367 4.607918 4.7004430 5.967757 5.641816 5.006141 5.2302712 ## 41 6.878103 4.702415 7.6848512 5.2048066 4.836591 4.965997 8.234983 5.072141 4.356401 7.043360 7.3670729 5.805350 6.862796 4.134708 5.5752895 ## 44 2.589377 3.307486 0.7026353 0.8349973 3.951534 2.112500 1.223633 1.477250 5.159376 1.517965 0.4115781 1.584342 1.148472 1.710048 0.8883513 ## 52 11.376962 13.061029 13.0915988 12.0304966 11.980200 12.323327 11.076847 13.187186 13.122720 12.188427 10.4137276 11.128346 10.889193 11.062450 12.9295464 ## 62 7.167419 6.299295 5.8910289 5.4252899 5.265659 6.367774 6.691451 6.351070 6.124325 5.434126 5.5917154 7.298053 6.607853 5.594411 5.2920568 ## TB TB TB TB TB TB TB ## 34 3.962293 2.080173 3.750405 2.248475 4.148280 4.203384 4.319223 ## 36 5.551801 5.021816 5.338259 6.258222 6.383069 5.995486 5.203686 ## 41 7.689227 6.004437 5.928957 5.178725 5.661376 6.611350 6.008429 ## 44 0.977040 1.174805 1.764985 3.400484 2.486234 1.115761 1.454525 ## 52 11.921874 10.830077 11.106051 13.062118 10.608390 12.421606 13.269083 ## 62 11.736111 5.887298 5.672741 6.621244 4.729986 5.288187 8.995319 pca <- prcomp(t(Egambia[,-c(1:3)]), scale.=TRUE) names(pca) ## [1] "sdev" "rotation" "center" "scale" "x" head(pca$x[,1:5])
##             PC1        PC2        PC3          PC4        PC5
## NID   -27.40722 -13.745073  21.755484  -7.09238147   3.427730
## NID.1  60.44502 -10.323684 -20.091148 -16.49059391  28.610291
## NID.2  67.86063  -2.049866  -5.840748   0.03206066  -4.409519
## NID.3  -5.69328  41.378405 -16.466891 -18.00896080  -5.135397
## NID.4 -25.59922 -22.852422  22.133902 -18.13982412 -13.619485
## NID.5 -39.83765  -8.461881 -50.897837   9.51198824   3.981510
head(pca$rotation[,1:5]) ## PC1 PC2 PC3 PC4 PC5 ## 34 -0.018481790 0.0024852364 -0.005385593 0.002711500 -0.025291642 ## 36 0.003759772 -0.0010608658 -0.018658252 0.041998920 0.006653159 ## 41 -0.016103961 0.0002934259 -0.009930961 -0.008813419 -0.011689654 ## 44 0.021992983 -0.0073125000 0.005115505 0.015148515 0.001656177 ## 52 0.014524608 0.0258657287 0.015226449 -0.009626087 -0.001871263 ## 62 -0.001791174 0.0081258050 -0.016757250 0.001834450 -0.009573289 # Enrichment for each component l <- Egambia$GENE_SYMBOL
encfunc <- function(r) {
o <- order(abs(r), decreasing=TRUE)
tmodCERNOtest(l[o])
}
res <- apply(pca$rotation[,1:10], 2, encfunc) head(res[[4]]) ## ID Title cerno N1 AUC cES P.Value adj.P.Val ## LI.M37.0 LI.M37.0 immune activation - generic cluster 454.88172 100 0.7188785 2.274409 6.749965e-22 2.335488e-19 ## LI.M11.0 LI.M11.0 enriched in monocytes (II) 118.06755 20 0.7734395 2.951689 1.234985e-09 2.136524e-07 ## LI.M165 LI.M165 enriched in activated dendritic cells (II) 101.09999 19 0.7562436 2.660526 1.234533e-07 1.423828e-05 ## LI.M37.1 LI.M37.1 enriched in neutrophils (I) 77.04015 12 0.8671929 3.210006 1.789217e-07 1.547673e-05 ## LI.M16 LI.M16 TLR and inflammatory signaling 50.11235 5 0.9923252 5.011235 2.545057e-07 1.761180e-05 ## LI.M75 LI.M75 antiviral IFN signature 67.61164 10 0.9007267 3.380582 4.448287e-07 2.565179e-05 # Visualization tmodPanelPlot(res, filter.empty.rows=TRUE) # Genes with positive / negative weights? qfnc <- function(r) quantile(r, 0.75) qqs <- apply(pca$rotation[,1:10], 2, qfnc)
pie <- tmodDecideTests(l, lfc=pca\$rotation[,1:10], lfc.thr=qqs)
tmodPanelPlot(res, pie=pie, pie.style="rug", grid="between")

# Functional Principal Component Analysis (PCA)

In PCA, the $$N \times K$$ matrix $$\mathbf{X}$$ of $$N$$ samples and $$K$$ variables (e.g. genes) is rotated, which results in a new matrix, $$\mathbf{Y}$$, with $$N$$ samples and $$J$$ principal components (PCs).

Effectively, a $$K \times J$$ matrix $$\mathbf{W}$$ is calculated, such that

$\mathbf{X}\times \mathbf{W} = \mathbf{Y}$

$\mathbf{X}\times \mathbf{W} = \mathbf{Y}$

Each column of $$\mathbf{X}$$ is a principal component. Each row corresponds to one sample.

A value for a given PC $$j$$ and a given sample $$n$$ is calculated as

$y_{n,j} = \sum_{k=1}^{K} w_{j,k}\cdot x_{k,n}$

The terms $$w_{j,k}$$ are variable- (or: gene-) specific weights or loadings for each component $$j$$.

$y_{(n,j)} = \sum_{k=1}^{K} w_{k,j}\cdot x_{k,n}$

The larger the absolute value of $$w_{k,j}$$, the more impact this gene has on the $$j$$-th principal component.

We can sort the genes by their weight in a component. Since as a result we get a sorted list of genes, we can apply a continuous enrichment algorithm.