Worldwide 9.6 million new cases and 1.4 million deaths annualy; almost 1/3 of the worlds population may be infected.

- Custom variable sets (expression modules, metabolic profiling)
- Statistical test(s):
- not relying on arbitrary thresholds
- preferably no bootstrapping
- detached from differential analysis
- pluggable into ML and MDS (e.g. PCA)

- Integrating in our R pipelines
- Highly flexible – diverse projects with different problem settings
- “Bird’s eye” visualizations for multiple analyses

(xkcd.com/927)

“Although there is no particular novelty in the methods, the package addresses the right questions and appears to do a good job on real biological analysis.” (Anonymous reviewer)

– Perfect!

- CERNO: a new(-ish) statistical test for continuous enrichments (Yamaguchi et al. 2008, not implemented elsewhere)
- MSD: a new method for ordering genes
- “panel plots” – visualisation of (serial) gene set enrichment results
- prepackaged gene sets from Chaussabel et al. (2008) and Li et al. (2014) and metabolic profiling clustering from Weiner et al. (2012)
- facilities for combining multivariate techniques / dimensionality reduction techniques / correlation analysis / machine learning with enrichment analysis

Some enrichment tests (such as the hypergeometric test) rely on arbirtrary tresholds to divide the genes into “differentially expressed” and “background” (or equivalent sets). It is easy to run a statistical test on such a setup, however it is problematic: the number of significantly regulated genes depends on the statistical power, i.e. for example on the number of samples.

Better tests yet are independent of arbitrary thresholds. Examples include

- Randomization approaches (such as GSEA)
- ANOVA-like approaches
- Mann-Whitney U statistic

However, all methods that we have tested were problematic (eg. too conservative; too lax). Most importantly, while results were robust for large sample sets, they were not “down-scalable”: results for small sample sizes rarely reflected the results for large sample sizes.

\[P(rank(g_j) < rank(g_i)) = \frac{rank(g_i)}{N}\]

Where \(N\) is the total number of genes.

\[\mathbf{CERNO} = -2\cdot \sum_{i=1}^{N}\ln(\frac{rank(g_i)}{N})\]

The statistics has a \(\chi^2\) distribution with \(2\cdot N\) degrees of freedom.

First, second and third quartiles of number of modules recovered by the different statistical tests in dependence of the sample size in 100 random sample replicates.

- Order by p-values (common approach).
- Genes with strong expression tend to have lower p-values even if log-fold changes are small

- Order by (absolute) log fold change
- Genes with weak expression (near background) can have huge log fold changes despite lack of significance

\[ \text{MSD} = \begin{cases} CI.L & \text{if logFC} > 0\\ -CI.R & \text{if logFC} < 0\\ \end{cases} \]

First, second and third quartiles of number of modules recovered by the different approaches in dependence of the sample size in 100 random sample replicates.

In comparisons with other ordering techniques, MSD has an exceptional specificity, while maintaining good sensitivity (Joanna Żyła, personal communication).

In an U-test, the U statistic is (almost) the same as the Area Under Curve:

\[r=1-\frac{2\cdot U}{n_1\cdot n_2} = 1 - 2\cdot\text{AUC}\]

(r is the effect size for an U-test)

`evidencePlot(l=tt$GENE_SYMBOL, m="LI.M11.0")`

Combination of multivariate techniques such as PCA and functional enrichment analysis can circumvent the need for analysis of differential expression. A primer on FMA will be presented here.

Question in FMA: *What do these components mean?*

This approach works well also with other multivariate analyses such as independent component analysis (ICA), partial least squares (PLS) or correspondance analysis (CA).

Directly combining multivariate analyses with gene set enrichment allows us to achieve the same results without involving a direct group - to - group comparison. This makes it especially suitable for exploratory analyses.

- Gene set enrichment analysis is a versatile tool for functional annotation
- Functional multivariate analysis can replace differential expression analysis
*tmod*: R package for BTM and GS enrichment analysis, available from http://bioinfo.mpiib-berlin.mpg.de/tmod/ and CRAN*tmod*allows functional multivariate analysis and serial enrichment analyiss- features several visualization tools
- Where to find me: MPIIB january@mpiib-berlin.mpg.de

- Teresa Domaszewska (MPIIB) – co-author (see our poster)

We would like to thank Emilio Siena (GSK Vaccines) for new ideas, as well Joanna Żyła, Michał Marczyk and Joanna Polańska (Politechnika Śląska) for helpful discussions and the BMBF / InfectControl2020 for supporting Teresa Domaszewska.

This is an Rmarkdown document; you can view the code on http://bioinfo.mpiib-berlin.mpg.de/tmod/

You can download the source code of this presentation on the tmod web page, http://bioinfo.mpiib-berlin.mpg.de/tmod/.

To recreate this presentation, download the full presentation package and unzip it. Install the required packages (knitr for R and pandoc). Run the following command from inside the package archive.

Commands:

```
Rscript -e 'knitr::knit("weiner_bioinfo_2015_06_23.Rmd")'
pandoc -s -S -t revealjs weiner_bioinfo_2015_06_23.md -o weiner_bioinfo_2015_06_23.html \
--mathjax='http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' \
--css css/mytheme.css \
--slide-level 2 -V theme=blood
```

(Note: for an offline version, download MathJax and modify the –mathjax option)

To extract the code from this presentation, save it as “test.Rmd” and run

`Rscript -e 'knitr::purl("tmod.Rmd")'`

This only works properly in Google Chromium; see reveal.js documentation

To print, follow the link below and press Ctrl-P; don’t worry if the slides appear to overlap – they will look fine on the print preview.

*tmod* has been designed as a package for testing the enrichment of blood transcriptional modules. Therefore, *tmod* contains two sets of blood transcriptional module definitions; however, it can be used with any arbitrary gene set definition (e.g. GSEA/MSigDB) or high throughput data type (e.g. metabolomics)

*tmod* implements HG / U / CERNO tests, functional multivariate analyses, serial analysis / visualization and more.

Availability: http://bioinfo.mpiib-berlin.mpg.de/tmod/

Data set Egambia: GEO GSE28623.

```
library(tmod)
data(Egambia)
head(Egambia)
```

```
## GENE_SYMBOL GENE_NAME EG NID NID NID NID NID NID NID NID
## 34 C19orf15 chromosome 19 open reading frame 15 57828 3.2618218 1.339813 2.462880 3.9795714 4.626591 4.253879 2.6972224 3.804575
## 36 UNQ9368 RTFV9368 643036 1.5671748 4.952589 4.065492 4.6410409 3.207639 5.552805 4.1901659 3.257324
## 41 ADORA3 adenosine A3 receptor 140 6.2246027 5.078945 3.862936 6.1976618 5.410801 5.794503 5.0215809 8.017599
## 44 CDH6 cadherin 6, type 2, K-cadherin (fetal kidney)" 1004 0.8328559 3.585615 2.975072 0.6475284 1.222587 0.143559 -0.7720002 0.615626
## 52 VASH1 vasohibin 1 22846 11.3952226 12.098981 13.057666 13.0795387 11.182162 10.747176 12.9036129 12.261216
## 62 MAB21L2 mab-21-like 2 (C. elegans) 10586 5.7530317 7.517327 6.777747 7.0907261 6.614020 10.917492 8.9160131 5.323171
## NID NID NID NID NID NID NID TB TB TB TB TB TB TB TB
## 34 4.617986 3.033595 3.1866326 3.6506719 3.787375 3.019342 2.795293 3.020139 2.269359 3.325500 4.0871459 4.723224 3.599277 1.571850 3.6313736
## 36 4.786995 3.091925 2.2736422 4.1327518 3.934754 3.077131 6.428547 4.655419 6.460367 4.607918 4.7004430 5.967757 5.641816 5.006141 5.2302712
## 41 6.878103 4.702415 7.6848512 5.2048066 4.836591 4.965997 8.234983 5.072141 4.356401 7.043360 7.3670729 5.805350 6.862796 4.134708 5.5752895
## 44 2.589377 3.307486 0.7026353 0.8349973 3.951534 2.112500 1.223633 1.477250 5.159376 1.517965 0.4115781 1.584342 1.148472 1.710048 0.8883513
## 52 11.376962 13.061029 13.0915988 12.0304966 11.980200 12.323327 11.076847 13.187186 13.122720 12.188427 10.4137276 11.128346 10.889193 11.062450 12.9295464
## 62 7.167419 6.299295 5.8910289 5.4252899 5.265659 6.367774 6.691451 6.351070 6.124325 5.434126 5.5917154 7.298053 6.607853 5.594411 5.2920568
## TB TB TB TB TB TB TB
## 34 3.962293 2.080173 3.750405 2.248475 4.148280 4.203384 4.319223
## 36 5.551801 5.021816 5.338259 6.258222 6.383069 5.995486 5.203686
## 41 7.689227 6.004437 5.928957 5.178725 5.661376 6.611350 6.008429
## 44 0.977040 1.174805 1.764985 3.400484 2.486234 1.115761 1.454525
## 52 11.921874 10.830077 11.106051 13.062118 10.608390 12.421606 13.269083
## 62 11.736111 5.887298 5.672741 6.621244 4.729986 5.288187 8.995319
```

```
pca <- prcomp(t(Egambia[,-c(1:3)]), scale.=TRUE)
names(pca)
```

`## [1] "sdev" "rotation" "center" "scale" "x"`

`head(pca$x[,1:5])`

```
## PC1 PC2 PC3 PC4 PC5
## NID -27.40722 -13.745073 21.755484 -7.09238147 3.427730
## NID.1 60.44502 -10.323684 -20.091148 -16.49059391 28.610291
## NID.2 67.86063 -2.049866 -5.840748 0.03206066 -4.409519
## NID.3 -5.69328 41.378405 -16.466891 -18.00896080 -5.135397
## NID.4 -25.59922 -22.852422 22.133902 -18.13982412 -13.619485
## NID.5 -39.83765 -8.461881 -50.897837 9.51198824 3.981510
```

`head(pca$rotation[,1:5])`

```
## PC1 PC2 PC3 PC4 PC5
## 34 -0.018481790 0.0024852364 -0.005385593 0.002711500 -0.025291642
## 36 0.003759772 -0.0010608658 -0.018658252 0.041998920 0.006653159
## 41 -0.016103961 0.0002934259 -0.009930961 -0.008813419 -0.011689654
## 44 0.021992983 -0.0073125000 0.005115505 0.015148515 0.001656177
## 52 0.014524608 0.0258657287 0.015226449 -0.009626087 -0.001871263
## 62 -0.001791174 0.0081258050 -0.016757250 0.001834450 -0.009573289
```

```
l <- Egambia$GENE_SYMBOL
encfunc <- function(r) {
o <- order(abs(r), decreasing=TRUE)
tmodCERNOtest(l[o])
}
res <- apply(pca$rotation[,1:10], 2, encfunc)
head(res[[4]])
```

```
## ID Title cerno N1 AUC cES P.Value adj.P.Val
## LI.M37.0 LI.M37.0 immune activation - generic cluster 454.88172 100 0.7188785 2.274409 6.749965e-22 2.335488e-19
## LI.M11.0 LI.M11.0 enriched in monocytes (II) 118.06755 20 0.7734395 2.951689 1.234985e-09 2.136524e-07
## LI.M165 LI.M165 enriched in activated dendritic cells (II) 101.09999 19 0.7562436 2.660526 1.234533e-07 1.423828e-05
## LI.M37.1 LI.M37.1 enriched in neutrophils (I) 77.04015 12 0.8671929 3.210006 1.789217e-07 1.547673e-05
## LI.M16 LI.M16 TLR and inflammatory signaling 50.11235 5 0.9923252 5.011235 2.545057e-07 1.761180e-05
## LI.M75 LI.M75 antiviral IFN signature 67.61164 10 0.9007267 3.380582 4.448287e-07 2.565179e-05
```

`tmodPanelPlot(res, filter.empty.rows=TRUE)`

```
qfnc <- function(r) quantile(r, 0.75)
qqs <- apply(pca$rotation[,1:10], 2, qfnc)
pie <- tmodDecideTests(l, lfc=pca$rotation[,1:10], lfc.thr=qqs)
tmodPanelPlot(res, pie=pie, pie.style="rug", grid="between")
```

In PCA, the \(N \times K\) matrix \(\mathbf{X}\) of \(N\) samples and \(K\) variables (e.g. genes) is rotated, which results in a new matrix, \(\mathbf{Y}\), with \(N\) samples and \(J\) principal components (PCs).

Effectively, a \(K \times J\) matrix \(\mathbf{W}\) is calculated, such that

\[ \mathbf{X}\times \mathbf{W} = \mathbf{Y} \]

\[ \mathbf{X}\times \mathbf{W} = \mathbf{Y} \]

Each column of \(\mathbf{X}\) is a principal component. Each row corresponds to one sample.

A value for a given PC \(j\) and a given sample \(n\) is calculated as

\[ y_{n,j} = \sum_{k=1}^{K} w_{j,k}\cdot x_{k,n} \]

The terms \(w_{j,k}\) are variable- (or: gene-) specific *weights* or *loadings* for each component \(j\).

\[ y_{(n,j)} = \sum_{k=1}^{K} w_{k,j}\cdot x_{k,n} \]

The larger the absolute value of \(w_{k,j}\), the more impact this gene has on the \(j\)-th principal component.

We can sort the genes by their weight in a component. Since as a result we get a sorted list of genes, we can apply a continuous enrichment algorithm.