Two-sample testing in high dimensions
Estimation in high-dimensional models has been widely studied, but uncertainty quantification remains challenging. We put forward a novel methodology for two-sample testing in high dimensions (Städler and Mukherjee, JRSSB, 2016). The key idea is to exploit sparse structure both in the construction of the test statistic and in the p-value calculation. This makes the test effective but raises challenging technical issues, which we solve with new theory that extends the likelihood ratio test to the high-dimensional setting. For computation we use randomized data-splitting: the sparsity structure is estimated on the first half of the data, and the p-value is calculated on the second half. P-values from multiple splits are then aggregated to give a final result. The test is general and applies to any model class in which sparse estimation is possible. We call its application to graphical models Differential Network.

Our method is implemented in the recently released Bioconductor package nethet. Besides code for high-dimensional testing, the package provides other tools for exploring heterogeneity in high-dimensional data. For example, it makes a novel network-based clustering algorithm available and offers several visualization functions.

Molecular networks play a central role in biology. An emerging notion is that the networks themselves differ between biological contexts, such as cell type, tissue type, or disease state. As an example, we consider protein data from The Cancer Genome Atlas. Applied to this data set, Differential Network provides evidence, across thousands of patient samples, that cancers differ at the protein network level.
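The randomized data-splitting scheme described above can be illustrated with a minimal sketch. This is not the nethet implementation and does not use the high-dimensional likelihood ratio test. As a stand-in it assumes a much simpler setting: the two samples may differ in a few coordinate means, screening keeps the k coordinates with the largest mean difference on the first half, and the second half is tested with per-coordinate z-tests (normal approximation) and a Bonferroni correction. The aggregation rule (the gamma-quantile of the per-split p-values divided by gamma, capped at 1) is likewise an assumed choice. All function names and parameters are hypothetical.

```python
import math
import random
import statistics


def column(rows, j):
    """Return coordinate j of every observation."""
    return [r[j] for r in rows]


def split_pvalue(x, y, k, rng):
    """One random split: screen on the first halves, test on the second halves."""
    xs, ys = x[:], y[:]
    rng.shuffle(xs)
    rng.shuffle(ys)
    xa, xb = xs[: len(xs) // 2], xs[len(xs) // 2:]
    ya, yb = ys[: len(ys) // 2], ys[len(ys) // 2:]

    # Screening step (assumed stand-in for sparse estimation): keep the k
    # coordinates with the largest absolute mean difference on the first halves.
    p = len(x[0])
    diff = [abs(statistics.fmean(column(xa, j)) - statistics.fmean(column(ya, j)))
            for j in range(p)]
    selected = sorted(range(p), key=lambda j: diff[j], reverse=True)[:k]

    # Testing step on the held-out halves: two-sided z-test per selected
    # coordinate, then Bonferroni over the k selected coordinates.
    # Assumes each half has at least two observations and nonzero variance.
    pvals = []
    for j in selected:
        a, b = column(xb, j), column(yb, j)
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        z = (statistics.fmean(a) - statistics.fmean(b)) / se
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))  # equals 2 * (1 - Phi(|z|))
    return min(1.0, len(selected) * min(pvals))


def aggregate(pvals, gamma=0.5):
    """Quantile aggregation: min(1, gamma-quantile of p / gamma)."""
    scaled = sorted(pv / gamma for pv in pvals)
    idx = max(0, math.ceil(gamma * len(scaled)) - 1)  # empirical gamma-quantile
    return min(1.0, scaled[idx])


def multisplit_test(x, y, k=5, n_splits=50, seed=0):
    """Aggregate p-values over n_splits random splits of both samples."""
    rng = random.Random(seed)
    return aggregate([split_pvalue(x, y, k, rng) for _ in range(n_splits)])
```

Because screening and testing use disjoint halves, the choice of coordinates does not bias the p-value computed on the held-out data. Repeating the split and aggregating reduces the dependence of the final result on any single random partition.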