Gene Expression Data Analysis
14 Apr 2022 23:28
I won't try to explain what gene expression is or why it's important here (see Signal Transduction, Gene Expression, and Control of Metabolism instead). This notebook is to collect references on the analysis of gene expression data, particularly using statistical or machine learning techniques. I'm especially interested in methods of recovering the structure of the regulatory network from data, e.g. through application of graphical model techniques.
Aggregation. The important papers by Chu et al. and by Wimberly et al. (see below) reveal a major obstacle in the way of hopes for using graphical model methods. This is that the data in gene expression experiments is typically obtained not from one cell but from many hundreds or thousands, and the conditional independence relations that graphical models seek to determine are not, in general, preserved under such aggregation. (Chu et al. develop this point theoretically, and Wimberly et al. show that existing structure-learning methods fail on aggregated data from reasonable simulation models.) Having only just read the papers, I'm not clear on where this leaves us. One approach, which perhaps betrays my background as a physicist, would be to try to artificially synchronize the cells before measuring expression levels. More subtle and statistical approaches may be possible. Clearly, a very significant issue. (Thanks to Tom Heiman for letting me know about these papers.)
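To make the aggregation problem concrete, here is a minimal simulation sketch. It is a toy model of my own devising, not one taken from Chu et al. or Wimberly et al.: within each cell a hypothetical regulator y (taking values 0, 1 or 2) drives two targets x and z through the same nonlinear response y**2, so x and z are independent given y in every single cell. The function name `stratified_corr`, the response function, the noise level and the cell counts are all illustrative assumptions. The check is crude: it averages the correlation of x and z within each level of y, first on single cells and then on per-sample sums.

```python
import numpy as np

def stratified_corr(x, y, z, min_count=30):
    """Sample-size-weighted mean of corr(x, z) within each level of discrete y.

    Levels of y with fewer than min_count observations are skipped.
    """
    total, weight = 0.0, 0
    for level in np.unique(y):
        mask = y == level
        n = int(mask.sum())
        if n < min_count:
            continue
        total += n * np.corrcoef(x[mask], z[mask])[0, 1]
        weight += n
    return total / weight

rng = np.random.default_rng(0)
n_samples, n_cells, noise_sd = 20_000, 10, 0.5  # illustrative values only

# Single-cell model (assumed): regulator y in {0, 1, 2}; targets x and z each
# respond to y**2 with independent Gaussian noise, so x and z are
# conditionally independent given y within every cell.
y = rng.integers(0, 3, size=(n_samples, n_cells))
x = y**2 + rng.normal(0.0, noise_sd, size=y.shape)
z = y**2 + rng.normal(0.0, noise_sd, size=y.shape)

# Within-stratum correlation of x and z at the single-cell level.
single_cell = stratified_corr(x.ravel(), y.ravel(), z.ravel())

# The same check after summing each variable over the cells in a sample,
# which is what a bulk measurement would record.
aggregated = stratified_corr(x.sum(axis=1), y.sum(axis=1), z.sum(axis=1))

print(f"single-cell corr(x, z | y):          {single_cell:.3f}")
print(f"aggregated  corr(sum x, sum z | sum y): {aggregated:.3f}")
```

In this setup the single-cell figure should sit near zero while the aggregated one is clearly positive: fixing the summed regulator level does not fix how many cells sit at each level, and that hidden composition moves both summed targets together. If the response were linear in y the effect would vanish, which is at least consistent with the point that nonlinearity is what makes aggregation harmful.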
See also: Biochemical Network Evolution; Bioinformatics; Complex Networks; Molecular Biology
- Recommended (PNAS = Proceedings of the National Academy of Sciences (USA)):
- Genevera I. Allen, Zhandong Liu, "A Log-Linear Graphical Model for Inferring Genetic Networks from High-Throughput Sequencing Data", arxiv:1204.3941
- Ziv Bar-Joseph, "Analyzing time series gene expression data", Bioinformatics 20 (2004): 2493--2503
- Allister Bernard and Alexander J. Hartemink, "Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types of Data" [An interesting approach, but I think partial observability is going to be a killer here. PDF preprint]
- Tianjiao Chu, Clark Glymour, Richard Scheines and Peter Spirtes, "A Statistical Problem for Inference to Regulatory Structure from Associations of Gene Expression Measurements with Microarrays", Bioinformatics 19 (2003): 1147--1152
- Patrik D'haeseleer, Reconstructing Gene Networks from Large Scale Gene Expression Data, Ph.D. thesis, University of New Mexico, 2000 [on-line]
- Scott Gaffney and Padhraic Smyth, "Joint Probabilistic Curve Clustering and Alignment" in NIPS 2004 [PDF preprint. General, but intended in the first instance for gene expression data.]
- Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff and Jayanth R. Banavar, "Dynamic Modeling of Gene Expression Data," cond-mat/0102267 = PNAS 98 (2001): 1693--1698
- Neal S. Holter, Madhusmita Mitra, Amos Maritan, Marek Cieplak, Jayanth R. Banavar and Nina V. Fedoroff, "Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity," PNAS 97 (2000): 8409--8414 [abstract]
- Adrián López García de Lomana, Qasim K. Beg, G. de Fabritiis and Jordi Villà-Freixa, "Statistical Analysis of Global Connectivity and Activity Distributions in Cellular Networks", arxiv:1004.3138
- Manuel Middendorf, Etay Ziv and Chris Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network", q-bio.QM/0408010 [Commented on under Complex Networks]
- Chris J. Oates and Sach Mukherjee, "Network Inference and Biological Dynamics", Annals of Applied Statistics 6 (2012): 1209--1235, arxiv:1112.1047
- Frank C. Wimberly, Thomas Heiman, Joseph Ramsey and Clark Glymour, "Experiments on the Accuracy of Algorithms for Inferring the Structure of Genetic Regulatory Networks from Microarray Expression Levels" [PDF preprint]
- To read:
- Pierre Baldi and G. Wesley Hatfield, DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling
- Ziv Bar-Joseph, Georg Gerber, Itamar Simon, David K. Gifford, and Tommi S. Jaakkola, "Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes", PNAS 100 (2003): 10146--10151
- Grégory Batt, Michel Page, Irene Cantone, Gregor Goessler, Pedro T. Monteiro, Hidde De Jong, "Efficient parameter search for qualitative models of regulatory networks using symbolic model checking", arxiv:1005.2107
- Sven Bergmann, Jan Ihmels and Naama Barkai, "Iterative signature algorithm for the analysis of large-scale gene expression data," Physical Review E 67 (2003): 031902
- David R. Bickel, "Selecting an optimal rejection region for multiple testing: A decision-theoretic alternative to FDR control, with an application to microarrays," math.PR/0212028
- D. R. Bickel, Z. Montazeri, P.-C. Hsieh, M. Beatty, S. J. Lawit, N. J. Bate, "Gene network reconstruction from transcriptional dynamics under kinetic model uncertainty: a case for the second derivative", Bioinformatics 25 (2009): 772--779, arxiv:0710.4127
- Justin Bleich, Adam Kapelner, Edward I. George, Shane T. Jensen, "Variable selection for BART: An application to gene regulation", Annals of Applied Statistics 8 (2014): 1750--1781, arxiv:1310.4887
- Anne-Laure Boulesteix and Gerhard Tutz, "Identification of interaction patterns and classification with applications to microarray data", Computational Statistics and Data Analysis 50 (2006): 783--802 [PDF reprint]
- Peter M. Bowers, Shawn J. Cokus, David Eisenberg and Todd O. Yeates, "Use of Logic Relationships to Decipher Protein Network Organization", Science 306 (2004): 2246--2249
- Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub and Jill P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization", PNAS 101 (2004): 4164--4169
- A. J. Butte and I. S. Kohane, "Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements" [online]
- Ramon Diaz-Uriarte, "A simple method for finding molecular signatures from gene expression data", q-bio.QM/0401043
- Ramon Diaz-Uriarte and Sara Alvarez de Andres, "Variable selection from random forests: application to gene expression data", q-bio.MN/0503025
- Diego di Bernardo, Michael J Thompson, Timothy S Gardner, Sarah E Chobot, Erin L Eastwood, Andrew P Wojtovich, Sean J Elliott, Scott E Schaus, and James J Collins, "Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks", Nature Biotechnology 23 (2005): 377--383
- Christopher J. Easley, James M. Karlinsey, Joan M. Bienvenue, Lindsay A. Legendre, Michael G. Roper, Sanford H. Feldman, Molly A. Hughes, Erik L. Hewlett, Tod J. Merkel, Jerome P. Ferrance, and James P. Landers, "A fully integrated microfluidic genetic analysis system with sample-in-answer-out capability", Proceedings of the National Academy of Sciences (USA) 103 (2006): 19272--19277
- Bradley Efron and Robert Tibshirani, "On testing the significance of sets of genes", math.ST/0610667
- David P. Enot, Manfred Beckmann, David Overy, and John Draper, "Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals", Proceedings of the National Academy of Sciences 103 (2006): 14865--14870 [Yes, the metabolome isn't, strictly, "gene expression"; so?]
- Lorenzo Farina and Ilaria Mogno, "A Fast Reconstruction Algorithm for Gene Networks", q-bio.QM/0401044 [Assuming the underlying system is a linear time-invariant network --- hah!]
- Nir Friedman, Long Cai and X. Sunney Xie, "Linking Stochastic Dynamics to Population Distribution: An Analytical Framework for Gene Expression", Physical Review Letters 97 (2006): 168302 [They seem to make some very strong probabilistic assumptions here, and possibly one of spatial uniformity across the cell as well. Think carefully about whether these are really required]
- Mika Gustafsson, Michael Hornquist and Anna Lombardi, "Large-scale reverse engineering by the Lasso", q-bio.MN/0403012
- Jean Hausser and Korbinian Strimmer, "Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks", Journal of Machine Learning Research 10 (2009): 1469--1484
- John Hertz
- "Statistical issues in reverse engineering of genetic networks"
- Mattias Wahde and JH, "Modeling Genetic Regulatory Dynamics in Neural Development"
- Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit, "More power via graph-structured tests for differential expression of gene networks", Annals of Applied Statistics 6 (2012): 561--600
- Gareth M. James, Chiara Sabatti, Nengfeng Zhou, and Ji Zhu, "Sparse regulatory networks", Annals of Applied Statistics 4 (2010): 663--686
- Shane T. Jensen, Guang Chen, Christian J. Stoeckert Jr, "Bayesian Variable Selection and Data Integration for Biological Regulatory Networks", math.ST/0610034
- C. Kendziorski, R. A. Irizarry, K.-S. Chen, J. D. Haag and M. N. Gould, "On the utility of pooling biological samples in microarray experiments", PNAS 102 (2005): 4252--4257 [Open access]
- R. Khanin, V. Vinciotti and E. Wit, "Reconstructing repressor protein levels from expression of gene targets in Escherichia coli", Proceedings of the National Academy of Sciences (USA) 103 (2006): 18592--18596
- Isaac S. Kohane, Alvin T. Kho and Atul J. Butte, Microarrays for an Integrative Genomics
- Nicole Kraemer, Juliane Schaefer, Anne-Laure Boulesteix, "Regularized estimation of large-scale gene association networks using graphical Gaussian models", arxiv:0905.0603
- Sophie Lèbre, "Inferring dynamic genetic networks with low order independencies", arxiv:0704.2551
- Su-In Lee, Dana Pe'er, Aimée M. Dudley, George M. Church and Daphne Koller, "Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification", Proceedings of the National Academy of Sciences (USA) 103 (2006): 14062--14067 [Open access]
- Michele Leone, Sumedha, Martin Weigt, "Clustering by soft-constraint affinity propagation: Applications to gene-expression data", arxiv:0705.2646
- Wentian Li, Young Ju Suh and Jingshan Zhang, "Does Logarithm Transformation of Microarray Data Affect Ranking Order of Differentially Expressed Genes?", q-bio.QM/0606018
- Adam A. Margolin, Ilya Nemenman, Chris Wiggins, Gustavo Stolovitzky and Andrea Califano, "On the Reconstruction of Interaction Networks with Applications to Transcriptional Regulation", q-bio.MN/0410036
- Adam A. Margolin, Ilya Nemenman, Katia Basso, Ulf Klein, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera and Andrea Califano, "ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context", q-bio.MN/0410037
- Vinicius Diniz Mayrink and Joseph Edward Lucas, "Sparse latent factor models with interactions: Analysis of gene expression data", Annals of Applied Statistics 7 (2013): 799--822
- Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund and Christina Leslie
- "Predicting Genetic Regulatory Response using Classification: Yeast Stress Response", q-bio.QM/0406016
- "Predicting Genetic Regulatory Response Using Classification", q-bio.MN/0411028 = Proceedings of the Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB 2004) I232--I240
- Radhakrishnan Nagarajan, Jane E. Aubin and Charlotte A. Peterson, "Modeling Genetic Networks from Clonal Analysis", q-bio.MN/0412047 = Journal of Theoretical Biology 230 (2004): 359--373
- Ilya Nemenman, "Information theory, multivariate dependence, and genetic network inference", q-bio.QM/0406015
- Bernhard O. Palsson, Systems Biology: Properties of Reconstructed Networks
- Giovanni Parmigiani, The Analysis of Gene Expression Data
- Ashoka D. Polpitiya, J. Perren Cobb and Bijoy K. Ghosh, "Genetic Regulatory Networks and Co-Regulation of Genes: A Dynamic Model Based Approach", pp. 291ff of Wijesuriya P. Dayawansa, Anders Lindquist, Yishao Zhou (eds.), New Directions and Applications in Control Theory (Springer-Verlag, 2005)
- Otto Pulkkinen, Johannes Berg, "Dynamics of gene expression under feedback", arxiv:0807.3521
- José M. Ranz and Carlos A. Machado, "Uncovering evolutionary patterns of gene expression using microarrays", Trends in Ecology and Evolution 21 (2006): 29--37
- Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger and Garry P. Nolan, "Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data", Science 308 (2005): 523--529 [This is protein interaction, not gene regulation, but still...]
- Areejit Samal, Olivier C. Martin, "Randomizing genome-scale metabolic networks", arxiv:1012.1473
- Robert Tibshirani and Larry Wasserman, "Correlation-sharing for detection of differential gene expression", math.ST/0608061
- Jean-Philippe Vert, Jian Qiu and William Stafford Noble, "Metric learning pairwise kernel for graph inference", q-bio.QM/0610040
- Kai Wang, Nilanjana Banerjee, Adam Margolin, Ilya Nemenman, Katia Basso, Riccardo Dalla Favera and Andrea Califano, "Conditional Network Analysis Identifies Candidate Regulator Genes in Human B Cells", q-bio.MN/0411003
- Chris Wiggins and Ilya Nemenman, "Process Pathway Inference via Time Series Analysis," physics/0206031
- Roy Wilds and Leon Glass, "Contrasting methods for symbolic analysis of biological regulatory networks", Physical Review E 80 (2009): 062902
- Johannes Wollbold, "Attribute Exploration of Discrete Temporal Transitions", q-bio/0701009
- Matthew A. Zapala and Nicholas J. Schork, "Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables", Proceedings of the National Academy of Sciences (USA) 103 (2006): 19430--19435
- Shu-Dong Zhang and Timothy W. Gant, "Effect of pooling samples on the efficiency of comparative studies using microarrays", q-bio.QM/0510024
- Etay Ziv, Manuel Middendorf and Chris Wiggins, "An Information-Theoretic Approach to Network Modularity", q-bio.QM/0411033