Bioinformatics

Last update: 04 Oct 2002 14:31
First version:

An ugly name. The use of computation-intensive techniques to study biological data, especially data generated from sequencing long macromolecules (chromosomal DNA, proteins, etc.) or otherwise related to them.

Now, it so happens that I've written a whole dissertation about computation-intensive techniques for discovering patterns in sequences... This notebook will contain more, when I have more to put in it.

Things to look into: State of the art of using hidden Markov models (seems poor, frankly). Using grammatical inference to characterize sequence families, regulatory motifs, etc. Inferring metabolic or regulatory structure from large-scale expression data, especially gene chip data. Massaging gene chip data. Characterizing membrane proteins and their activity.

I've just heard that some people are using hidden Markov models to characterize gene-chip data at a single time, using some odd mapping of different genes into a serial order. This seems absurd to me, but I have it from a reliable source. If people are really doing that, there's a much better alternative easily available, namely using graphical models. Memo to self: investigate, and if there's a niche, publish!
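To make the graphical-models alternative a little more concrete, here is a minimal sketch. It is not anyone's published method for gene-chip data: the Chow-Liu tree is simply one easy graphical model picked for illustration, and the gene names, the binary discretization and the toy numbers are all made up. The point is that the structure is estimated from pairwise mutual information between genes, so no serial ordering of the genes is ever needed.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def chow_liu_tree(samples):
    """Return the edges of a maximum-mutual-information spanning tree.

    `samples` maps a gene name to its list of discretized expression
    levels, one entry per array or condition (a hypothetical layout).
    """
    genes = sorted(samples)
    # Score every pair of genes, strongest dependence first.
    weighted = sorted(
        ((mutual_information(samples[a], samples[b]), a, b)
         for a, b in combinations(genes, 2)),
        reverse=True,
    )
    parent = {g: g for g in genes}  # union-find forest for Kruskal's algorithm

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    edges = []
    for mi, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it joins two separate components
            parent[ra] = rb
            edges.append((a, b, mi))
    return edges


# Made-up data: four genes measured on eight arrays, discretized to 0/1.
toy = {
    "geneA": [0, 0, 1, 1, 0, 1, 0, 1],
    "geneB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to geneA
    "geneC": [1, 1, 0, 0, 1, 0, 1, 1],
    "geneD": [0, 1, 0, 1, 0, 1, 0, 1],
}
tree = chow_liu_tree(toy)
for a, b, mi in tree:
    print(f"{a} -- {b}  (MI = {mi:.3f} nats)")
```

Because the dictionary keys are only labels, shuffling the genes leaves the fitted tree unchanged, which is exactly what a serial-order HMM cannot promise. Real expression data would need a defensible discretization (or a continuous analogue) and far more samples than this toy has.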

Baldi and Brunak, Bioinformatics: The Machine Learning Approach
Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow and Terence P. Speed, "Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments," UCB Statistics Technical Report 578 [Abstract]
Neal S. Holter, Madhusmita Mitra, Amos Maritan, Marek Cieplak, Jayanth R. Banavar and Nina V. Fedoroff, "Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity," PNAS 97: 8409--8414

Sven Bergmann, Jan Ihmels and Naama Barkai, "Iterative signature algorithm for the analysis of large-scale gene expression data," Physical Review E 67 (2003): 031902
Bower and Bolouri (eds.), Computational Modeling of Genetic and Biochemical Networks
A. J. Butte and I. S. Kohane, "Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements" [online]
M. Caselle, F. Di Cunto, M. Pellegrino and P. Provero, "Finding regulatory sites from statistical analysis of nucleotide frequencies in the upstream region of eukaryotic genes," physics/0201033
Josh M. Deutsch, "Algorithm for Finding Optimal Gene Sets in Microarray Prediction," physics/0108011
Eytan Domany, "Cluster Analysis of Gene Expression Data," physics/0206056
R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Richard Durrett, Probability Models for DNA Sequence Evolution
Warren Ewens and Gregory Grant, Statistical Methods in Bioinformatics: An Introduction
Luca Ferraro, Andrea Giansanti, Giovanni Giuliano and Vittorio Rosato, "Co-expression of statistically over-represented peptides in proteomes: a key to phylogeny?", q-bio.MN/0410011
Gad Getz, Hilah Gal, Itai Kela, Eytan Domany and Dan A. Notterman, "Coupled Two-Way Clustering Analysis of Breast Cancer and Colon Cancer Gene Expression Data," physics/0206060
Gad Getz, Michele Vendruscolo, David Sachs and Eytan Domany, "Automated assignment of SCOP and CATH protein structure classification from FSSP scores," cond-mat/0102280
Alexander N. Gorban's Home Page at Northeastern University
Alexander N. Gorban, Andrey Yu. Zinovyev and Tatyana G. Popova, "Self-organizing Approach for Automated Gene Identification in Whole Genomes," physics/0108016 [Fuller version online]
Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
Alexander K. Hartmann, "Sampling rare events: statistics of local sequence alignments," cond-mat/0108201
Lenwood S. Heath, Naren Ramakrishnan, Ronald R. Sederoff, Ross W. Whetten, Boris I. Chevone, Craig A. Struble, Vincent Y. Jouenne, Dawei Chen, Leonel van Zyl and Ruth G. Alscher, "The Expresso Microarray Experiment Management System: The Functional Genomics of Stress Responses in Loblolly Pine," cs.OH/0110047
Trinh Xuan Hoang, Marek Cieplak, Jayanth R. Banavar and Amos Maritan, "Prediction of Protein Secondary Structures From Conformational Biases," cond-mat/0201311
Rui Hu and Bin Wang, "Statistically Significant Strings are Related to Regulatory Elements in the Promoter Regions of Saccharomyces cerevisiae," physics/0009002
Thomas B. Kepler, Lynn Crosby and Kevin T. Morgan, "Normalization and Analysis of DNA Microarray Data by Self-Consistency and Local Regression," SFI Working Paper 00-09-055
Cyril Laboulais, Mohammed Ouali, Marc Le Bret and Jacques Gabarro-Arpa, "Hamming distance geometry of a protein conformational space. Application to the clustering of a 4 ns molecular dynamics trajectory of the HIV-1 integrase catalytic core," physics/0110067
Ming Li, Xin Li, Bin Ma and Paul Vitanyi, "Normalized Information Distance and Whole Mitochondrial Genome Phylogeny Analysis," cs.CC/0111054 [Pardon me if I don't exactly swoon over approximations to intrinsically uncomputable distance measures]
Wentian Li
- "DNA Segmentation as A Model Selection Process," physics/0104027
- "New stopping criteria for segmenting DNA sequences," physics/0104026
- "Zipf's Law in Importance of Genes for Cancer Classification Using Microarray Data," physics/0104028
Wentian Li, Fengzhu Sun and Ivo Grosse, "Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression", q-bio.QM/0403038
Wentian Li and Yaning Yang, "How Many Genes Are Needed for a Discriminant Microarray Data Analysis?" physics/0104029
Christopher Loose, Kyle Jensen, Isidore Rigoutsos and Gregory Stephanopoulos, "A linguistic model for the rational design of antimicrobial peptides", Nature 443 (2006): 867--869
Felix Naef, Daniel A. Lim, Nila Patil and Marcelo O. Magnasco, "From Features to Expression: High-Density Oligonucleotide Array Analysis Revisited," physics/0102010
Felix Naef, Nicholas D. Socci, and Marcelo Magnasco, "Extracting more signal at high intensities in oligonucleotide arrays," physics/0205031
Jerome K. Percus, Mathematics of Genome Analysis
Pavel A. Pevzner, Computational Molecular Biology: An Algorithmic Approach
Y. Sakakibara, "Grammatical Inference in Bioinformatics", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1051--1062
David Sankoff and Joseph Kruskal (eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison
Federico Mattia Stefanini, "Identification of Highly Informative Molecular Profile Components Using Genetic Algorithms," SFI Working Paper 98-05-42
James Tisdall, Beginning Perl for Bioinformatics
Erik van Nimwegen, Mihaela Zavolan, Nikolaus Rajewsky and Eric D. Siggia, "Probabilistic Clustering of Sequences: Inferring new bacterial regulons by comparative genomics," physics/0206045
Jean-Philippe Vert, "Kernel methods in genomics and computational biology", q-bio.QM/0510032
Jean-Philippe Vert and Minoru Kanehisa, "Graph-driven features extraction from microarray data," physics/0206055
Jason L. T. Wang, Bruce A. Shapiro and Dennis Shasha (eds.), Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications
Chris Wiggins and Ilya Nemenman, "Process Pathway Inference via Time Series Analysis," physics/0206031
Andrey Zinovyev (any relation to that Zinoviev?), Genome Visualization Tools

Kristina Lisa Shalizi, CRS, Walter Fontana, "Pattern Discovery in Artificially Evolved RNA Sequences"