High-throughput genomic and proteomic profiling holds great promise for improving the early detection and diagnosis of many diseases and for helping to optimize therapeutic options for patients. The high-throughput nature of genomic and proteomic data, however, presents a number of computational difficulties. The number of potential biomarkers (genes, MS profile peaks) that discriminate between disease and control samples can be large, but only a few of these may carry a biologically significant signal. Identifying features (signals) that are likely to provide useful information related to disease, as well as the relations among these features, remains an important open research question. Our aim is to develop computational machine learning solutions that scale well to the high-dimensional data characteristic of bioinformatics data sources. Specific problems we are interested in include the discovery and validation of potential disease biomarkers, component (latent variable model) analysis of data, and the construction of multivariate classification models for early detection or diagnosis of diseases based on such data.
Former CS members:
X. Lu, M. Hauskrecht, R.S. Day. Modeling cellular processes with
variational Bayesian cooperative vector quantizer. In the Proceedings
of the Pacific Symposium on Biocomputing (PSB), 2004.
The paper develops a
new Bayesian latent variable framework for modeling high-dimensional continuous data using a low dimensional binary representation.
The model belongs to the family of factor analysis models and lets us automatically discover hidden binary components
that drive the expression of continuous high-dimensional signals in the data. The model can be used either to
represent complex multivariate probability densities, perform component (cluster) analysis,
or achieve dimensionality reduction. Since exact inference and learning in the model are hard, we have developed variational
approximation methods for both. We tested the model on two problems: the decomposition of overlapping image signals
and identification of possible hidden regulatory pathways (signals) that drive the expression of genes in the cell. The Stanford cell
cycle data were used in the experiment.
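The generative side of such a model can be illustrated with a minimal sketch, assuming a linear-Gaussian form in which each observation is the sum of the weight columns switched on by independent binary sources, plus Gaussian noise. The names sample_cvq, weights, priors and noise_std are illustrative and are not taken from the paper, and the variational Bayesian learning procedure itself is not shown.

```python
import numpy as np

def sample_cvq(weights, priors, noise_std, n_samples, rng):
    """Draw samples from a toy cooperative-vector-quantizer-style model.

    weights:   (d, k) array; column j is the continuous pattern of source j
    priors:    (k,) array; priors[j] = P(source j is on)
    noise_std: std. dev. of isotropic Gaussian noise (an assumed noise form)
    """
    d, k = weights.shape
    # Independent binary hidden sources, one row per sample.
    s = (rng.random((n_samples, k)) < priors).astype(float)
    # Active sources combine additively; noise is added to every dimension.
    x = s @ weights.T + noise_std * rng.normal(size=(n_samples, d))
    return s, x

rng = np.random.default_rng(0)
weights = rng.normal(size=(50, 3))   # 50 observed dimensions, 3 hidden sources
priors = np.array([0.2, 0.5, 0.8])
s, x = sample_cvq(weights, priors, noise_std=0.1, n_samples=200, rng=rng)
```

Here s is the low-dimensional binary representation and x the high-dimensional continuous signal; learning would go in the opposite direction, recovering weights and s from x alone.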
X. Lu, M. Hauskrecht, R.S. Day. Variational Bayesian learning of
the cooperative vector quantizer model. Part I: The Theory. Technical
Report, Center for Biomedical Informatics, CBMI-02-181, 2002.
Develops the theory for the Bayesian cooperative vector
quantizer model. It includes detailed derivations needed for the variational approximation.
M. Valko, R. Pelikan and M. Hauskrecht. Learning predictive models for multiple heterogeneous proteomic data sources. In the Proceedings of the Summit on Translational Bioinformatics, San Francisco, CA, to appear, March 2008.
M. Hauskrecht, R. Pelikan. Inter-session reproducibility measures for high-throughput data sources. In Proceedings of the Summit on Translational Bioinformatics, San Francisco, CA, to appear, March 2008.
R. Pelikan, W. Bigbee, D. Malehorn, J. Lyons-Weiler, and M. Hauskrecht.
Intersession Reproducibility of Mass Spectrometry Proteomic Profiles
and its Effect on the Accuracy of Multivariate Classification Models. Bioinformatics, 23
(22), pp. 3065-3072, 2007.
In this study we examine the reproducibility of proteomic profiles generated
by surface-enhanced laser desorption/ionization time-of-flight
mass spectrometry (SELDI-TOF-MS) across multiple data-generation
sessions. We analyze the problem in terms of the reproducibility of
signals, reproducibility of discriminative features, and reproducibility
of multivariate classification models on profiles for serum samples
from early lung cancer and healthy control subjects. We show that combining data
from multiple sessions introduces additional (inter-session) noise.
While additional noise can affect the discriminative analysis, we show
that its average effect on profiles in our study is relatively small.
Moreover, for the purposes of prediction on future (previously unseen)
data, classifiers trained on multi-session data are able to adapt to
inter-session noise and improve their classification accuracy.
M. Hauskrecht, R. Pelikan, M. Valko, J. Lyons-Weiler.
Feature selection and dimensionality reduction in genomics and
proteomics. Invited chapter. In Fundamentals of Data Mining in Genomics and
Proteomics, eds. Berrar, Dubitzky, Granzow. pp. 149-172, Springer, Fall 2006.
The invited chapter in a new book on the analysis of genomics and proteomics data focuses on the state of the art in feature selection
and its role in identifying disease biomarkers and building multivariate predictive models. The chapter introduces basic concepts,
surveys existing feature selection methods, points out problems typically encountered, and gives recommendations on which methods to use. The
concepts, methods and issues are illustrated on an MS proteomic dataset collected for a pancreatic cancer study.
T. Jahnukainen, D. Malehorn, M. Sun, J. Lyons-Weiler, W. Bigbee, G. Gupta, R. Shapiro, P. Randhawa, R. Pelikan, M. Hauskrecht, A. Vats.
Proteomic Analysis of Urine in Kidney Transplant Patients with BK
Virus Nephropathy. Journal of the American Society of Nephrology (JASN), 2006.
The paper applies statistical machine learning methods to identify potential biomarkers
that distinguish BK virus nephropathy (BKVAN) from acute kidney transplant rejection in high-throughput MS proteomic profiles.
These two processes are hard to distinguish using standard assays. Because of the relatively small sample size, we use
multiple ML models and validation methods to confirm the presence of a discriminative signal in the MS profile data.
M. Hauskrecht, R. Pelikan, W.L. Bigbee, D. Malehorn,
M.T. Lotze, H.J. Zeh, D.C. Whitcomb, and J. Lyons-Weiler.
Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles,
Applied Bioinformatics, 4:4, pp. 227-246, 2005.
The paper addresses one of the central problems of ML: selection of
a small set of features from high-dimensional inputs that permits learning of a more reliable classification model.
The new technique developed in the paper reduces the dimensionality of the data using univariate differential
scores and observed local correlations among features. The method is particularly suitable for predicting
outcomes in high-dimensional datasets that exhibit significant correlative effects, such as high-throughput
proteomics datasets, and is more computationally efficient than wrapper methods.
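One way to picture how univariate scores and correlation information can be combined is a simple filter: rank features by a univariate score, then keep a feature only if it is not strongly correlated with one already kept. The sketch below is a generic illustration and not the algorithm from the paper; the Welch-style t score, the max_corr threshold, and all function names are assumptions made for the example.

```python
import numpy as np

def univariate_scores(X, y):
    """Absolute Welch-style t statistic per feature for labels y in {0, 1}."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def select_decorrelated(X, y, n_features, max_corr=0.8):
    """Greedily pick top-scoring features, skipping ones correlated with earlier picks."""
    order = np.argsort(-univariate_scores(X, y))      # best score first
    corr = np.abs(np.corrcoef(X, rowvar=False))       # feature-by-feature |r|
    selected = []
    for j in order:
        if all(corr[j, i] <= max_corr for i in selected):
            selected.append(int(j))
        if len(selected) == n_features:
            break
    return selected

# Synthetic data: feature 0 is informative, feature 1 is a near-copy of it.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 20))
X[:, 0] += 2.0 * y
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=60)
selected = select_decorrelated(X, y, n_features=3)
```

Because the filter scores each feature once and computes a single correlation matrix, it never retrains a classifier during selection, which is the source of the cost advantage over wrapper methods.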
J. Lyons-Weiler, R. Pelikan, H.J. Zeh III,
D.C. Whitcomb, D.E. Malehorn, W.L. Bigbee and M. Hauskrecht. Assessing the
Statistical Significance of the Achieved Classification Error of
Classifiers Constructed Using Serum Peptide Profiles and a
Prescription for Random Resampling Repeated Studies for Massive
High-Throughput Genomic and Proteomic Studies, Cancer Informatics, 1:1, pp. 53-77, 2005.
The paper focuses on an important ML problem: the validation of
discriminative performance of multivariate classification models and their significance. The computational method
developed in the paper (called PACE) builds upon the permutation test framework and lets the researcher eliminate
classification results that can occur by chance. The method is especially useful for assessing the trustworthiness of weak
classifiers when discriminative signals are hard to distinguish from noise.
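The permutation-test idea that PACE builds on can be sketched as follows: compute the classifier's error on the true labels, recompute it many times on randomly permuted labels, and ask how often chance alone does at least as well. This is not the PACE procedure itself; the nearest-centroid classifier, the leave-one-out error estimate, and the function names are stand-ins assumed for illustration.

```python
import numpy as np

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (stand-in model)."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        X_train, y_train = X[keep], y[keep]
        c0 = X_train[y_train == 0].mean(axis=0)
        c1 = X_train[y_train == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        mistakes += int(pred != y[i])
    return mistakes / n

def permutation_p_value(X, y, n_permutations, rng):
    """Share of label permutations whose error is at least as low as the observed one."""
    observed = loo_error(X, y)
    null = np.array([loo_error(X, rng.permutation(y)) for _ in range(n_permutations)])
    # The +1 terms count the observed labeling as one of the permutations.
    p = (1 + np.sum(null <= observed)) / (1 + n_permutations)
    return observed, p

# Synthetic data with a real signal in the first three features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 10))
X[:, :3] += 2.0 * y[:, None]
observed, p = permutation_p_value(X, y, n_permutations=200, rng=rng)
```

A small p suggests the achieved error is unlikely to arise by chance; a classifier whose error falls well inside the permutation distribution would be discarded.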
R. Pelikan, M. Lotze, J. Lyons-Weiler, D. Malehorn, and
M. Hauskrecht. Serum Proteomic
Profiling and Analysis. In Lotze MT,
Thomson AW, eds. Measuring Immunity: Basic Biology and Clinical
Applications, Elsevier, London, 2004.
A light introduction to MS proteomic profiling for disease detection, describing
the data generated by the MS technology, preprocessing issues and the problem of construction of
predictive models from patients' spectra.
R. Pelikan and M. Hauskrecht. In-silico protein identification methods
for whole-sample proteomics. IEEE Transactions on Computational Biology and Bioinformatics, submitted.
The paper develops a new protein identification method for whole-sample MS proteomics that relies on knowledge of the mass of a molecule and its
expected abundance in the specimen. We develop a new probabilistic score that captures these two aspects of the problem and a new dynamic
programming procedure that assigns protein labels to peaks. We test the method on data from a virtual mass spectrometer and on serum proteomic
profiles with spiked-in protein mixtures.
Presentations of the CS group (in chronological order):
The web page is updated by milos.