Machine learning and data mining for bioinformatics applications
High-throughput genomic and proteomic profiling holds great
promise for improving the early detection and diagnosis of many
diseases and for optimizing therapeutic options for patients.
The high-throughput nature of genomic and proteomic data, however,
presents a number of computational difficulties. The number of
potential biomarkers (genes, MS profile peaks) that discriminate
between disease and control samples can be large, but only a few of
these may carry a biologically significant signal. Identifying
features (signals) that are likely to provide useful information about a disease,
as well as the relations among these features, remains an important open research question.
Our aim is to develop and advance computational machine learning solutions that scale up
well to the high-dimensional data characteristic of bioinformatics data sources. Specific
problems we are interested in include discovery and validation of potential disease biomarkers,
component (latent variable model) analysis of data, and construction of multivariate
classification models for early detection or diagnosis of diseases based on such data.
Research:
- Learning and identification of (hidden) regulatory pathways from DNA
microarray data. Variational approximations for Bayesian
inference and learning of latent variable models from observational data.
- Development of statistical machine learning algorithms for analysis of
high-throughput proteomic and genomic data. Methods for
sample preprocessing, univariate and multivariate feature selection,
classification model learning, model evaluation, permutation-based
validation. We have tested our methods on numerous proteomics datasets generated for
Pancreatic (UPCI), Lung (UPCI, Vanderbilt), Melanoma (UPCI) cancers, Lung disease (UPMC), Kidney transplant (UPMC),
Caries (NIH), Diabetes (Harvard), and Childhood arthritis (UPMC) studies.
- Computational protein ID methods for whole-sample proteomics. Enhancement of protein ID methods based
on additional information about proteins mined from protein databases or literature.
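The permutation-based validation mentioned in the list above can be illustrated with a minimal sketch. It is not the group's implementation: the nearest-class-mean classifier, the single-feature input, and the names `loo_error` and `permutation_p_value` are assumptions made only for this example. The idea is to compare the error achieved on the true labels with the errors obtained after the labels are randomly shuffled.

```python
import random
from statistics import mean


def loo_error(values, labels):
    """Leave-one-out error of a nearest-class-mean classifier on one feature.

    The classifier is a hypothetical stand-in; any model could be plugged in.
    """
    errors = 0
    for i in range(len(values)):
        rest = [(v, y) for j, (v, y) in enumerate(zip(values, labels)) if j != i]
        # Class means estimated without the held-out sample.
        means = {cls: mean(v for v, y in rest if y == cls)
                 for cls in set(y for _, y in rest)}
        predicted = min(means, key=lambda cls: abs(values[i] - means[cls]))
        errors += predicted != labels[i]
    return errors / len(values)


def permutation_p_value(values, labels, n_perm=500, seed=0):
    """Estimate how often shuffled labels do at least as well as the true ones."""
    rng = random.Random(seed)
    observed = loo_error(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_error(values, shuffled) <= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (1 + at_least_as_good) / (1 + n_perm)


# Toy data: one well-separated feature, six controls (0) and six cases (1).
values = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
labels = [0] * 6 + [1] * 6
observed, p_value = permutation_p_value(values, labels)
```

A small p-value suggests that the achieved error is unlikely to arise by chance; with weak classifiers or small samples, more permutations give a finer-grained estimate.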
CS people who worked on the project:
- Richard Pelikan, now a postdoc at USC
- Michal Valko, now a research scientist at INRIA, France
Project funding:
- Department of Defense. TATRC. Proteomics and bioinformatics core facility (PI: Becich)
- NCI. SPORE in Lung Cancer (PI: Siegfried)
Project collaborators:
- William Bigbee, PhD, Professor, Department of Epidemiology and the
Center for Environmental and Occupational Health and Toxicology,
Graduate School of Public Health, University of Pittsburgh, director
of the Proteomics Core Laboratory at UPCI
- Michael Becich, MD, PhD,
Chair, Department of Biomedical Informatics, University of Pittsburgh, Professor of Biomedical Informatics,
Pathology, Information Sciences and Telecommunication.
- Roger Day, ScD, Assistant Professor, Department of Biomedical Informatics, University of Pittsburgh
- Vanathi Gopalakrishnan, PhD, Assistant Professor, Department
of Biomedical Informatics, University of Pittsburgh.
- Michael Lotze, MD, Director, Clinical and Translational Research
Molecular Medicine Institute, University of Pittsburgh School of Medicine
- James Lyons-Weiler, PhD, Director of Bioinformatics Analysis Services, UPMC
- David Malehorn, PhD, Clinical Proteomics Facility, UPCI
Publications:
- Learning and identification of (hidden) regulatory signals:
- X. Lu, M. Hauskrecht, R.S. Day.
Modeling cellular processes with
variational Bayesian cooperative vector quantizer.
In the Proceedings
of the Pacific Symposium on Biocomputing (PSB), 2004.
The paper develops a
new Bayesian latent variable framework for modeling high-dimensional continuous data using a low-dimensional binary representation.
The model belongs to the family of factor analysis models and lets us automatically discover hidden binary components
that drive the expression of continuous high-dimensional signals in the data. The model can be used to
represent complex multivariate probability densities, to perform component (cluster) analysis,
or to achieve dimensionality reduction. Because exact inference and learning in the model are hard, we developed variational
approximation methods for both. We tested the model on two problems: the decomposition of overlapping image signals
and the identification of possible hidden regulatory pathways (signals) that drive the expression of genes in the cell. The Stanford cell
cycle data were used in the second experiment.
- X. Lu, M. Hauskrecht, R.S. Day.
Variational Bayesian learning of
the cooperative vector quantizer model. Part I: The Theory.
Technical
Report, Center for Biomedical Informatics, CBMI-02-181, 2002.
Develops the theory for the Bayesian cooperative vector
quantizer model. It includes detailed derivations needed for the variational approximation.
- Biomarker discovery and classification model learning for
high-throughput MS proteomic datasets
- TC. Hart, PM. Corby, M. Hauskrecht, OH Ryu, R. Pelikan, M. Valko, MB. Oliveira, GT. Hoehn, and WA. Bretz.
Identification of Microbial and Proteomic Biomarkers in Early Childhood Caries,
International Journal of Dentistry, 2011
- M. Valko, R. Pelikan and M. Hauskrecht.
Learning predictive models for multiple heterogeneous proteomic data sources.
In the Proceedings of the Summit on Translational Bioinformatics, San Francisco, CA, March 2008.
- M. Hauskrecht, R. Pelikan.
Inter-session reproducibility measures for high-throughput data sources.
In Proceedings of the Summit on Translational Bioinformatics, San Francisco, CA, March 2008.
- R. Pelikan, W. Bigbee, D. Malehorn, J. Lyons-Weiler, and M. Hauskrecht.
Intersession Reproducibility of Mass Spectrometry Proteomic Profiles
and its Effect on the Accuracy of Multivariate Classification Models.
Bioinformatics, 23(22), pp. 3065-3072, 2007.
In this study we examine the reproducibility of proteomic profiles generated
by surface-enhanced laser desorption/ionization time-of-flight
mass spectrometry (SELDI-TOF-MS) across multiple data-generation
sessions. We analyze the problem in terms of the reproducibility of
signals, reproducibility of discriminative features, and reproducibility
of multivariate classification models on profiles for serum samples
from early lung cancer and healthy control subjects. We show that combining data
from multiple sessions introduces additional (inter-session) noise.
While additional noise can affect the discriminative analysis, we show
that its average effect on profiles in our study is relatively small.
Moreover, for the purposes of prediction on future (previously unseen)
data, classifiers trained on multi-session data are able to adapt to
inter-session noise and improve their classification accuracy.
- M. Hauskrecht, R. Pelikan.
Enhancing the analysis of MS proteomic profiles using prior
knowledge and past data repositories.
Proceedings of 39th Symposium on the Interface of
Computing Science and Statistics: Systems Biology, 2007.
- M. Hauskrecht, R. Pelikan, M. Valko, J. Lyons-Weiler.
Feature selection and dimensionality reduction in genomics and
proteomics. Invited chapter.
In Fundamentals of Data Mining in Genomics and
Proteomics, eds. Berrar, Dubitzky, Granzow. pp. 149-172, Springer, Fall 2006.
The invited chapter, in a book on the analysis of genomic and proteomic data, focuses on the state of the art in feature selection
and its role in identifying disease biomarkers and building multivariate predictive models. The chapter introduces basic concepts,
overviews existing feature selection methods, points out problems typically encountered, and gives recommendations on methods to use. The
concepts, methods and issues are illustrated on an MS proteomic dataset collected for a pancreatic cancer study.
- T. Jahnukainen, D. Malehorn, M. Sun, J. Lyons-Weiler, W. Bigbee, G. Gupta, R. Shapiro, P. Randhawa, R. Pelikan, M. Hauskrecht, A. Vats.
Proteomic Analysis of Urine in Kidney Transplant Patients with BK
Virus Nephropathy.
Journal of the American Society of Nephrology (JASN), 2006.
The paper applies statistical machine learning methods to identify potential biomarkers
for distinguishing BK virus nephropathy (BKVAN) from acute kidney transplant rejection in high-throughput MS proteomic profiles.
These two processes are hard to distinguish using standard assays. Because of the relatively small sample size, we use
multiple ML models and validation methods to confirm the presence of a discriminative signal in the MS profile data.
- M. Hauskrecht, R. Pelikan, W.L. Bigbee, D. Malehorn,
M.T. Lotze, H.J. Zeh, D.C. Whitcomb, and J. Lyons-Weiler.
Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles,
Applied Bioinformatics, 4:4, pp. 227-246, 2005.
The paper addresses one of the central problems of ML: selecting
a small set of features from high-dimensional inputs that permits learning of a more reliable classification model.
The new technique developed in the paper reduces the dimensionality of the data using univariate differential
scores and observed local correlations among features. The method is particularly suitable for predicting
outcomes in high-dimensional datasets that exhibit significant correlative effects, such as high-throughput
proteomics datasets, and is more computationally efficient than wrapper methods.
- J. Lyons-Weiler, R. Pelikan, H.J. Zeh III,
D.C. Whitcomb, D.E. Malehorn, W.L. Bigbee and M. Hauskrecht.
Assessing the
Statistical Significance of the Achieved Classification Error of
Classifiers Constructed Using Serum Peptide Profiles and a
Prescription for Random Resampling Repeated Studies for Massive
High-Throughput Genomic and Proteomic Studies,
Cancer Informatics, 1:1, pp. 53-77, 2005.
The paper focuses on an important ML problem: validating the
discriminative performance of multivariate classification models and assessing its statistical significance. The computational method
developed in the paper (called PACE) builds upon the permutation test framework and lets the researcher rule out
classification results that could occur by chance. The method is especially useful for assessing the trustworthiness of weak
classifiers when discriminative signals are hard to distinguish from noise.
- R. Pelikan, M. Lotze, J. Lyons-Weiler, D. Malehorn, and
M. Hauskrecht.
Serum Proteomic
Profiling and Analysis.
In Lotze MT,
Thomson AW, eds. Measuring Immunity: Basic Biology and Clinical
Applications, Elsevier, London, 2004.
A light introduction to MS proteomic profiling for disease detection, describing
the data generated by the MS technology, preprocessing issues and the problem of construction of
predictive models from patients' spectra.
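As a rough illustration of the feature selection idea described for the Applied Bioinformatics paper above (univariate differential scores combined with correlations among features), the following sketch ranks features by a simple score and greedily skips features that are highly correlated with ones already kept. It is not the method from the paper: the score, the greedy filter, the threshold `max_corr`, and all function names are assumptions chosen for this example.

```python
from statistics import mean, pstdev


def differential_score(column, labels):
    """Absolute difference of class means scaled by the summed class spreads."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def select_features(columns, labels, k, max_corr=0.9):
    """Pick up to k feature indices, best score first, skipping near-duplicates."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: differential_score(columns[j], labels),
                    reverse=True)
    kept = []
    for j in ranked:
        # Keep a feature only if it is not strongly correlated with a kept one.
        if all(abs(pearson(columns[j], columns[i])) < max_corr for i in kept):
            kept.append(j)
        if len(kept) == k:
            break
    return kept


# Toy data: feature 1 is a noisier copy of feature 0; feature 2 is weak but distinct.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    [1.0, 1.2, 0.8, 3.0, 3.2, 2.8],
    [5.0, 4.0, 6.0, 5.5, 4.5, 6.5],
]
selected = select_features(columns, labels, k=2)
```

Because each feature is scored once and compared only with the features already kept, a filter of this kind avoids retraining a classifier for every candidate subset, which is what makes wrapper methods expensive.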
- Protein ID methods for whole-sample proteomics
- R. Pelikan and M. Hauskrecht.
Efficient peak labeling algorithms for whole-sample mass spectrometry
proteomics.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008.
The paper develops a new protein ID method for whole-sample MS proteomics that relies on the knowledge of the mass of a molecule and its
expected abundance in the specimen. We develop a new probabilistic score that represents the two aspects of the problem and a new dynamic
programming procedure that assigns protein labels to peaks. We test the method on data from a virtual MS spectrometer and on serum proteomic
profiles with spiked-in protein mixtures.
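To give a feel for how dynamic programming can assign protein labels to peaks using both mass and expected abundance, here is a small sketch. It is not the algorithm or the probabilistic score from the paper. It assumes that peaks and candidate proteins are both sorted by mass, that the assignment preserves this order, that intensities and abundances are normalized to a common scale, and that a hypothetical `match_score` rewards close masses and similar abundance; `mass_tol` is likewise an illustrative parameter.

```python
def match_score(peak, protein, mass_tol):
    """Hypothetical score: positive for a good match, negative for a poor one."""
    mz, intensity = peak
    _, mass, abundance = protein
    return 1.0 - ((mz - mass) / mass_tol) ** 2 - abs(intensity - abundance)


def label_peaks(peaks, proteins, mass_tol=5.0):
    """Order-preserving assignment of proteins to peaks maximizing the total score.

    peaks:    list of (mz, intensity), sorted by mz
    proteins: list of (name, mass, expected_abundance), sorted by mass
    Returns a dict mapping peak index to protein name; unmatched peaks are omitted.
    """
    n, m = len(peaks), len(proteins)
    # best[i][j]: best total score using the first i peaks and first j proteins.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + match_score(peaks[i - 1], proteins[j - 1], mass_tol)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], match)

    # Trace back, preferring to leave a peak or protein unmatched on ties.
    labels = {}
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            labels[i - 1] = proteins[j - 1][0]
            i -= 1
            j -= 1
    return labels


# Toy spectrum: four peaks, three candidate proteins (names are placeholders).
peaks = [(1000.0, 0.9), (1500.0, 0.2), (2003.0, 0.5), (2600.0, 0.1)]
proteins = [("A", 1001.0, 0.8), ("B", 2000.0, 0.5), ("C", 3000.0, 0.3)]
assignment = label_peaks(peaks, proteins)
```

In this toy example the peaks near 1000 and 2003 receive labels, while the remaining peaks and protein "C" stay unassigned because no pairing among them has a positive score.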
Posters with our collaborators:
- William L. Bigbee, David E. Malehorn, Anna E. Lokshin, Talal El-Hefnawy,
Milos Hauskrecht, Douglas P. Landsittel, James Lyons-Weiler, Richard C. Pelikan,
Hiran C. Fernando, Rodney J. Landreneau, James D. Luketich, Joel L. Weissfeld, and Jill M. Siegfried.
Serum SELDI-TOF-MS protein expression and
Luminex xMAP marker panel profiling for lung cancer detection and
classification. 12th Annual SPORE Investigators Workshop, Baltimore MD, July 10-13,
2004.
- William L. Bigbee, David E. Malehorn, Anna E. Lokshin, Talal El-Hefnawy, Milos Hauskrecht, Douglas P. Landsittel, James Lyons-Weiler, Richard C. Pelikan, Hiran C. Fernando, Rodney J. Landreneau, James D. Luketich, Joel L. Weissfeld, and Jill M. Siegfried.
Serum SELDI-TOF-MS protein expression and
Luminex xMAP marker panel profiling for lung cancer detection and
classification. Integrated Biomedical-Informatics and Enabling Technologies Symposium. Windber, PA, August 2004.
- William Bigbee, David Malehorn, Anna Lokshin, Talal El-Hefnawy,
Milos Hauskrecht, Doug Landsittel, James Lyons-Weiler, Richard
Pelikan, Hiran Fernando, Rodney Landreneau, James Luketich, Joel
Weissfeld, Jill Siegfried. Serum SELDI-TOF-MS protein expression and
Luminex xMAP marker panel profiling for lung cancer detection and
classification. AACR: Advances in
Proteomics in Cancer Research, Key Biscayne, FL, October 6-10, 2004.
- Bigbee WL, Malehorn DE, El-Hefnawy T, Hauskrecht M, Lyons-Weiler J, Pelikan RC, Landreneau RJ,
Luketich JD, Weissfeld JL, Siegfried JM. Serum SELDI-TOF-MS protein expression profiling for lung cancer detection and classification.
Proceedings of the 96th American Association for Cancer Research Annual Meeting 2005, Anaheim, CA, 2005.
- Intersession Reproducibility and Independent Clinical Cohort Evaluation of
Lung Cancer Serum Proteomic Profiling and Classification Using
SELDI-TOF-MS. William L. Bigbee, David E. Malehorn, Talal El-Hefnawy,
Milos Hauskrecht, James Lyons-Weiler, Richard C. Pelikan, Mai Sun, Rodney
J. Landreneau, James D. Luketich, Joel L. Weissfeld, Jill M. Siegfried,
and Pierre P. Massion. Lung SPORE Midyear Meeting, Los Angeles, CA, January 2006.
- Timo Jahnukainen, David Malehorn, Gaurav Gupta, Mai Sun, James Lyons-Weiler, William Bigbee,
Parmjeet Randhawa, Richard Pelikan, Milos Hauskrecht and Abhay Vats. Proteomic Analysis of urine in kidney transplant
patients with BKV nephropathy. World Transplant Congress, 2006.
- Timo Jahnukainen, David Malehorn, Gaurav Gupta, Mai Sun, James Lyons-Weiler, William Bigbee,
Parmjeet Randhawa, Richard Pelikan, Milos Hauskrecht and Abhay Vats. Proteomic Analysis of urine in kidney transplant
patients with BKV nephropathy. Clinical and Translational Science Day, University of Pittsburgh, 2006.
The web page is updated by milos.