College of Information Sciences & Technology

Machine Learning and Data Mining Projects

Abstraction-Based Probabilistic Models for Sequence Classification

Recent technological advances as well as the popularity of Web 2.0 have resulted in large amounts of online data in many applications such as biomolecular sequence analysis and text classification. These applications require effective and efficient methods for classification, organization, indexing, and summarization, to facilitate retrieval of content that is tailored to the interests of specific users or groups. Hence, there is a growing need for automated methods for building predictive models from biological or text sequence data. Machine learning offers a promising approach to the design of algorithms for training computer programs to efficiently and accurately classify sequence data. However, the "bag of words" and n-gram feature representations, commonly used for sequence classification, usually result in prohibitively high-dimensional input spaces. Applying machine learning and data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Dimensionality reduction can therefore be crucial for both the performance and the complexity of the learning algorithms.

We developed an abstraction-based probabilistic approach to dimensionality reduction that reduces a model's input size by grouping "similar" features into clusters of features. Specifically, it learns an abstraction hierarchy over the set of features using hierarchical agglomerative clustering. A cut through the resulting abstraction hierarchy specifies a compressed model, where the nodes on the cut are used as "features" in the classification model. Experimental results on text classification and protein subcellular localization prediction tasks show that the abstraction-based approach can yield significantly more accurate models than the "bag of words" models while using significantly fewer features (e.g., on the task of classifying the Cora machine learning research articles, the abstraction-based models show up to 43% error reduction over the "bag of words" models while using an order of magnitude fewer features).
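The following is a minimal sketch of the idea of cutting an abstraction hierarchy, not the published implementation. It assumes each feature is described by its per-class counts and that clusters are merged greedily by a weighted Jensen-Shannon divergence between their class distributions; the distance measure, the function names, and the toy word counts are all assumptions made for illustration.

```python
import math
from itertools import combinations

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    # Kullback-Leibler divergence; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_js(counts_a, counts_b):
    """Weighted Jensen-Shannon divergence between two clusters' class distributions
    (an assumed distance for this sketch)."""
    na, nb = sum(counts_a), sum(counts_b)
    pa, pb = normalize(counts_a), normalize(counts_b)
    wa, wb = na / (na + nb), nb / (na + nb)
    mix = [wa * x + wb * y for x, y in zip(pa, pb)]
    return (na + nb) * (wa * kl(pa, mix) + wb * kl(pb, mix))

def cut_hierarchy(feature_class_counts, num_abstractions):
    """Agglomeratively merge features until `num_abstractions` clusters remain.

    Stopping early is equivalent to taking a cut with that many nodes.
    Returns a dict mapping each original feature to an abstraction id.
    """
    clusters = {frozenset([f]): list(c) for f, c in feature_class_counts.items()}
    while len(clusters) > num_abstractions:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: weighted_js(clusters[pair[0]], clusters[pair[1]]))
        merged = [x + y for x, y in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[a | b] = merged
    ordered = sorted(clusters, key=lambda members: sorted(members))
    return {f: i for i, members in enumerate(ordered) for f in members}

# Hypothetical counts of each word in documents of two classes.
toy_counts = {
    "kernel": [30, 2],
    "margin": [25, 1],
    "router": [1, 40],
    "packet": [2, 35],
}
abstraction_of = cut_hierarchy(toy_counts, num_abstractions=2)
```

In this toy setting the four words collapse into two abstract features, and a document would then be represented by counts of abstraction ids rather than counts of individual words.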

Representative publications:

Cornelia Caragea, Adrian Silvescu, Saurabh Kataria, Doina Caragea, and Prasenjit Mitra. "Classifying Scientific Publications Using Abstract Features." In: Proceedings of the Symposium on Abstraction, Reformulation, and Approximation, Parador de Cardona, Spain, 2011. [pdf]

Cornelia Caragea, Adrian Silvescu, Doina Caragea, and Vasant Honavar. "Abstraction Augmented Markov Models." In: Proceedings of the 10th IEEE International Conference on Data Mining, Sydney, Australia, 2010. [pdf]

Adrian Silvescu, Cornelia Caragea, and Vasant Honavar. "Combining Super-Structuring and Abstraction on Sequence Classification." In: Proceedings of the 9th IEEE International Conference on Data Mining, Miami, Florida, USA, 2009. [pdf]

EMERSE: Enhanced Messaging for the Emergency Response SEctor

Web 2.0 applications allow Internet users to collect, share, and disseminate information through the Web via sites such as Facebook, Twitter, and numerous blogs and forums. These micro-blogging practices have resulted in huge amounts of social media streams (e.g., tweets, news articles, images). Although most of these data contain ordinary information, there might be patterns in the data that diverge from the expected normal behavior and that are of interest to analysts. These divergent patterns (e.g., emergency situations) are referred to as anomalous events or anomalies. Detecting anomalies in the data could provide valuable, and often critical, information.

Specifically, in emergencies (e.g., earthquakes, flooding), rapid responses are needed to address victims' requests for help. Social media use around crises involves self-organizing behavior that can produce accurate results, often in advance of official communications. It allows the affected population to send tweets or text messages and, hence, make themselves heard. The ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel, is essential for enabling the personnel to address the most urgent needs in a timely and efficient manner and to understand the emergency situation better. We developed a reusable information technology infrastructure, called Enhanced Messaging for the Emergency Response Sector (EMERSE), which classifies and aggregates tweets and text messages about the Haiti disaster relief so that non-governmental organizations, relief workers, people in Haiti, and their friends and families can easily access them. The results of our experiments show that EMERSE, used around crises, helps provide rapid responses to those who need them.

Representative publications:

Cornelia Caragea, Nathan McNeese, Anuj Jaiswal, Greg Traylor, Hyun-Woo Kim, Prasenjit Mitra, Dinghao Wu, Andrea H. Tapia, C. Lee Giles, Bernard J. Jansen, John Yen. "Classifying Text Messages for the Haiti Earthquake." In: Proceedings of the 8th International Conference on Information Systems for Crisis Response and Management, Lisbon, Portugal, 2011. [pdf]

Other Projects

Cancer Informatics Initiative: Sentiment Analysis and Leader Identification in Online Health Networks

Many users join online health communities to obtain information and seek social support. Understanding the emotional impacts of online participation on patients and their informal caregivers can help provide useful insight into the design of new online health communities or enhancement of existing ones in providing better emotional support to their members. Using machine learning and text mining techniques, we developed an approach to automatically estimating the sentiment of forum posts, discovering sentiment change patterns, and allowing investigation of factors that affect the sentiment change in the cancer survivor network of the American Cancer Society. Our study shows that an estimated 75%–85% of forum participants change their sentiment in a positive direction through online interactions with other network members.

Furthermore, opinion leaders play an important role in the sustainability of the network, besides influencing other people's beliefs or actions. We designed user features such as contribution, network, and semantic features, to identify influential users in the cancer survivor network. We further exploited the structure of the network and generated neighborhood-based and cluster-based features. Classification results, using various machine learning algorithms, reveal that these features are discriminative for identification of influential users.

Representative publications:

Kang Zhao, Baojun Qiu, Cornelia Caragea, Dinghao Wu, Prasenjit Mitra, John Yen, Greta E. Greer, Kenneth Portier. "Identifying Leaders in an Online Cancer Survivor Community." In: Proceedings of the 21st Annual Workshop on Information Technologies and Systems, Shanghai, China, 2011. [pdf]

Baojun Qiu, Kang Zhao, Prasenjit Mitra, Dinghao Wu, Cornelia Caragea, John Yen, Greta E. Greer, Kenneth Portier. "Get Online Support, Feel Better-Sentiment Analysis and Dynamics in an Online Cancer Survivor Community." In: Proceedings of the Third IEEE International Conference on Social Computing, Boston, Massachusetts, USA, 2011. [pdf]

Author Influence in Document Networks

In a document network such as a citation network of scientific documents, the content produced by authors exhibits their interest in certain topics. In addition, some authors influence other authors’ interests. We modeled the influence of cited authors along with the interests of citing authors. Moreover, we hypothesized that apart from the citations present in documents, the context surrounding the citation provides extra topical information about the cited authors. We developed novel document generation schemes that incorporate the context while simultaneously modeling the interests of citing authors and influence of cited authors. Experimental results show significant improvements over baseline models for various evaluation criteria such as link prediction between document and cited author, and quantitatively explaining unseen text.

Representative publications:

Saurabh Kataria, Prasenjit Mitra, Cornelia Caragea, and C. Lee Giles. "Context Sensitive Topic Models for Author Influence in Document Networks." In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Spain, 2011. [pdf]

Bioinformatics and Computational Biology Projects

Semi-Supervised Protein Sequence Classification

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. Because of the growing gap between the rate of acquisition and the rate of manual curation, there is significant interest in semi-supervised algorithms that can exploit large amounts of unlabeled data together with limited amounts of labeled data in training protein sequence classifiers.

We introduced semi-supervised abstraction augmented Markov models (AAMMs), which are variants of Markov models (MMs). Unlike MMs, which model the dependency of each element in a sequence on the k preceding elements, AAMMs model the dependency of each element on abstractions of the k preceding elements. The abstractions are organized in an abstraction hierarchy, which groups together k-grams that induce similar conditional probabilities of the next element in the sequence. AAMMs provide a simple way to incorporate unlabeled data into the model: first, the abstraction hierarchy is constructed using both labeled and unlabeled data. Next, the labeled data are used to estimate the model parameters based on the resulting abstraction hierarchy. We compared AAMMs with MMs that can incorporate unlabeled data through an expectation maximization approach (EM-MMs). The results of our experiments show that AAMMs can make effective use of unlabeled data and significantly outperform EM-MMs when the amount of labeled data is very small and relatively large amounts of unlabeled data are readily available.
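The parameter-estimation step described above might be sketched as follows. This is an illustrative simplification, not the released implementation: it assumes the mapping from k-grams to abstractions (`abstraction_of`) has already been built from labeled and unlabeled data, uses add-one smoothing, ignores the probability of the first k elements, and assumes a uniform class prior. All names and the toy sequences are hypothetical.

```python
import math
from collections import defaultdict

def train_aamm(sequences, k, abstraction_of, alphabet, pseudocount=1.0):
    """Estimate P(next symbol | abstraction of the k preceding symbols)
    from the labeled sequences of one class."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            abstraction = abstraction_of[seq[i - k:i]]
            counts[abstraction][seq[i]] += 1.0
    model = {}
    for abstraction in set(abstraction_of.values()):
        total = sum(counts[abstraction].values()) + pseudocount * len(alphabet)
        model[abstraction] = {
            s: (counts[abstraction][s] + pseudocount) / total for s in alphabet
        }
    return model

def log_likelihood(seq, k, abstraction_of, model):
    # Sum of log P(x_i | abstraction(x_{i-k} .. x_{i-1})); the initial k-gram is ignored.
    return sum(math.log(model[abstraction_of[seq[i - k:i]]][seq[i]])
               for i in range(k, len(seq)))

def classify(seq, k, abstraction_of, class_models):
    # Uniform class prior assumed: pick the class with the highest likelihood.
    return max(class_models,
               key=lambda c: log_likelihood(seq, k, abstraction_of, class_models[c]))

# Toy setup: alphabet {A, B}, k = 2, and a hand-made abstraction that groups
# 2-grams by their last symbol (in practice the grouping would be learned).
alphabet = "AB"
abstraction_of = {"AA": 0, "BA": 0, "AB": 1, "BB": 1}
class_models = {
    "x": train_aamm(["ABABABAB"], 2, abstraction_of, alphabet),
    "y": train_aamm(["AAAABBBB"], 2, abstraction_of, alphabet),
}
label_alternating = classify("ABABAB", 2, abstraction_of, class_models)
label_constant = classify("AAAAAA", 2, abstraction_of, class_models)
```

Because parameters are tied to abstractions rather than to individual k-grams, the number of conditional distributions to estimate equals the number of abstractions on the cut, which is what makes estimation from a small labeled set feasible.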

The implementation of the semi-supervised abstraction augmented Markov models is available upon request by sending an email to ccaragea@ist.psu.edu.

Representative publications:

Cornelia Caragea, Doina Caragea, Adrian Silvescu, and Vasant Honavar. "Semi-Supervised Prediction of Protein Subcellular Localization Using Abstraction Augmented Markov Models." BMC Bioinformatics, Special Issue on Machine Learning in Computational Biology (MLCB), 2010. [pdf]

Cornelia Caragea, Adrian Silvescu, Doina Caragea, and Vasant Honavar. "Semi-Supervised Sequence Classification Using Abstraction Augmented Markov Models." In: Proceedings of the ACM Conference on Bioinformatics and Computational Biology, Niagara Falls, New York, USA, 2010. [pdf]

Identification of RNA and DNA binding sites in proteins

Protein-RNA and protein-DNA interactions play a pivotal role in protein function. Reliable identification of such interaction sites from protein sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental detection of interaction sites requires determining the structure of protein-DNA and protein-RNA complexes. However, the number of experimentally determined complexes lags far behind the number of known protein sequences.

We presented a mixture of experts approach to identifying functionally important sites from protein sequences (and when available, their structure, but not the complex). Our approach takes into account global similarity between biomolecular sequences when building the model and making the predictions. Specifically, given a set of sequences and a similarity measure defined on pairs of sequences, we learn a mixture of experts model by first using spectral clustering to learn the hierarchical (tree) structure of the model, and then by training an expert classifier at each leaf node. The internal nodes combine the outputs of the expert classifiers and propagate them up to the root of the tree, which makes the final prediction. The results of our experiments show that global sequence similarity can be exploited to improve the performance of classifiers trained to identify functionally important sites in proteins.
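As a rough illustration of how a tree structure can be derived from pairwise sequence similarities, the sketch below recursively bipartitions the sequences by the sign of the Fiedler vector of the graph Laplacian, one common form of spectral clustering. The exact spectral variant used in the published work is not described here, so this choice, the power-iteration solver, the stopping rule, and the toy similarity matrix are all assumptions. Expert classifiers would then be trained on the sequences at each leaf.

```python
import math

def fiedler_vector(similarity, iterations=200):
    """Approximate the Fiedler vector of L = D - W by power iteration on
    (shift * I - L), restricted to vectors orthogonal to the constant vector."""
    n = len(similarity)
    degree = [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]
    shift = 2.0 * max(degree)  # Gershgorin bound on the largest eigenvalue of L
    # Deterministic start vector with zero mean; it must not be orthogonal
    # to the Fiedler vector, which holds for the toy example below.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        w = []
        for i in range(n):
            lv = degree[i] * v[i] - sum(similarity[i][j] * v[j]
                                        for j in range(n) if j != i)
            w.append(shift * v[i] - lv)
        mean = sum(w) / n
        w = [x - mean for x in w]  # remove the constant component
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def build_tree(similarity, indices, leaf_size):
    """Recursively split `indices`; a leaf is a list of sequence indices."""
    if len(indices) <= leaf_size:
        return indices
    sub = [[similarity[i][j] for j in indices] for i in indices]
    v = fiedler_vector(sub)
    left = [idx for idx, x in zip(indices, v) if x < 0]
    right = [idx for idx, x in zip(indices, v) if x >= 0]
    if not left or not right:
        return indices  # degenerate split: keep as a leaf
    return (build_tree(similarity, left, leaf_size),
            build_tree(similarity, right, leaf_size))

# Toy similarity matrix: sequences 0-2 resemble each other, as do 3-5.
sim = [[1.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.05)
        for j in range(6)] for i in range(6)]
tree = build_tree(sim, list(range(6)), leaf_size=3)
```

With this input the root has two leaves, one per group of similar sequences; how the internal nodes weight and combine the leaf experts' outputs is left out of the sketch.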

The RNA- and DNA-protein interface data sets used in experiments have been compiled from structures in the Protein Data Bank and have been made available here. The implementation of the mixture of experts approach is available upon request by sending an email to ccaragea@ist.psu.edu.

Representative publications:

Cornelia Caragea, Jivko Sinapov, Drena Dobbs, and Vasant Honavar. "Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling." BMC Bioinformatics, 2009, doi:10.1186/1471-2105-10-S4-S4. [pdf]

Rasna Walia, Cornelia Caragea, Benjamin Lewis, Fadi Towfic, Yasser El-Manzalawy, Drena Dobbs, and Vasant Honavar. "Predicting Protein-RNA Interface Residues using Machine Learning Methods: A Comparative Study." Submitted to BMC Bioinformatics, September 2011 [Under Review.]

Prediction of glycosylation sites using machine learning approaches

Glycosylation is one of the most complex and ubiquitous post-translational modifications of proteins in eukaryotic cells. It is a dynamic enzymatic process in which saccharides are attached to proteins or lipoproteins, usually on serine, threonine, asparagine, and tryptophan residues. This process is clinically important because of its role in a wide variety of cellular, developmental and immunological processes, including protein folding, protein trafficking and localization, cell-cell interactions, and epitope recognition. There are four types of glycosylation based on the nature of the chemical linkage between specific acceptor residues in the protein and sugar: N-linked and O-linked glycosylation, C-mannosylation, and GPI (glycosylphosphatidylinositol) anchors. The acceptor residues represent the glycosylation sites. Experimental identification of these sites is expensive and laborious.

We developed computational methods to reliably predict glycosylation sites from protein sequences. Specifically, we explored machine learning methods for training classifiers to predict the residues that are likely to be glycosylated using information derived from the target residue and its sequence neighbors. We compared the performance of Support Vector Machines (SVMs) and ensembles of SVMs trained on a dataset of experimentally determined glycosylation sites extracted from O-GlycBase version 6.00. The results of our experiments showed that the ensembles of SVMs outperform single SVMs on the problem of predicting glycosylation sites in terms of a range of standard measures for comparing the performance of classifiers.
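To make "information derived from the target residue and its sequence neighbors" concrete, here is a small sketch of one way to turn candidate sites into fixed-length feature vectors that a classifier such as an SVM could consume. The window half-width, the choice of serine and threonine as candidate acceptors, the padding symbol, and the one-hot encoding are assumptions for illustration, not details taken from EnsembleGly.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def candidate_windows(sequence, half_width=2, acceptors="ST", pad="-"):
    """Return (position, window) pairs centered on each candidate acceptor residue.

    Windows that run past either end of the sequence are padded with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for pos, residue in enumerate(sequence):
        if residue in acceptors:
            windows.append((pos, padded[pos:pos + 2 * half_width + 1]))
    return windows

def one_hot(window, alphabet=AMINO_ACIDS):
    """Encode a window as a flat 0/1 vector; padding symbols map to all zeros."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in alphabet)
    return vector

# Hypothetical short sequence used only to show the shapes involved.
windows = candidate_windows("MKTAYSL", half_width=2)
features = [one_hot(window) for _, window in windows]
```

Each candidate site yields a vector of length (2 * half_width + 1) * 20. An ensemble could then be formed by training several classifiers on different samples of these vectors and combining their votes.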

The resulting methods have been implemented in EnsembleGly, a web server for glycosylation site prediction.

Representative publications:

Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs, and Vasant Honavar. "Glycosylation site prediction using ensembles of Support Vector Machine classifiers." BMC Bioinformatics, 2007, 8:438. [pdf]

This research was supported by grants from the National Science Foundation, the National Institutes of Health, and Lockheed Martin Corporation.