Saurabh Kataria
 
Ph.D. Candidate
(Advisor: Professor Prasenjit Mitra)
College of Information Science & Technology
Pennsylvania State University

Phone: +1-814-876-0852
Office: 312 IST
Email: first name [at] psu [dot] edu

http://personal.psu.edu/ssk164/












Research Interests
  • Citation Recommendation
  • Statistical Machine Learning
  • Information Extraction
Education
Publications
    Refereed Conferences
     
  1. Saurabh Kataria, K. Kumar, R. Rastogi, P. Sen, S. Sengamedu. Entity Disambiguation with Hierarchical Topic Models. To appear in 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, (KDD-2011), August 21-24, 2011, San Diego, CA. pdf

  2. Cornelia Caragea, Adrian Silvescu, Saurabh Kataria, Doina Caragea, Prasenjit Mitra. Text Classification using Abstract Features. To appear in Symposium on Abstraction, Reformulation, and Approximation, (SARA-2011), July 17-18, 2011, Parador de Cardona, Spain.

  3. Saurabh Kataria, P. Mitra, C. Caragea, C. Lee Giles. Context Sensitive Topic Models for Author Influence. To appear in 22nd International Joint Conference on Artificial Intelligence (IJCAI-2011), Barcelona, Spain, July 16-22, 2011. pdf

  4. Saurabh Kataria, Luca Marchesotti, Florent Perronnin. Font retrieval on a large scale: an experimental study. In Proceedings of 2010 IEEE 17th International Conference on Image Processing, (ICIP-2010), Hong Kong, 2010. pdf

  5. Saurabh Kataria, P. Mitra, Sumit Bhatia. Utilizing Context in Generative Bayesian Models for Linked Corpus. In Association for the Advancement of Artificial Intelligence (AAAI-2010). pdf

  6. Saurabh Kataria, W. Brouwer, P. Mitra, C. Lee Giles. Automatic Extraction of Data Points and Text Blocks from 2-Dimensional Plots in Digital Documents. In Association for the Advancement of Artificial Intelligence, (AAAI-2008). pdf

  7. W. Brouwer, Saurabh Kataria, S. Das, P. Mitra, C. Lee Giles. Segregating and extracting overlapping data points in two-dimensional plots. In Joint Conference on Digital Libraries, (JCDL-2008). pdf

    Journals
     
  8. X. Lu, Saurabh Kataria, W. Brouwer, J. Z. Wang, P. Mitra, C. Lee Giles. Automated Analysis of Images in Documents for Intelligent Document Search. In International Journal on Document Analysis and Recognition, (IJDAR-2008). pdf

    Workshops
     
  9. Saurabh Kataria, P. Mitra, C. Lee Giles. Generative models for authorship networks. In Machine Learning for Social Computing, Neural Information Processing Systems, 2010 (NIPS-MLSC). pdf

  10. Saurabh Kataria. On Utilization of Information Extracted from Graph Images in Digital Documents. In Bulletin of IEEE Technical Committee on Digital Libraries, (JCDL-2008). pdf


    (Publication List in DBLP)

Projects
  1. RefSeer: RefSeer supplements the digital library project CiteSeerX with the capability to recommend citations based on a short description of the author's scientific interest. The underlying training algorithms learn content-citation and content-document associations simultaneously; given a query describing an interest, the system recommends the citations that are most probable for it. The context of citations is used exclusively while learning the associations offline.
    This project has an online system here. The publications associated with the project are [3,5] above.

  2. Entity disambiguation using crowd-sourced catalogue: Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this project, I focused on applying statistical machine learning models (topic models, to be specific) to the annotation task. I found that this approach not only gives a coherent way of bridging entity-content association to entity-id association but also improves upon existing entity disambiguation approaches based on a surrounding context window.
    This project was done in Summer 2010 at Yahoo Labs, India. The publication associated with the project is [1] above.

  3. Font retrieval on a large scale: In this project, I focused on font retrieval using a query-by-example paradigm: given a font, retrieve the most visually similar fonts. A font is described by (a) rendering a set of reference characters, (b) extracting a feature vector for each reference character and (c) concatenating the character-level descriptors. The similarity between two fonts is simply the similarity between their vectorial representations. The main challenge in this project was to extract features that are scale invariant and most descriptive of an underlying font image. The descriptors chosen for evaluation were drawn from the literature on typed and handwritten text analysis. An important conclusion from experiments on approx. 9000 fonts was that the SIFT descriptor, which was shown to be state-of-the-art for object recognition in photographs and for handwriting recognition, yields the best results for font retrieval.
    This project was done in Summer 2009 at Xerox Research Center, Europe. The publication associated with the project is [4] above.

  4. Information Extraction from scientific charts: Findings in scientific studies are usually reported as charts such as 2-D line, bar and pie charts. Digital library search engines such as CiteSeerX and ChemXSeer tend to underutilize this source of information while indexing their content. The primary focus of this project is to give digital libraries the capability to search for charts, which requires identification and classification of the charts, information extraction, and ranking of the charts relevant to user queries.
    The publications associated with the project are [6,7,8] above.

Collaborators


Last Update: June/11/2011