"If you keep proving stuff that others have done, getting confidence, increasing the complexities of your solutions - for the fun of it - then one day you'll turn around and discover that nobody actually did that one! And that's the way to become a computer scientist." : Richard Feynman, Feynman Lectures on Computation

Research Interests

My primary research interests are information extraction, computer vision and applied machine learning.

Over the last two years, I have been developing an architecture for understanding figures in scholarly documents. Scholarly documents usually contain line graphs, bar charts and other plots. These plots are generated from underlying data (think of how we create a line graph from a table in Excel), but that data is lost in the published paper. Our goal is to reverse engineer the process: we start with a scholarly paper (a PDF), extract the figures and their metadata, extract the data from these figures, and merge it with the context (typically the caption and mentions of the figure) to create a "searchable" natural language summary of the figure. The extracted data can also be stored in a database so that structured queries can be made.

See the natural language summary for a line graph extracted from a recent paper.
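
To make the last two steps of the pipeline concrete, here is a minimal sketch in Python. It is not the implementation from the papers below: the figure record, the table schema and the function names (store, best_curve_at, summarize) are all hypothetical, and the data values are invented. It assumes the earlier steps have already produced one list of (x, y) points per curve, then shows how such points could be stored for structured queries and turned into a simple summary sentence.

```python
import sqlite3

# Hypothetical output of the data-extraction step for one line graph.
# The figure ID, labels and values are invented for illustration.
extracted = {
    "figure_id": "paper42-fig3",
    "caption": "Precision vs. recall for two methods.",
    "x_label": "recall",
    "y_label": "precision",
    "curves": {
        "method A": [(0.1, 0.90), (0.5, 0.80), (0.9, 0.60)],
        "method B": [(0.1, 0.85), (0.5, 0.70), (0.9, 0.40)],
    },
}


def store(conn, fig):
    """Insert every extracted point as one row (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points "
        "(figure_id TEXT, curve TEXT, x REAL, y REAL)"
    )
    rows = [
        (fig["figure_id"], curve, x, y)
        for curve, pts in fig["curves"].items()
        for x, y in pts
    ]
    conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def best_curve_at(conn, figure_id, x):
    """Structured query: which curve has the highest y at a given x?"""
    return conn.execute(
        "SELECT curve, y FROM points "
        "WHERE figure_id = ? AND x = ? ORDER BY y DESC LIMIT 1",
        (figure_id, x),
    ).fetchone()


def summarize(fig):
    """Merge the caption with a one-sentence trend description per curve."""
    sentences = [fig["caption"]]
    for curve, pts in fig["curves"].items():
        (x0, y0), (x1, y1) = pts[0], pts[-1]
        if y1 > y0:
            trend = "rises"
        elif y1 < y0:
            trend = "falls"
        else:
            trend = "stays flat"
        sentences.append(
            f"For {curve}, {fig['y_label']} {trend} from {y0} to {y1} "
            f"as {fig['x_label']} goes from {x0} to {x1}."
        )
    return " ".join(sentences)


conn = sqlite3.connect(":memory:")
store(conn, extracted)
print(best_curve_at(conn, "paper42-fig3", 0.5))
print(summarize(extracted))
```

The summary text could then be indexed by an ordinary full-text search engine, while the points table answers questions that keyword search cannot, such as which method performs best at a given recall.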

Related publications

Sagnik Ray Choudhury, Shuting Wang, C. Lee Giles. Curve Separation for Line Graphs in Scholarly Documents. JCDL, 2016. paper

Sagnik Ray Choudhury, Shuting Wang, C. Lee Giles. Scalable Algorithms for Scholarly Figure Mining and Semantics. SBD, SIGMOD 2016. paper

Sagnik Ray Choudhury, Shuting Wang, Prasenjit Mitra, C. Lee Giles. Automated Data Extraction from Scholarly Color Line Graphs. GREC, 2015. paper. This paper was not published in the proceedings because we could not attend the conference to present it. An extended version is in submission, so please do not cite this version.

Sagnik Ray Choudhury, Prasenjit Mitra, C. Lee Giles. Automatic Extraction of Figures from Scholarly Documents. DocEng, 2015. paper

Sagnik Ray Choudhury, C. Lee Giles. An Architecture for Information Extraction from Figures in Digital Libraries. WWW (Companion Volume), 2015. paper

Sagnik Ray Choudhury, Suppawong Tuarob, Prasenjit Mitra, Lior Rokach, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles. A figure search engine architecture for a chemistry digital library. JCDL 2013. paper

Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles. Figure Metadata Extraction from Digital Documents. ICDAR 2013. paper

Other Publications

I have also collaborated with others on related problems, which resulted in the following publications:

Hamed Alhoori, Sagnik Ray Choudhury, Tarek Kanan, Edward Fox, Richard Furuta, C. Lee Giles. On the Relationship between Open Access and Altmetrics. iConference 2015. paper

Kyle Williams, Jian Wu, Sagnik Ray Choudhury, Madian Khabsa, C. Lee Giles. Scholarly big data information extraction and integration in the CiteSeerX digital library. ICDE Workshops 2014. paper

Zhaohui Wu, Jian Wu, Madian Khabsa, Kyle Williams, Hung-Hsuan Chen, Wenyi Huang, Suppawong Tuarob, Sagnik Ray Choudhury, Alexander Ororbia, Prasenjit Mitra, C. Lee Giles. Towards building a scholarly big data platform: Challenges, lessons and opportunities. JCDL 2014. paper

Shibamouli Lahiri, Sagnik Ray Choudhury, Cornelia Caragea. Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks. CoRR abs/1401.6571 (2014). paper

Kyle Williams, Hung-Hsuan Chen, Sagnik Ray Choudhury, C. Lee Giles. Unsupervised Ranking for Plagiarism Source Retrieval Notebook for PAN at CLEF 2013. CLEF (Working Notes) 2013. paper

Madian Khabsa, Stephen Carman, Sagnik Ray Choudhury, C. Lee Giles. A framework for bridging the gap between open source search tools. SIGIR 2012 Workshop on Open Source Information Retrieval. paper
