APLNG 597E: Introduction to Corpus Linguistics

Spring 2007

Pennsylvania State University

 

General Information

Instructor:      Xiaofei Lu

Office:            301 Sparks Building

Mailbox:        305 Sparks Building

Phone:            (814) 865-4692

Email:             xxl13 at psu dot edu

Webpage:      http://www.personal.psu.edu/xxl13/teaching/sp07/apling597e

Lectures:        T R 1:00-2:15pm, 069 Willard

Office hours: T 2:30-4:30pm and by appointment

 

Required books

1.      Douglas Biber, Susan Conrad, and Randi Reppen (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.

2.      Allen Downey, Jeff Elkner, and Chris Meyers (2002). How to Think Like a Computer Scientist: Learning with Python. Green Tea Press.

3.      Graeme Kennedy (1998). An Introduction to Corpus Linguistics. Longman.

4.      Martin Wynne (Ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxbow Books.

 

Course Objectives

This course provides a hands-on introduction to the use of large text corpora in the study of language. Specifically, the course aims to help students:

  • Understand the advantages and limitations of corpus-based linguistic research
  • Become proficient in using existing computational tools and quantitative methods for corpus compilation, annotation, and analysis
  • Begin developing their own computational tools for text processing based on individual research needs
  • Engage with and produce research using large text corpora that connects with their own research interests

 

Course Outline

This course is organized around the following five topics.

  • Introduction. We will start the course with a discussion of the theoretical and historical background of corpus-based linguistic research, the principles for corpus design, and tools for corpus compilation.
  • UNIX and Python. In the second part of the course, we will introduce some useful UNIX tools and basic Python programming for text processing. Students should apply for a UNIX account with ITS. 
  • Corpus annotation. Next, we will spend a few weeks on the issues, algorithms, and tools for corpus annotation at increasingly complex linguistic levels (e.g., sentence segmentation, tokenization, and part-of-speech (POS) tagging).
  • Corpus analysis. The fourth part of the course will focus on the quantitative methods and tools for various levels of corpus analysis (e.g., word frequency, collocation, and keyword analysis).
  • Case studies. We will conclude the course by examining classical case studies that employ large text corpora to address theoretically and pedagogically interesting linguistic issues.
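
As a rough illustration of the kind of task covered in the annotation and analysis parts of the course, the following minimal Python sketch tokenizes a short text and counts word frequencies. The sample sentence, the regular expression, and the function names are assumptions made for illustration only; they are not course materials, and real corpora call for more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a token is a run of ASCII letters and apostrophes.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Map each token in the text to the number of times it occurs."""
    return Counter(tokenize(text))


# Hypothetical sample text standing in for a corpus file.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# Print the three most frequent tokens with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Replacing the sample string with the contents of a text file would give a first word-frequency list for that file.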

 

Course Requirements

  • Participation (10%). Students are expected to read all required readings and actively participate in in-class discussions.
  • Presentations (15%). Each student will join a two-person panel that co-presents the case study papers for one class in consultation with the instructor (10%) and will also give a 10-minute presentation on his or her research project proposal (5%). Presentations will be evaluated on the effectiveness of their organization and style.
  • Labs (50%). This course has a very strong practical component. There will be a number of labs designed to help students master the skills and techniques for corpus processing, annotation and analysis. Collaborative work is encouraged for the computational aspects of labs, but students are expected to write up all lab reports independently.
  • Research project proposal (25%). At the end of the course, students are expected to produce a 10-page proposal for a research project that involves the use of corpora. The proposal is due in class on Thursday, 5/3/2007. 

 

Make-up Policy

  • Unless accompanied by official documentation of an acceptable reason, late submissions of labs, take-home assignments, and the research project proposal may be subject to a 10% penalty for each day late (including Saturdays and Sundays).
  • If you have to miss a scheduled presentation for an acceptable reason, please notify me immediately so that your presentation may be rescheduled.

 

Academic Misconduct

All suspected academic dishonesty (e.g., plagiarism or faking data or analysis) will be reported to the Academic Integrity Committee and, if verified, will be subject to academic and/or disciplinary sanctions.

 

Tentative Schedule

 

W   D  Date  Topic                                                  Readings                            Presenters
1   T  1/16  Intro                                                  Kennedy (1998): Ch1; 2.1-2.4
    R  1/18  Corpus design & compilation                            Kennedy (1998): 2.5-2.7
                                                                    Wynne (2005): Ch1
2   T  1/23  UNIX tools                                             Brew & Moens (2002): Ch3
                                                                    Church: UNIX for Poets
    R  1/25  UNIX tools
3   T  1/30  Lab 1: UNIX tools
    R  2/1   Python 1                                               Downey et al. (2002): Ch1-4
4   T  2/6   Lab 2: Python
    R  2/8   Python 2                                               Downey et al. (2002): Ch5-8
5   T  2/13  Lab 3: Python
    R  2/15  Python 3                                               Downey et al. (2002): Ch9-11
6   T  2/20  Lab 4: Python
    R  2/22  Annotation overview                                    Wynne (2005): Ch2
                                                                    Kennedy (1998): 4.1
7   T  2/27  Sentence & word segmentation                           Grefenstette & Tapanainen (1994)
    R  3/1   Lab 5: Sentence & word segmentation
8   T  3/6   POS tagging                                            Lu (2005): pp. 26-41
                                                                    Schmid (1994)
    R  3/8   Lab 6: POS tagging (TreeTagger, Penn Treebank Tagset)
9            Spring Break
10  T  3/20  Corpus analysis overview                               Teubert (2005)
                                                                    Biber et al. (1998): Ch1; IV.6
                                                                    Kennedy (1998): 4.2.1; 3.1.1-3.1.2
    R  3/22  Lexical analysis                                       Biber et al. (1998): Ch2
                                                                    Rayson et al. (2004)
11  T  3/27  Lab 7: Lexical analysis
    R  3/29  Collocation analysis                                   Kennedy (1998): 4.2.2-4.2.3; 3.1.3
                                                                    Manning & Schütze (1999): Ch5
12  T  4/3   Lab 8: Collocation analysis
    R  4/5   Grammar & lexico-grammar                               Kennedy (1998): 3.2-3.3             Tom, Tracy
13  T  4/10  Grammar & lexico-grammar                               Biber et al. (1998): Ch3-4          Park, Wendy
    R  4/12  Lab 9: Grammar & lexico-grammar
14  T  4/17  Variation                                              Biber et al. (1998): Ch6            Davi, Hyewon
                                                                    Kennedy (1998): 3.5
    R  4/19  Discourse & stylistic analysis                         Biber et al. (1998): Ch5 & Ch8      Nathan, So-Eun
15  T  4/24  Lab 10: Variation and stylistic analysis
    R  4/26  Lang acquisition; applications                         Biber et al. (1998): Ch7            Jie, Wei
                                                                    Kennedy (1998): Ch5
16  T  5/1   Research project presentations                                                             Tom, Tracy, Park, Davi, Hyewon
    R  5/3   Research project presentations; proposals due                                              Jie, Nathan, So-Eun, Wei

 

 

Additional References

1.      Brew, C. and M. Moens (2002). Data-Intensive Linguistics. Manuscript.

2.      Church, K. UNIX for Poets. AT&T Research.

3.      Grefenstette, G. and P. Tapanainen (1994). What is a word, what is a sentence? Problems of tokenization. In Proceedings of the Third Conference on Computational Lexicography and Text Research (COMPLEX-94). Budapest, Hungary.

4.      Lu, X. (2005). Candidacy exam. The Ohio State University.

5.      Manning, C. D. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

6.      Rayson, P., D. Berridge and B. Francis (2004). Extending the Cochran rule for the comparison of word frequencies between corpora. In Proceedings of the 7th International Conference on Statistical Analysis of Textual Data, pp. 926-936. Louvain-la-Neuve, Belgium.

7.      Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49. Manchester, England.

8.      Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics 10(1), 1-13.