APLNG 596D: Computational and Statistical Methods for Corpus Analysis

2009 Summer Institute in Applied Linguistics

Pennsylvania State University

 

General Information

 

Instructor:        Xiaofei Lu

Office:              301 Sparks Building

Mailbox:          305 Sparks Building

Phone:             (814) 865-4692

Email:             xxL13 AT psu DOT edu

Meetings:         MTRF 4:15-6:15pm, 15A Sparks

 

Course Description

 

This course provides a hands-on introduction to core and advanced computational and statistical methods for analyzing corpus data. We will first introduce state-of-the-art computational tools for text processing and linguistic annotation, and demonstrate tools for querying raw and linguistically annotated corpora to extract occurrences of specific linguistic patterns and grammatical structures. Next, we will cover essential statistical methods for analyzing and interpreting information extracted from text corpora. We will conclude with a discussion of how these methods have been combined in recent corpus-based studies and how they may be applied in student-proposed research projects. The course is highly applied, with substantial opportunities for demonstrations, exercises, and discussions. By the end of the course, students are expected to have a good grasp of the computational and statistical techniques needed to process, annotate, and analyze corpus data.

 

Course Requirements

 

For students who register for graduate credit, evaluation will be based on participation and a short take-home assignment to be distributed on Friday 7/10 and due on Friday 7/17.

 

Tentative Schedule

 

 

Day   Date      Topic                                     Resources                                                   Readings
1     M, 7/6    Overview                                  GOLD                                                        Wynne (2005): Ch1-2
2     T, 7/7    Analyzing raw data                        AntConc; CHILDES and CLAN; MICASE; BNC                      Lu (in press)
3     R, 7/9    Computer-assisted annotation              Stanford Manual Annotation Tool; UAM Corpus Tool            Granger (2003)
4     F, 7/10   POS tagging and morphological analysis    Stanford POS Tagger; MORPH                                  Biber (2006): Ch3
5     M, 7/13   Syntactic parsing                         Collins parser; Stanford Parser; D-Level Analyzer; Tregex   Lu (2009)
6     T, 7/14   Statistical analysis                      Learning SPSS
7     R, 7/16   Statistical analysis
8     F, 7/17   Putting it all together

 

Recommended readings

 

1.    Biber, Douglas (2006). University Language: A Corpus-Based Study of Spoken and Written Registers. Amsterdam: John Benjamins.

2.    Granger, Sylviane (2003). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3): 465–80.

3.    Lu, Xiaofei (2009). Automatic analysis of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics, 14(1): 3-28.

4.    Lu, Xiaofei (in press). What can corpus software reveal about language development? In Michael McCarthy & Anne O'Keeffe (eds.), Routledge Handbook of Corpus Linguistics. Oxfordshire, UK: Routledge.

5.    Wynne, Martin (ed.) (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford, UK: Oxbow Books.