APLNG 597C/ANTH 597A

Statistical Analysis of Qualitative and Corpus Data

Spring 2008

Pennsylvania State University

 

General Information

Instructors:     Xiaofei Lu & Robert Schrauf

Mailbox:         305 Sparks Building

Office:             301 / 207 Sparks Building

Phone:            865-4692 / 865-9622

Email:             xxl13 / rws23 @ psu.edu

Webpage:       All additional course information posted in ANGEL

Lectures:         Monday & Wednesday, 2:30pm-3:45pm, 009 Sparks

Office hours:  By appointment

 

Required Textbook

Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh University Press.

 

Course Description

Qualitative data and corpora include transcripts of interviews, narratives, conversations, and print materials, and working with such data requires coding and interpreting these texts. This course is designed to equip the student with the basic statistical skills necessary for testing theories and drawing conclusions from textual data and for designing visual presentations of that data.

 

Course Outline

A.  Introduction: the qual/quant continuum; mixed methods research; using statistics to analyze qualitative data.

 

B.  Transcripts as “Corpora”: the concept of corpus, criteria for building good corpora, and issues in treating transcripts, including interviews, narratives, conversations, and published texts, as corpora.

 

C.  Kinds of Analysis and Associated Software: introduction to software useful for statistical analysis of qualitative and corpus data.

a.    For linguistic analysis at the levels of word, phrase, collocation, sentence, paragraph, document, and genre:

i.        AntConc / WordSmith: examines how words, word-clusters, or phrases behave in texts, such as their frequencies, contexts of occurrence, associations with other words, and keyness in texts.

ii.      Coh-Metrix: produces indices of the linguistic and discourse representations of a text. These values can be used in different ways to investigate the cohesion of the explicit text and the coherence of the mental representation of the text.

b.    For computerized coding and analysis of transcripts in behavioral science:

i.        Review of available software

ii.      Linguistic Inquiry and Word Count (LIWC): examines standard linguistic items (nouns, pronouns, articles), psychological processes (emotions, agency), relativity (temporal relations), and personal concerns (e.g. school, religion, sexuality).

c.    For human coding of qualitative data in the social sciences:

i.        Atlas.ti / NVivo / Ethnograph: software that facilitates coding transcripts, developing code families, testing relationships, etc. (used especially with Grounded Theory approaches).

ii.      Traditional paper-and-pencil approaches (highlighters, post-its, colored files, etc.).

 

D.  Statistics in Data Collection and Preparation: In the data collection/data preparation stage, several important statistical issues arise:

a.    Sample size (i.e. determination of how many interviews or ‘texts’ are necessary for generalizability).

b.    Data matrices for data preparation, including respondent-by-item matrices, item-by-item matrices, respondent-by-respondent matrices, and unit-by-theme matrices.

c.    Intercoder reliability (i.e. methods for assessing agreement among coders in applying the codes to the text).
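As an illustration of item c, one widely used agreement statistic for two coders is Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The sketch below is illustrative only and is not taken from the course materials: the function name, the code labels, and the six-segment data set are hypothetical, and the syllabus does not say which reliability measures the course covers.

```python
from collections import Counter

def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who each assigned one code per text unit."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of units")
    n = len(codes_a)
    # Observed agreement: share of units given the same code by both coders.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over codes.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both coders use one identical code")
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes for six interview segments (made-up data, for illustration).
coder_1 = ["family", "work", "work", "health", "family", "work"]
coder_2 = ["family", "work", "health", "health", "family", "family"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 2))  # 0.52
```

Here the coders agree on 4 of 6 segments (0.67), chance agreement is 11/36 (0.31), and kappa is therefore 13/25 = 0.52.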

 

E.  Basic Statistical Concepts and Methods: concepts and methods necessary for statistical analysis of qualitative and corpus data, including describing data, comparing groups, describing relationships, log-linear modeling, and Bayesian statistics.

 

F.   Statistics for Analyzing Data and the Visual Presentation of Results

a.      Analysis of cross-classified data: two-by-two and more complex contingency tables, odds ratios and the log-linear model, and ways to graph the results.

b.      Metric scaling: co-occurrence of words or codes; analysis and visualization, including principal components analysis, multidimensional scaling (group maps and individual maps), and correspondence analysis.
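As an illustration of item a, the odds ratio for a two-by-two table with cells a, b (first row) and c, d (second row) is (a × d) / (b × c); its natural logarithm is the scale on which log-linear models express association. The sketch below is illustrative only: the function name and the counts are hypothetical and do not come from the course materials, which use Excel and SPSS rather than Python.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio and its natural log for the 2x2 table [[a, b], [c, d]]."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ratio = (a * d) / (b * c)
    return ratio, math.log(ratio)

# Hypothetical counts: rows = two speaker groups,
# columns = a given code present / absent in the transcript segment.
ratio, log_ratio = odds_ratio(20, 10, 5, 15)
print(ratio)  # 6.0
```

With these made-up counts, the odds of the code being present are six times higher in the first group than in the second.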

 

Course Requirements

Class meetings will involve hands-on work with data sets provided by the instructors, collected by the group, or volunteered by students.  For each procedure, the instructors will offer “big picture” explanations followed by step-by-step examples in the appropriate software (e.g. SPSS, Excel, or one of the specialized programs listed in this syllabus), and will assign problems as homework between classes.  The course will include three take-home exams to be worked in Excel and/or SPSS.  During the course, students will be encouraged to set up a data set and analyze it using a method of particular interest to them.  In the last several classes, students will present these projects.

 

Grading

Exams count for 75% of the grade (each of the three exams contributing 25%), and the final presentation counts for the remaining 25%.

 

Academic Misconduct

Penn State defines academic integrity as the pursuit of scholarly activity in an open, honest and responsible manner.  All students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts (Faculty Senate Policy 49-20).  Dishonesty of any kind will not be tolerated in this course. Dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work of another person or work previously used without informing the instructor, or tampering with the academic work of other students.  Students who are found to be dishonest will receive academic sanctions and will be reported to the University’s Judicial Affairs office for possible further disciplinary sanction. 

Tentative Schedule

 

Week  Date     Topic                                Readings  What’s due
1     M 01/14  Introduction
      W 01/16  Transcripts as “Corpora”
2     M 01/21  Martin Luther King Day - No Classes
      W 01/23  AntConc / WordSmith
3     M 01/28  Coh-Metrix
      W 01/30  Review of Computer Coding Software
               Linguistic Inquiry and Word Count
4     M 02/04  Atlas.ti / NVivo / Ethnograph
      W 02/06  Sample Size
5     M 02/11  Data Matrices
      W 02/13  Describing Data
6     M 02/18  Describing Data/Comparing Groups
      W 02/20  Comparing Groups
7     M 02/25  Comparing Groups                               Exam 1
      W 02/27  Describing Relationships
8     M 03/03  Describing Relationships
      W 03/05  Intercoder Reliability
9     Spring break
10    M 03/17  Loglinear Modeling
      W 03/19  Loglinear Modeling
11    M 03/24  Analysis of Cross-Classified Data              Exam 2
      W 03/26  Analysis of Cross-Classified Data
12    M 03/31  Bayesian Statistics
      W 04/02  Bayesian Statistics
13    M 04/07  Metric Scaling
      W 04/09  Metric Scaling
14    M 04/14  Metric Scaling                                 Exam 3
      W 04/16  Metric Scaling
15    M 04/21  Catch up/Final Presentations
      W 04/23  Final Presentations
16    M 04/28  Final Presentations
      W 04/30  Final Presentations