L2 Syntactic Complexity Analyzer
Xiaofei Lu
About
L2 Syntactic Complexity Analyzer is designed to automate syntactic complexity analysis of written English language samples produced by advanced learners of English using fourteen different measures proposed in the second language development literature. The analyzer takes a written English language sample in plain text format as input and generates 14 indices of syntactic complexity of the sample. This software is an implementation of the system described in:
- Lu, Xiaofei (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474-496.
The analyzer is implemented in Python and runs on UNIX-like systems (Linux, Mac OS, or UNIX) with Java 1.5 and Python 2.5 or higher installed. It takes a plain text file as input and counts the frequency of the following 9 structures in the text: words (W), sentences (S), verb phrases (VP), clauses (C), T-units (T), dependent clauses (DC), complex T-units (CT), coordinate phrases (CP), and complex nominals (CN). From these counts it computes the following 14 syntactic complexity indices of the text: mean length of sentence (MLS), mean length of T-unit (MLT), mean length of clause (MLC), clauses per sentence (C/S), verb phrases per T-unit (VP/T), clauses per T-unit (C/T), dependent clauses per clause (DC/C), dependent clauses per T-unit (DC/T), T-units per sentence (T/S), complex T-unit ratio (CT/T), coordinate phrases per T-unit (CP/T), coordinate phrases per clause (CP/C), complex nominals per T-unit (CN/T), and complex nominals per clause (CN/C). The analyzer calls the Stanford parser (Klein & Manning, 2003) to parse the input file and Tregex (Levy & Andrew, 2006) to query the parse trees. Both the Stanford parser and Tregex are bundled in this download, along with the appropriate licenses.
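Each of the 14 indices above is a ratio of two of the 9 structure counts, so the final step can be sketched in a few lines of Python. This is a minimal illustration and not the analyzer's own code: the names INDEX_DEFINITIONS and compute_indices are hypothetical, the sample counts are invented, and returning 0.0 for a zero denominator is an assumption made here to keep the sketch safe to run.

```python
# Hypothetical sketch: derive the 14 indices from the 9 structure counts.
# Only the labels (W, S, VP, C, T, DC, CT, CP, CN and the index names)
# come from the description above; everything else is illustrative.

# (index label, numerator count, denominator count)
INDEX_DEFINITIONS = [
    ("MLS", "W", "S"),    # mean length of sentence
    ("MLT", "W", "T"),    # mean length of T-unit
    ("MLC", "W", "C"),    # mean length of clause
    ("C/S", "C", "S"),    # clauses per sentence
    ("VP/T", "VP", "T"),  # verb phrases per T-unit
    ("C/T", "C", "T"),    # clauses per T-unit
    ("DC/C", "DC", "C"),  # dependent clauses per clause
    ("DC/T", "DC", "T"),  # dependent clauses per T-unit
    ("T/S", "T", "S"),    # T-units per sentence
    ("CT/T", "CT", "T"),  # complex T-unit ratio
    ("CP/T", "CP", "T"),  # coordinate phrases per T-unit
    ("CP/C", "CP", "C"),  # coordinate phrases per clause
    ("CN/T", "CN", "T"),  # complex nominals per T-unit
    ("CN/C", "CN", "C"),  # complex nominals per clause
]


def compute_indices(counts):
    """Return a dict mapping each index label to numerator / denominator.

    Assumption: a zero denominator yields 0.0 instead of raising an error.
    """
    indices = {}
    for label, numerator, denominator in INDEX_DEFINITIONS:
        bottom = counts[denominator]
        indices[label] = float(counts[numerator]) / bottom if bottom else 0.0
    return indices


# Invented counts for illustration only; not taken from any real sample.
sample_counts = {"W": 100, "S": 5, "VP": 15, "C": 12, "T": 8,
                 "DC": 4, "CT": 3, "CP": 2, "CN": 10}
sample_indices = compute_indices(sample_counts)
for label, _, _ in INDEX_DEFINITIONS:
    print("%s\t%.4f" % (label, sample_indices[label]))
```

Keeping the definitions in a single table makes it easy to check that every index uses the intended pair of counts, for example that complex nominals per clause divides CN, not CP, by C.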
Download
- Download L2 Syntactic Complexity Analyzer 2.4 (04/16/2012) (with Stanford Parser 2.0.1 and Tregex 2.0.2 bundled). If you are using L2SCA 2.3.1 or earlier, please download the current version, as there was a minor bug in the word count function in previous versions.
- Decompress the L2SCA-2012-04-16.tgz file using the following command: tar -xzf L2SCA-2012-04-16.tgz
- Follow the instructions in the README-L2SCA.txt file in the decompressed directory L2SCA-2012-04-16
Frequently asked questions
- Why are there more words than what MS Office Word tells me? You may see a discrepancy between the word count returned by the analyzer and that returned by other tools (e.g., MS Office Word). This is because in the tokenization process, contracted forms such as I'd, can't, wasn't, etc. are separated into two tokens and each is counted as a word.
- Why do I get a list of 0's? If the output file contains a list of 0's, first make sure that you called the Python script from within, not outside, the L2SCA-2012-04-16 folder. Then make sure that your input file is a valid plain text file and that the Stanford parser can parse it successfully. You can check this by following the instructions in the README.txt file in the stanford-parser-2012-03-09 directory. The parser may fail either because your input file is not a clean plain text file or because there is not enough memory. In the latter case, you will receive an out-of-memory message, and you can increase the memory limit by modifying the lexparser.sh file in the stanford-parser-2012-03-09 directory. If the parser cannot parse the file, the analyzer can only return a list of 0's.
- Feel free to email me with any other bugs or questions.
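The word count discrepancy described in the first question can be illustrated with a short Python sketch. It is a deliberately simplified stand-in for the parser's tokenizer and not its actual code: the function names and the list of contracted endings are assumptions made for illustration, and punctuation is not handled at all.

```python
import re

# Assumption: a small, illustrative set of contracted endings that are
# split off as separate tokens (e.g. "can't" becomes "ca" + "n't").
_CONTRACTION = re.compile(r"^(.+?)(n't|'s|'d|'m|'re|'ve|'ll)$", re.IGNORECASE)


def whitespace_word_count(text):
    """Count words the way a simple word processor might: split on spaces."""
    return len(text.split())


def split_contractions(text):
    """Split each whitespace-separated word into stem and contracted ending."""
    tokens = []
    for word in text.split():
        match = _CONTRACTION.match(word)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(word)
    return tokens


sentence = "I'd say he can't go"
tokens = split_contractions(sentence)
print(whitespace_word_count(sentence))  # words separated by spaces
print(len(tokens), tokens)              # tokens after splitting contractions
```

In this example the two contracted forms each contribute one extra token, which is the kind of difference you may see between the analyzer's word count and the count reported by a word processor.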