L2 Syntactic Complexity Analyzer

Xiaofei Lu


L2 Syntactic Complexity Analyzer is designed to automate the syntactic complexity analysis of written English language samples produced by advanced learners of English, using 14 measures proposed in the second language development literature. The analyzer takes a written English language sample in plain text format as input and generates 14 indices of the sample's syntactic complexity. This software is an implementation of the system described in:

You might also be interested in the following papers, which used or discussed L2SCA:

The analyzer is implemented in Python and runs on UNIX-like systems (Linux, Mac OS X, or UNIX) with Java 1.6 and Python 2.5 or higher installed. A minimum of 2GB of memory is recommended. The analyzer takes a plain text file as input and counts the frequency of the following 9 structures in the text:

  • words (W)
  • sentences (S)
  • verb phrases (VP)
  • clauses (C)
  • T-units (T)
  • dependent clauses (DC)
  • complex T-units (CT)
  • coordinate phrases (CP)
  • complex nominals (CN)

From these counts, it computes the following 14 syntactic complexity indices of the text:

  • mean length of sentence (MLS)
  • mean length of T-unit (MLT)
  • mean length of clause (MLC)
  • clauses per sentence (C/S)
  • verb phrases per T-unit (VP/T)
  • clauses per T-unit (C/T)
  • dependent clauses per clause (DC/C)
  • dependent clauses per T-unit (DC/T)
  • T-units per sentence (T/S)
  • complex T-unit ratio (CT/T)
  • coordinate phrases per T-unit (CP/T)
  • coordinate phrases per clause (CP/C)
  • complex nominals per T-unit (CN/T)
  • complex nominals per clause (CN/C)

The analyzer calls the Stanford parser (Klein & Manning, 2003) to parse the input file and Tregex (Levy & Andrew, 2006) to query the parse trees. Both the Stanford parser and Tregex are bundled with this download and installation, along with the appropriate licenses.
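Each of the 14 indices is a simple ratio of two of the 9 structure counts, as the index names suggest (e.g., mean length of T-unit is words divided by T-units). The following Python 3 sketch shows the arithmetic; the function name and input format are illustrative, not part of L2SCA's actual interface:

```python
def compute_indices(w, s, vp, c, t, dc, ct, cp, cn):
    """Compute the 14 syntactic complexity indices from the 9 structure
    counts: words, sentences, verb phrases, clauses, T-units, dependent
    clauses, complex T-units, coordinate phrases, and complex nominals."""
    def ratio(num, den):
        # Guard against division by zero for empty or unparsable texts.
        return num / den if den else 0.0

    return {
        "MLS":  ratio(w, s),    # mean length of sentence
        "MLT":  ratio(w, t),    # mean length of T-unit
        "MLC":  ratio(w, c),    # mean length of clause
        "C/S":  ratio(c, s),    # clauses per sentence
        "VP/T": ratio(vp, t),   # verb phrases per T-unit
        "C/T":  ratio(c, t),    # clauses per T-unit
        "DC/C": ratio(dc, c),   # dependent clauses per clause
        "DC/T": ratio(dc, t),   # dependent clauses per T-unit
        "T/S":  ratio(t, s),    # T-units per sentence
        "CT/T": ratio(ct, t),   # complex T-unit ratio
        "CP/T": ratio(cp, t),   # coordinate phrases per T-unit
        "CP/C": ratio(cp, c),   # coordinate phrases per clause
        "CN/T": ratio(cn, t),   # complex nominals per T-unit
        "CN/C": ratio(cn, c),   # complex nominals per clause
    }
```

For example, a 100-word sample with 10 sentences and 12 T-units yields MLS = 10.0 and MLT ≈ 8.33.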

Download the original implementation of L2SCA

Other ways to access L2SCA

Frequently asked questions

  • Why are there more words than what MS Office Word tells me? You may see a discrepancy between the word count returned by the analyzer and that returned by other tools (e.g., MS Office Word). This is because, in the tokenization process, contracted forms such as I'd, can't, wasn't, etc. are separated into two tokens, each of which is counted as a word.
  • Why do I get a list of 0's? If the output file contains a list of 0's, first make sure that you called the Python script from within the L2SCA-2016-06-30 folder, not from outside it. Then make sure that your input file is a valid plain text file and that it can be successfully parsed by the Stanford parser. You can check this by following the instructions in the README.txt file in the stanford-parser-full-2014-01-04 directory. The parser may fail either because your input file is not a clean plain text file or because there is not enough memory. In the latter case, you will receive an out-of-memory message, and you can increase the memory limit by modifying the lexparser.sh file in the stanford-parser-full-2014-01-04 directory. If the file cannot be parsed by the parser, the analyzer can do nothing but return a list of 0's.
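The contraction splitting mentioned above follows the Penn Treebank convention used by the Stanford tokenizer. A deliberately simplified Python 3 sketch of the idea (not the actual tokenizer, which handles many more cases):

```python
import re

def toy_tokenize(text):
    """Rough sketch of Penn-Treebank-style contraction splitting."""
    # Split n't off the verb: "can't" -> "ca", "n't"; "wasn't" -> "was", "n't"
    text = re.sub(r"n't\b", " n't", text)
    # Split other clitics off the host word: "I'd" -> "I", "'d"
    text = re.sub(r"'(d|s|m|re|ve|ll)\b", r" '\1", text)
    return text.split()

# e.g. toy_tokenize("I'd say he can't swim")
# -> ['I', "'d", 'say', 'he', 'ca', "n't", 'swim']
```

The example sentence yields 7 tokens for the analyzer, while MS Office Word would count 5 words, which accounts for the discrepancy.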
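For the out-of-memory case, the limit is set by the -mx heap-size flag on the java invocation inside lexparser.sh. The exact line differs between parser versions, so treat the following as an illustration of the edit rather than the literal file contents:

```shell
# In stanford-parser-full-2014-01-04/lexparser.sh, locate the java command
# and raise its -mx heap-size value, e.g. to 2 GB:
#
#   before:  java -mx150m -cp ... edu.stanford.nlp.parser.lexparser.LexicalizedParser ...
#   after:   java -mx2g   -cp ... edu.stanford.nlp.parser.lexparser.LexicalizedParser ...
```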