Lexical Complexity Analyzer

Xiaofei Lu


Lexical Complexity Analyzer is designed to automate lexical complexity analysis of English texts using 25 different measures of lexical density, variation, and sophistication proposed in the first and second language development literature. The analyzer is an implementation of the system described in Lu (2012).

The analyzer is implemented in Python and runs on UNIX-like systems (Linux, Mac OS, or UNIX). It takes as input an English text that has been part-of-speech (POS) tagged and lemmatized. POS tagging can be done with any POS tagger that adopts the Penn Treebank POS tagset, and the input file should be organized in the "lemma_tag" format (a vertical format with one "lemma_tag" sequence per line is fine as well). Depending on the spelling of the English text (British or American), use the script that calls the BNC (British National Corpus) or the ANC (American National Corpus) wordlist, respectively.

The first line of the output is a comma-delimited list of 35 field names: 1) a filename field; 2) nine fields recording counts of sentences, word types, sophisticated word types, lexical word types, sophisticated lexical word types, word tokens, sophisticated word tokens, lexical word tokens, and sophisticated lexical word tokens; and 3) 25 fields for the 25 indices (see Lu, 2012). Each subsequent line summarizes the results for one input file, with a comma-delimited list of 35 values corresponding to the 35 field names. The output file can be loaded into Excel or SPSS for further statistical analysis.
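To illustrate the "lemma_tag" input format, the following minimal sketch (not the analyzer itself) parses whitespace-separated "lemma_tag" pairs and derives a few of the basic counts named above. The sample text, the helper names, and the simplified rule treating Penn Treebank noun, verb, adjective, and adverb tags as lexical words are assumptions for illustration only; see Lu (2012) for the definitions the analyzer actually uses.

```python
def parse_lemma_tag(text):
    """Split a text of whitespace-separated "lemma_tag" pairs
    into (lemma, tag) tuples. Works for both the horizontal and
    the vertical (one pair per line) layouts, since both are
    whitespace-delimited."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so lemmas containing "_" survive.
        lemma, _, tag = token.rpartition("_")
        if lemma:  # skip malformed tokens with no underscore
            pairs.append((lemma.lower(), tag))
    return pairs

# Simplified assumption: Penn Treebank tags beginning with these
# prefixes mark lexical (content) words: nouns, adjectives,
# adverbs, verbs.
LEXICAL_PREFIXES = ("NN", "JJ", "RB", "VB")

def basic_counts(pairs):
    """Compute a few of the counts the analyzer reports."""
    # Drop punctuation: Penn Treebank punctuation tags are non-alphabetic.
    tokens = [p for p in pairs if p[1][:1].isalpha()]
    lexical = [p for p in tokens if p[1].startswith(LEXICAL_PREFIXES)]
    n = len(tokens)
    return {
        "word_tokens": n,
        "word_types": len({lemma for lemma, _ in tokens}),
        "lexical_tokens": len(lexical),
        "lexical_density": len(lexical) / n if n else 0.0,
    }

# Hypothetical sample in "lemma_tag" format (already lemmatized).
sample = "the_DT cat_NN sit_VBD on_IN the_DT mat_NN ._."
print(basic_counts(parse_lemma_tag(sample)))
# → {'word_tokens': 6, 'word_types': 5, 'lexical_tokens': 3,
#    'lexical_density': 0.5}
```

Since each output line is a plain comma-delimited record under a header row, the real analyzer's output can likewise be read back with any CSV reader for statistical analysis.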


Suggested taggers and lemmatizers