Lexical Complexity Analyzer

Xiaofei Lu


Lexical Complexity Analyzer is designed to automate lexical complexity analysis of English texts using 25 different measures of lexical density, variation and sophistication proposed in the first and second language development literature. This analyzer is an implementation of the system described in:

You might also be interested in the following papers, which used or discussed LCA:

The analyzer is implemented in Python and runs on UNIX-like systems (Linux, macOS, or UNIX). It takes as input an English text that has been part-of-speech (POS) tagged and lemmatized. POS tagging can be done with any POS tagger that adopts the Penn Treebank POS tagset, and the input file should be organized in the "lemma_tag" format (a vertical format with one "lemma_tag" sequence per line is fine as well). Depending on the spelling of the English text (British or American), use the script that calls the BNC (British National Corpus) or the ANC (American National Corpus) wordlist, respectively.

The first line of the output is a comma-delimited list of 34 field names, including 1) a filename field, 2) eight fields for recording counts of word types, sophisticated word types, lexical word types, sophisticated lexical word types, word tokens, sophisticated word tokens, lexical word tokens, and sophisticated lexical word tokens, and 3) 25 fields for the 25 indices (see Lu, 2012). Each subsequent line summarizes the results for one input file as a comma-delimited list of 34 values that correspond to the 34 field names. The output file can be loaded into Excel or SPSS for further statistical analysis.
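
To illustrate the "lemma_tag" input format described above, the following minimal Python sketch reads such text and computes a few basic counts. It is not the analyzer's own code. The function names, the dictionary keys, and the rule that treats Penn Treebank noun, verb, adjective, and adverb tags as lexical words are all assumptions made for illustration; the analyzer's actual definitions (see Lu, 2012) may differ.

```python
# Illustrative sketch only; not part of the analyzer's own scripts.

# Assumption: lexical words are approximated by Penn Treebank tag prefixes
# for nouns (NN*), verbs (VB*), adjectives (JJ*), and adverbs (RB*).
LEXICAL_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")


def parse_lemma_tag(text):
    """Split whitespace-separated "lemma_tag" items into (lemma, tag) pairs.

    Splitting on any whitespace means both the horizontal format and the
    vertical (one item per line) format are handled the same way.
    """
    pairs = []
    for item in text.split():
        # Split on the last underscore so lemmas containing "_" stay intact.
        lemma, sep, tag = item.rpartition("_")
        if sep and lemma and tag:
            pairs.append((lemma.lower(), tag))
    return pairs


def basic_counts(pairs):
    """Count word and lexical-word tokens and types, skipping punctuation."""
    # Tags that do not start with a letter (".", ",", "-LRB-", ...) are
    # treated as punctuation or symbols and left out of the counts.
    words = [(lemma, tag) for lemma, tag in pairs if tag[0].isalpha()]
    lexical = [lemma for lemma, tag in words
               if tag.startswith(LEXICAL_TAG_PREFIXES)]
    return {
        "word_tokens": len(words),
        "word_types": len({lemma for lemma, _ in words}),
        "lexical_tokens": len(lexical),
        "lexical_types": len(set(lexical)),
    }


sample = "the_DT cat_NN chase_VBD the_DT mouse_NN ._."
print(basic_counts(parse_lemma_tag(sample)))
```

For the sample line, the sketch counts five word tokens, four word types, and three lexical word tokens and types. Counting sophisticated words would additionally require the BNC or ANC wordlist, which is not reproduced here.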


Suggested taggers and lemmatizers