Lexical Complexity Analyzer is designed to automate lexical complexity
analysis of English texts using 25 different measures of lexical
density, variation and sophistication proposed in the
first and second language development literature. This
analyzer is an implementation of the system described in:

The analyzer is implemented in Python and runs on UNIX-like systems (Linux, macOS, or UNIX).

The analyzer takes as input an English text that has been part-of-speech (POS) tagged and lemmatized. POS tagging can be done with any POS tagger that adopts the Penn Treebank POS tagset. The input file should be organized in the "lemma_tag" format; a vertical format with one "lemma_tag" sequence per line is fine as well. Depending on the spelling of the English text (British or American), use the script that calls the BNC (British National Corpus) or the ANC (American National Corpus) wordlist, respectively.

The first line of the output is a comma-delimited list of 34 field names: 1) a filename field, 2) eight fields for recording counts of word types, sophisticated word types, lexical word types, sophisticated lexical word types, word tokens, sophisticated word tokens, lexical word tokens, and sophisticated lexical word tokens, and 3) 25 fields for the 25 indices (see Lu, 2012). Each subsequent line summarizes the results for one input file as a comma-delimited list of 34 values that correspond to the 34 field names. The output file can be loaded into Excel or SPSS for further statistical analysis.
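
To illustrate the "lemma_tag" input format described above, here is a minimal Python sketch that splits such text into (lemma, tag) pairs. It is not taken from the analyzer itself: the function name parse_lemma_tag is hypothetical, and splitting on the last underscore is an assumption made here so that a lemma containing "_" stays intact. Because it splits on any whitespace, it accepts both the horizontal format and the vertical one-per-line format.

```python
def parse_lemma_tag(text):
    """Split whitespace-separated "lemma_tag" tokens into (lemma, tag) pairs.

    Hypothetical helper for illustration; not part of the analyzer.
    """
    pairs = []
    for token in text.split():
        # Assumption: the tag follows the LAST underscore in the token.
        lemma, sep, tag = token.rpartition("_")
        if not sep or not lemma or not tag:
            raise ValueError(f"not in lemma_tag format: {token!r}")
        pairs.append((lemma, tag))
    return pairs


if __name__ == "__main__":
    # Horizontal format: tokens separated by spaces on one line.
    print(parse_lemma_tag("the_DT cat_NN sit_VBD on_IN the_DT mat_NN ._."))
    # Vertical format: one "lemma_tag" sequence per line.
    print(parse_lemma_tag("the_DT\ncat_NN\nsit_VBD\n"))
```

A check like this can be useful before running the analyzer, since a token without an underscore usually means the file was tagged but not converted to the expected format.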