Friday, January 27, 2006

Wiktionary word frequency lists

Wiktionary, Wikipedia's sister dictionary project, has word frequency lists calculated from Project Gutenberg texts. Take a random sample:

"language that's House los individual South mon meant food wide now formed"

This sample reveals several problems. There is no lemmatization ("that's", "formed"). "mon" is either an abbreviation for "Monday" or a mistake (e.g. a typo or a word split at a line break). "los" is either Spanish, perhaps from "Los Angeles", or a mistake. Capitalized and uncapitalized forms are apparently counted separately ("House", "South").
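To make the capitalization problem concrete, here is a minimal sketch in Python. It is not how the Wiktionary lists were actually built; the function name `count_tokens` and the toy input are my own assumptions for illustration. It only shows how counting case variants separately inflates the number of distinct entries.

```python
from collections import Counter

def count_tokens(text, fold_case=True):
    """Count whitespace-separated tokens (hypothetical helper).

    With fold_case=True, "House" and "house" are merged into one entry.
    """
    tokens = text.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)

# Toy input, not taken from the real frequency lists.
toy_text = "House house South south"

raw = count_tokens(toy_text, fold_case=False)    # four distinct entries
folded = count_tokens(toy_text, fold_case=True)  # two distinct entries
```

Folding case is itself a trade-off: it would also merge a proper noun like "South" with the ordinary word "south", which may or may not be what you want.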

To calculate a realistic type/token ratio, one that shows how many unique words the reader was actually exposed to, you probably have to go even further than lemmatization and count word families.
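As a rough sketch of what counting word families does to the ratio, the Python below computes types divided by tokens, optionally mapping each token to a family head first. The function `type_token_ratio` and the `TOY_FAMILIES` mapping are hypothetical and hand-made for this example; a real word-family list would have to come from somewhere else.

```python
def type_token_ratio(tokens, family_of=None):
    """Return distinct types divided by total tokens.

    family_of is an optional dict mapping a token to its word-family head
    (an assumption of this sketch); unmapped tokens count as themselves.
    """
    if not tokens:
        return 0.0
    family_of = family_of or {}
    types = {family_of.get(t, t) for t in tokens}
    return len(types) / len(tokens)

# Hand-made toy family mapping, for illustration only.
TOY_FAMILIES = {
    "formed": "form",
    "forms": "form",
    "forming": "form",
    "formation": "form",
}

tokens = "form formed forms forming formation food".split()

surface_ratio = type_token_ratio(tokens)               # every surface form is its own type
family_ratio = type_token_ratio(tokens, TOY_FAMILIES)  # "form" family collapses to one type
```

On this toy input the surface forms give six types for six tokens, while the family count gives only two types ("form" and "food"), so the same text looks far less lexically varied once families are merged.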

(Note: Chinese makes this a lot easier. It is an analytic language, and character boundaries largely coincide with morpheme boundaries, so it is much clearer exactly what is to be counted.)
