Wikipedia's dictionary (Wiktionary) has word frequency lists calculated from Project Gutenberg texts. Taking a random sample:
"language that's House los individual South mon meant food wide now formed"
This reveals several problems. There is no lemmatization ("that's", "formed"). "mon" is either an abbreviation for "Monday" or a mistake (e.g. a typo or a word split at a line break). "los" is either Spanish, perhaps from "Los Angeles", or a mistake. Capitalized and uncapitalized forms are apparently counted separately.
To calculate a realistic type/token ratio, one that shows how many unique words the reader was actually exposed to, you probably have to go even further and count word families.
(Note: Chinese makes this a lot easier. It is an analytic language, and the breaks between characters largely coincide with morpheme boundaries, so defining exactly what is to be counted is much simpler.)
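To make the effect on the type/token ratio concrete, here is a minimal Python sketch. It is not how the frequency lists were actually built. The tokenizer, the sample sentence, and especially `crude_family` (a hypothetical suffix stripper standing in for a real lemmatizer or word-family list) are assumptions made purely for illustration.

```python
import re


def tokenize(text):
    # Letters with optional internal apostrophes, so "that's" is one token.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def type_token_ratio(tokens, key=lambda t: t):
    # Types are the distinct values of key(token); tokens are all occurrences.
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len({key(t) for t in tokens}) / len(tokens)


def crude_family(token):
    # Hypothetical stand-in for real word-family grouping: lowercase the
    # token and strip one common suffix. Far too naive for real texts.
    t = token.lower()
    for suffix in ("'s", "ed", "s"):
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: -len(suffix)]
    return t


sample = "House house houses formed form forms that's that"
tokens = tokenize(sample)

raw_ttr = type_token_ratio(tokens)                  # every surface form is its own type
folded_ttr = type_token_ratio(tokens, str.lower)    # "House" and "house" merged
family_ttr = type_token_ratio(tokens, crude_family) # inflections merged too

print(raw_ttr, folded_ttr, family_ttr)  # 1.0 0.875 0.375
```

On this toy sample the ratio drops from 1.0 to 0.375 once case and inflection are merged, which is the kind of gap the uncorrected counts would hide.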
Friday, January 27, 2006