How can I measure how simple a text is? One way is to count unique words. Simple metrics like the Flesch readability formula only provide a very rough rule of thumb. What about comparing a text with similar texts that you already know are simple?
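As a rough illustration of the unique-word idea, here is a minimal sketch that counts tokens and distinct words and reports their ratio (the type-token ratio). The tokenizer and the names `tokenize` and `unique_word_stats` are my own assumptions for illustration, not part of any particular tool.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def unique_word_stats(text):
    """Count running words (tokens), distinct words (types), and their ratio."""
    tokens = tokenize(text)
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ttr": ratio}

# Toy sample text, made up for this example.
sample = "The cat sat on the mat. The cat was happy."
stats = unique_word_stats(sample)
print(stats)
```

The ratio falls as a text gets longer, so it only makes sense to compare samples of about the same length.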
Texts from graded readers like the Oxford Bookworms series provide a nice baseline for comparison, but they are copyrighted. Maybe articles in the Simple English Wikipedia could be used, although when I took a look there weren't many articles yet. Some people were also writing their articles in Ogden's Basic English, which sometimes distorts the English language and is not a good idea.
The vocab profiler can be used to do the comparison. Start with a corpus of simplified texts and compare the profile of these simplified texts with the profile of authentic texts from newspapers.
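The comparison could look something like the sketch below. It is not the vocab profiler itself. It assumes a profile is simply the share of a text's words that fall into each band of a word list, with everything else counted as "off-list". The bands, the sample sentences, and the function names `profile` and `compare` are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def profile(text, bands):
    """Return the share of tokens in each band, plus an 'off-list' share."""
    tokens = tokenize(text)
    counts = Counter()
    for tok in tokens:
        for name, words in bands.items():
            if tok in words:
                counts[name] += 1
                break
        else:
            # The token was not found in any band.
            counts["off-list"] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {name: counts[name] / total for name in list(bands) + ["off-list"]}

def compare(profile_a, profile_b):
    """Per-band difference in token share (a minus b)."""
    return {name: profile_a[name] - profile_b[name] for name in profile_a}

# Hypothetical toy bands; a real word list would have thousands of entries.
bands = {
    "band1": {"the", "a", "cat", "sat", "on", "was", "is"},
    "band2": {"mat", "happy"},
}
simplified = "The cat sat on the mat."
authentic = "The feline reclined on the ornate mat."

p_simple = profile(simplified, bands)
p_auth = profile(authentic, bands)
diff = compare(p_simple, p_auth)
print(diff)
```

A simplified text would be expected to show a larger share in the first band and a smaller off-list share than an authentic one, though how large that gap is in practice is exactly what the comparison has to find out.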
Anyway, simplified vs. authentic texts is a very murky area. What is simplified? Don't you lose information with simplified texts? Next, I have to create profiles for some simplified texts and compare them with the authentic text profiles I already have.