James Tauber : GNT Verse Coverage Statistics

It is fairly common, in the context of learning vocabulary for a particular corpus like the Greek New Testament, to talk about what proportion of the text one could read if one learnt the top N words.

I even produced such a table for the GNT back in 1996—see New Testament Vocabulary Count Statistics (via Internet Archive's Wayback Machine).

But these sorts of numbers are highly misleading because they tell you only what proportion of words you could read, not what proportion of sentences (or, as a rough proxy in the GNT case, verses).

Reading theorists have suggested that you need to know 95% of the vocabulary of a sentence to comprehend it. So a more interesting set of statistics would be how many verses one can understand 95% of the vocab of if one knows a certain number of words. Of course, there's a lot more to reading comprehension than knowing the vocab. But it was enough for me to decide to write some code yesterday afternoon to run against my MorphGNT database.
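A minimal sketch of this measure in Python, assuming each verse is simply a list of lexemes. The toy corpus, the data layout and the names `lexeme_ranks` and `verse_coverage` are made up for illustration and are not taken from MorphGNT:

```python
from collections import Counter

# Toy corpus: each verse is a list of lexemes (placeholder letters, not real GNT data).
VERSES = [
    ["a", "b", "a", "c"],
    ["a", "b", "d"],
    ["a", "e"],
]

def lexeme_ranks(verses):
    """Map each lexeme to its frequency rank (1 = most frequent)."""
    counts = Counter(lexeme for verse in verses for lexeme in verse)
    # Most frequent first; ties broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda lexeme: (-counts[lexeme], lexeme))
    return {lexeme: rank for rank, lexeme in enumerate(ordered, start=1)}

def verse_coverage(verse, ranks, top_n):
    """Fraction of a verse's tokens whose lexeme is among the top_n most frequent."""
    known = sum(1 for lexeme in verse if ranks[lexeme] <= top_n)
    return known / len(verse)

ranks = lexeme_ranks(VERSES)
for verse in VERSES:
    print(verse, verse_coverage(verse, ranks, top_n=2))
```

Note that coverage is counted per token, so a common lexeme that appears twice in a verse counts twice towards that verse's percentage.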

To give you a flavour in the specific before moving to the final numbers, consider John 3.16, which is, from a vocabulary point of view, a very easy verse to read.

To be able to read 50% of it, you only need to know the top 28 lexemes in the GNT. To read 75% you only need the top 85 (up to κόσμος). With the top 204 lexemes, you can read 90% of the verse, and only a few more, up to 236 (αἰώνιος), gives you 95%. The only word you would not have come across learning the top 236 lexemes would be μονογενής, but even that is in the top 1,200.
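The per-verse calculation can be sketched as follows: sort the frequency ranks of the verse's tokens and read off the rank of the token that takes you over each threshold. The function name `min_vocab_for` and the ranks below are invented for illustration; they are not the real ranks for John 3.16.

```python
def min_vocab_for(token_ranks, percent):
    """Smallest N such that the top N lexemes cover at least percent% of the tokens."""
    ordered = sorted(token_ranks)
    # Number of tokens that must be known, rounded up (integer ceiling division).
    needed = max(1, -(-percent * len(ordered) // 100))
    # The needed-th easiest token sets the vocabulary size required.
    return ordered[needed - 1]

# Made-up frequency ranks for the ten tokens of an imaginary verse (not John 3.16).
example_ranks = [1, 1, 2, 3, 5, 8, 40, 90, 200, 1100]

for percent in (50, 75, 90, 95, 100):
    print(percent, min_vocab_for(example_ranks, percent))
```

With only ten tokens, 95% rounds up to all ten, so one rare word pushes the requirement to 1,100 in this imaginary verse. A longer verse can absorb a single unknown word and still clear 95%.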

This example does highlight some of the shortcomings of this sort of analysis. There's no consideration of necessary knowledge of morphology, syntax, idioms, etc. Nor does it account for the fact that the meaning of something like μονογενής is fairly easy to guess from knowledge of more common words. But I still think it's much more useful than the pure word coverage statistics I linked to earlier.

So let's actually run the numbers on the complete GNT. If you know the top N words, what proportion of verses could you understand 50%, 75%, 90% or 95% of?

vocab / coverage	any	50%	75%	90%	95%	100%
100	99.9%	91.3%	24.4%	2.1%	0.6%	0.4%
200	99.9%	96.9%	51.8%	9.8%	3.4%	2.5%
500	99.9%	99.1%	82.3%	36.5%	18.0%	13.9%
1,000	100.0%	99.7%	93.6%	62.3%	37.3%	30.1%
1,500	100.0%	99.8%	97.2%	76.3%	53.5%	44.8%
2,000	100.0%	99.9%	98.4%	85.1%	65.5%	56.5%
3,000	100.0%	100.0%	99.4%	93.6%	81.0%	74.1%
4,000	100.0%	100.0%	99.7%	97.4%	90.0%	85.5%
5,000	100.0%	100.0%	100.0%	99.4%	96.5%	94.5%
all	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%

What this means is that, purely from a vocabulary point of view, if you knew the top 1,000 lexemes, then 37.3% of verses in the GNT would be 95% familiar to you.
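A table of this shape could be produced along the following lines. This is a sketch over a toy corpus, assuming verses are lists of lexemes; the names `lexeme_ranks` and `coverage_table` are hypothetical and the "any" column is left out (it would correspond to testing for at least one known token).

```python
from collections import Counter

# Toy corpus: each verse is a list of lexemes (placeholder letters, not real GNT data).
VERSES = [
    ["a", "b", "a", "c"],
    ["a", "b", "d"],
    ["a", "e"],
]

def lexeme_ranks(verses):
    """Map each lexeme to its frequency rank (1 = most frequent)."""
    counts = Counter(lexeme for verse in verses for lexeme in verse)
    ordered = sorted(counts, key=lambda lexeme: (-counts[lexeme], lexeme))
    return {lexeme: rank for rank, lexeme in enumerate(ordered, start=1)}

def coverage_table(verses, vocab_sizes, percents):
    """For each vocab size, the percentage of verses meeting each coverage threshold."""
    ranks = lexeme_ranks(verses)
    table = {}
    for top_n in vocab_sizes:
        row = []
        for percent in percents:
            readable = 0
            for verse in verses:
                known = sum(1 for lexeme in verse if ranks[lexeme] <= top_n)
                # Integer comparison avoids floating-point rounding at the threshold.
                if 100 * known >= percent * len(verse):
                    readable += 1
            row.append(100 * readable / len(verses))
        table[top_n] = row
    return table

table = coverage_table(VERSES, vocab_sizes=(1, 2, 5), percents=(50, 75, 100))
for top_n, row in table.items():
    print(top_n, ["%.1f%%" % value for value in row])
```

Each row is one vocabulary size and each column one coverage threshold, matching the layout of the table above.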

I should emphasise that learning vocabulary in frequency order isn't necessarily the fastest way to get this proportion of readable verses up. I blogged about this three years ago; see Programmed Vocabulary Learning as a Travelling Salesman Problem, for example.
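A contrived toy example makes the point concrete. With the made-up corpus below, the two most frequent lexemes leave no verse fully readable, while a different pair unlocks two verses. The brute-force search over pairs is only a demonstration on tiny data; it is not the approach from the linked post, and the name `fully_readable` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (placeholder letters, not real GNT data).
VERSES = [["x", "y"], ["x", "y"], ["a", "b"], ["a", "c"], ["a", "d"]]

def fully_readable(verses, known):
    """Number of verses in which every lexeme is known."""
    return sum(1 for verse in verses if all(lexeme in known for lexeme in verse))

counts = Counter(lexeme for verse in VERSES for lexeme in verse)
# Frequency order, ties broken alphabetically.
by_frequency = sorted(counts, key=lambda lexeme: (-counts[lexeme], lexeme))
top_two = set(by_frequency[:2])

# Try every pair of lexemes and keep the one that unlocks the most verses.
best_pair = max(
    combinations(sorted(counts), 2),
    key=lambda pair: fully_readable(VERSES, set(pair)),
)

print("frequency order:", sorted(top_two), fully_readable(VERSES, top_two))
print("best pair:", best_pair, fully_readable(VERSES, set(best_pair)))
```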