For the last few months, I've been making corrections to MorphGNT by attempting to merge an English translation (NASB) marked with Strong's numbers with my database. Although it's a tedious process, it's revealing numerous errors.
When James Strong compiled his concordance, he assigned a number to every lemma in the underlying Greek text of the King James Version. Other translations are often made available annotated with these Strong's numbers. Zack Hubert provided me with an electronic text of the NASB translation with Strong's numbers, which I converted to a file that looks like this:
    010101 record 976
    010101 genealogy 1078
    010101 Jesus 2424
    010101 Messiah 5547
    010101 son 5207
    010101 son 5207
    010101 Abraham 11
The first column is the book, chapter and verse, the second column is the English word as it appears in the NASB translation and the third column is the Strong's number. Note that not all words are included.
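As a rough illustration, records in this format could be read in Python along the following lines. This is only a sketch: it assumes one whitespace-separated record per line, a single-token English word, and a six-digit reference made up of two digits each for book, chapter and verse. `Record` and `parse_record` are hypothetical names chosen for the example.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One line of the converted NASB file (hypothetical type)."""
    book: int
    chapter: int
    verse: int
    word: str
    strongs: int


def parse_record(line: str) -> Record:
    # Assumption: exactly three whitespace-separated fields per line.
    ref, word, number = line.split()
    # Assumption: the reference is BBCCVV, two digits per component.
    return Record(int(ref[0:2]), int(ref[2:4]), int(ref[4:6]), word, int(number))


sample = """\
010101 record 976
010101 genealogy 1078
010101 Jesus 2424
"""

records = [parse_record(line) for line in sample.splitlines() if line.strip()]
```

Keeping the Strong's number as an integer makes it easy to use as a lookup key into the lexicon list described next.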
I then found an electronic text of Strong's lexicon and stripped out the formatting and the definitions to just get a list of Strong's numbers with a transliteration of the Greek lemma:
    1 a
    2 Aaron
    3 Abaddon
    4 abares
    5 Abba
    6 Abel
    7 Abia
    8 Abiathar
    9 Abilene
    10 Abioud
Finally I took my MorphGNT database and extracted the lemmata:
    010101 βίβλος
    010101 γένεσις
    010101 Ἰησοῦς
    010101 Χριστός
    010101 υἱός
    010101 Δαυίδ
    010101 υἱός
    010101 Ἀβραάμ
I then wrote a Python program that attempts to merge the first and third files on the basis of the second. The transliterations in Strong's lexicon have no accents and are ambiguous (both epsilon and eta go to 'e'), but that part of the join is fairly straightforward because the script can handle it automatically.
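The automated part of such a join might look something like the sketch below. It is not the actual program: the transliteration table, the rule of prefixing "h" when a rough breathing is present, and the verse-level matching in `unmatched` are all assumptions made for illustration, and the function names are hypothetical. The idea is to reduce each Greek lemma to the same lossy, accent-free form that the lexicon list uses and then compare.

```python
import unicodedata
from collections import defaultdict

ROUGH_BREATHING = "\u0314"  # combining mark left behind by NFD decomposition

# Assumed transliteration scheme; note that eta and epsilon both become 'e'
# and omega and omicron both become 'o', so the mapping is deliberately lossy.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "e",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "ph", "χ": "ch", "ψ": "ps", "ω": "o",
}


def transliterate(lemma):
    """Reduce a Greek lemma to a lowercase, accent-free Latin form."""
    decomposed = unicodedata.normalize("NFD", lemma.lower())
    prefix = "h" if ROUGH_BREATHING in decomposed else ""
    letters = (ch for ch in decomposed if not unicodedata.combining(ch))
    return prefix + "".join(GREEK_TO_LATIN.get(ch, ch) for ch in letters)


def unmatched(nasb_rows, strongs, lemma_rows):
    """Return NASB rows whose Strong's lemma is not found in the same verse."""
    lemmata_by_verse = defaultdict(set)
    for ref, lemma in lemma_rows:
        lemmata_by_verse[ref].add(transliterate(lemma))
    problems = []
    for ref, word, number in nasb_rows:
        if strongs[number].lower() not in lemmata_by_verse[ref]:
            problems.append((ref, word, number))
    return problems


# Illustrative data only; the lexicon entries here are supplied for the example.
strongs = {976: "biblos", 2424: "Iesous", 5207: "huios"}
nasb_rows = [("010101", "record", 976), ("010101", "Jesus", 2424),
             ("010101", "son", 5207)]
lemma_rows = [("010101", "βίβλος"), ("010101", "Ἰησοῦς")]  # υἱός left out on purpose

print(unmatched(nasb_rows, strongs, lemma_rows))
```

With υἱός deliberately omitted from the lemma rows, the "son" record is reported as an exception, which is the kind of case that would then need manual review.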
The real challenge comes because:
So my program simply indicates whenever it had trouble performing a match and I have to either:
There were initially thousands of exceptions that each required one of these actions. After a number of months, I now have one thousand left. At about 4 hours per 100 corrections, that is roughly 40 more hours of work, so I still have a little way to go.
When I'm done, I'll release a new version of MorphGNT that corrects the lemma errors this task has revealed.