James Tauber : Current MorphGNT Work

For the last few months, I've been making corrections to MorphGNT by attempting to merge an English translation (NASB) marked with Strong's numbers with my database. Although it's a tedious process, it's revealing numerous errors.

When James Strong compiled his concordance, he assigned a number to every lemma in the underlying Greek text of the King James Version. Other translations are often made available annotated with these Strong's numbers. Zack Hubert provided me with an electronic text of the NASB translation with Strong's numbers, which I converted to something like this:

010101 record 976
010101 genealogy 1078
010101 Jesus 2424
010101 Messiah 5547
010101 son 5207
010101 son 5207
010101 Abraham 11

The first column is the book, chapter and verse, the second column is the English word as it appears in the NASB translation and the third column is the Strong's number. Note that not all words are included.
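As a rough illustration of that three-column format, here is a minimal Python sketch of how one such record might be parsed. The names `NasbRecord` and `parse_nasb_line` are hypothetical, and the split of the reference into two digits each for book, chapter and verse is an assumption based on the six-digit sample above.

```python
from typing import NamedTuple


class NasbRecord(NamedTuple):
    """One line of the NASB-Strong file (hypothetical structure)."""
    book: int
    chapter: int
    verse: int
    word: str
    strongs: int


def parse_nasb_line(line: str) -> NasbRecord:
    # First field is the reference, last field is the Strong's number;
    # whatever lies between is treated as the English word.
    parts = line.split()
    ref, number = parts[0], parts[-1]
    word = " ".join(parts[1:-1])
    # Assumption: the reference is BBCCVV, two digits per component.
    return NasbRecord(int(ref[0:2]), int(ref[2:4]), int(ref[4:6]),
                      word, int(number))


sample = """\
010101 record 976
010101 genealogy 1078
010101 Abraham 11
"""
records = [parse_nasb_line(l) for l in sample.splitlines() if l.strip()]
```

Taking the number from the last field rather than the third is a defensive choice in case an English entry ever contains a space; the sample data does not show whether that happens.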

I then found an electronic text of Strong's lexicon and stripped out the formatting and the definitions to just get a list of Strong's numbers with a transliteration of the Greek lemma:

1 a
2 Aaron
3 Abaddon
4 abares
5 Abba
6 Abel
7 Abia
8 Abiathar
9 Abilene
10 Abioud

Finally I took my MorphGNT database and extracted the lemmata:

010101 βίβλος
010101 γένεσις
010101 Ἰησοῦς
010101 Χριστός
010101 υἱός
010101 Δαυίδ
010101 υἱός
010101 Ἀβραάμ

I then wrote a Python program that attempts to merge the first and third files on the basis of the second. The transliterations in Strong's lexicon have no accents and are ambiguous (both epsilon and eta become 'e'), but that part of the join is fairly straightforward because the script can handle it automatically.
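One way to automate that comparison is to reduce each Greek lemma to the same lossy, accent-free form as the lexicon and compare the results. The sketch below is not the actual program: the `LATIN` table and the function names are assumptions for illustration, and the real lexicon file may use a different scheme for some letters. Mapping omega to 'o' alongside omicron is likewise an assumption.

```python
import unicodedata

# Hypothetical transliteration table. Epsilon and eta both map to "e",
# which reproduces the ambiguity described above.
LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def transliterate(lemma: str) -> str:
    """Strip accents and breathings, then map Greek letters to Latin."""
    decomposed = unicodedata.normalize("NFD", lemma)
    # Combining marks (accents, breathings) have a non-zero combining class.
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(LATIN.get(c, c) for c in bare.lower())


def matches(lemma: str, lexicon_form: str) -> bool:
    """Compare a Greek lemma with a lexicon transliteration, ignoring case."""
    return transliterate(lemma) == lexicon_form.lower()


print(matches("Ἀβραάμ", "Abraam"))
print(transliterate("βίβλος"))
```

Because this sketch discards breathings entirely, a lemma such as υἱός comes out as "uios"; if the lexicon file writes rough breathing as a leading "h", that case would need extra handling.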

The real challenge comes because:

NASB versification isn't the same as that of the MorphGNT Greek text
the text underlying the NASB is not the same critical text as that of MorphGNT
there are errors in each of the files
there are spelling differences
there are differences in the granularity of the lemmata

So my program simply reports each place where it had trouble performing a match, and for each one I have to do one of the following:

correct my MorphGNT lemma
correct (or merely change to my lemma conventions) the Strong's lexicon file
correct the NASB-Strong file
change the verse numbering in the NASB-Strong file
comment out a particular word that appears in the text underlying the NASB but not the MorphGNT text
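The kind of report that drives this manual work can be sketched as a per-verse comparison. This is only a guess at the general shape, not the program itself: `find_exceptions` is a hypothetical name, the sample data is a made-up fragment, and the lemmata are assumed to have already been reduced to the same accent-free form as the lexicon.

```python
from collections import Counter, defaultdict


def find_exceptions(nasb, lexicon, morph):
    """Return NASB entries that cannot be paired with a lemma in the same verse.

    nasb:    iterable of (verse, english_word, strongs_number)
    lexicon: dict mapping strongs_number -> transliterated lemma
    morph:   iterable of (verse, transliterated_lemma)
    """
    # Count the lemmata available in each verse so repeats are handled.
    available = defaultdict(Counter)
    for verse, key in morph:
        available[verse][key.lower()] += 1

    exceptions = []
    for verse, word, number in nasb:
        key = lexicon.get(number)
        if key is None:
            exceptions.append((verse, word, number, "unknown Strong's number"))
            continue
        key = key.lower()
        if available[verse][key] > 0:
            available[verse][key] -= 1  # consume one occurrence
        else:
            exceptions.append((verse, word, number, "no matching lemma in verse"))
    return exceptions


# Illustrative fragment only: the MorphGNT side deliberately lacks "abraam".
lexicon = {976: "biblos", 11: "Abraam"}
nasb = [("010101", "record", 976), ("010101", "Abraham", 11)]
morph = [("010101", "biblos")]

for item in find_exceptions(nasb, lexicon, morph):
    print(item)
```

Matching within a verse as a multiset, rather than by position, reflects the fact that English word order differs from the Greek and that not every word carries a number. Each reported tuple would then correspond to one of the manual actions listed above.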

There were initially thousands of exceptions, each requiring one of these actions. After a number of months, I now have one thousand left. It takes me about 4 hours to make 100 corrections, so at that rate I still have roughly 40 hours of work to go.

When I'm done, I'll release a new version of MorphGNT that corrects the lemma errors this task has revealed.