On 10/13/2012 05:23 PM, Chris Little wrote:
On 10/13/2012 6:12 AM, Daniel Owens wrote:
Thanks, Chris. I had not thought of the latter solution, but that is
what we need. This raises a fundamental question: how will front-ends
find the right lexical entry?

Currently, according to my understanding, a conf file may include
Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
following value also be allowed: Feature=AramaicDef. Then front-ends
will be able to find entries in the correct language.

HebrewDef indicates that a lexicon module is indexed by Strong's numbers. Everything you've said so far indicates to me that you aren't using Strong's numbers at all, so do not use Feature=HebrewDef. Also, there should not ever be a Feature=AramaicDef since Aramaic Strong's numbers are not distinguished from Hebrew.

Yes, I am not using Strong's numbers at all. I am hoping to help SWORD move away from its dependence upon Strong's, both the module and the numbers. It never occurred to me when someone told me to use Feature=HebrewDef that it was reserved only for Strong's numbers. But if that is what it does, then I understand why my suggestion to add AramaicDef should be discarded. No problem, though in my defense the nomenclature is misleading (perhaps it should be called StrongsHebrewDef?).

I think it would probably be helpful if you could enumerate the set of modules you propose to create:

a Bible (just one? more than one?)
a lexicon? separate Hebrew & Aramaic lexica?
a morphology database? separate Hebrew & Aramaic databases?

I am trying to ensure that there are respectable free or low-cost options for study of the Bible in Greek, Hebrew, and Aramaic. I am trying to envision the big picture, some of which is already filled in, and then work toward filling in the rest. In the end I would like to see the following modules:

For Greek:
- Bible Texts: MorphGNT (Greek lemma, not Strong's numbers); other future texts with Greek lemmas; other current and future texts with Strong's numbers (Tischendorf, WH, KJV, etc.)
- Lexica: Strong's Greek; Abbott-Smith (Greek lemma)

For Hebrew:
- Bible Texts: WHM (Hebrew lemma); OSMHB (currently has Strong's numbers, but eventually I hope will have some other more up-to-date lemmatization)
- Lexica: Strong's Hebrew; BDB Hebrew (Hebrew lemma); BDB Aramaic (Aramaic lemma)

My guess is that you are advocating a Feature value that indicates "this lexicon module contains words in language X, indexed by lemma/word". I would absolutely be supportive of adding this, but we currently have nothing comparable in use. I would advocate (Greek|Hebrew|Aramaic|...)WordDef for the value.

That makes sense to me. That's what I thought I was advocating. :) Just to make sure we are communicating, though, you mean Feature=GreekWordDef, etc., right?
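
For concreteness, I imagine a lemma-keyed lexicon's .conf would then look something like this (the module name, path, and other values here are only illustrative; the Feature line is the point, and GreekWordDef is of course still only a proposal):

[AbbottSmith]
DataPath=./modules/lexdict/zld/abbottsmith/
ModDrv=zLD
Lang=grc
Description=Abbott-Smith, A Manual Greek Lexicon of the New Testament
Feature=GreekWordDef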

But lemmatization can vary somewhat in its details within a language. How could we include mappings between lemmatizations, so that a text using Strong's numbers could look up words in a lexicon keyed to Greek, Hebrew, or Aramaic lemmas, and vice versa? Perhaps a simple mapping format could be the following:

The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.

Front-ends could use these mappings to find the correct lexical entry. A lookup from the KJV could then find the relevant entry in AbbottSmith, and with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from MorphGNT could find the correct entry in Strongs, if that is the default Greek lexicon for the front-end.
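
To illustrate, a front-end (or the engine) could consume such a file with something as simple as the following rough sketch (Python just for illustration; lookup_in_abbott_smith stands in for whatever the front-end already does to open a lexicon entry, so it is purely hypothetical):

# Sketch only: parse the proposed StrongsGreek2AbbottSmith.map format and use
# it to redirect a Strong's-number lookup to a lemma-keyed lexicon.

def load_map(path):
    """Read lines of the form 'G1=<lemma>' into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            mapping[key] = value
    return mapping

def lookup_in_abbott_smith(lemma):
    """Stand-in for the front-end's normal lemma-keyed lookup (hypothetical)."""
    raise NotImplementedError

strongs_to_lemma = load_map("StrongsGreek2AbbottSmith.map")

def lookup(strongs_number):
    # e.g. a click on a word in the KJV yields "G2"
    lemma = strongs_to_lemma.get(strongs_number)
    if lemma is None:
        return None  # fall back to the Strong's lexicon itself
    return lookup_in_abbott_smith(lemma)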

I use Greek because I have the data ready at hand, but this method would
be even more important for Hebrew. I was testing with BibleTime and
found that only some of the lemmas in WHM would find their way to the
correct BDB entry. This is because their lemmatizations are different.
Providing for a mapping would allow us to resolve those conflicts for
the user. Also, the OSMHB module could find entries in BDB keyed to
Hebrew, and the WHM could find entries in BDB or Strongs. I expect this
mapping would need to happen at the engine level.

Is that a reasonable solution? Or does someone have a better idea?

I believe that mapping to/from Strong's numbers is not one-to-one, but many-to-many. We currently allow lookups based on lemmata by keying lexica to lemmata. A lexicon can have multiple keys point to a single entry.

Yes, mapping between them is complicated and not all cases will work exactly right. Yes, multiple lexical keys *sort of* point to a single entry. In practice they point to text that says "@LINK" followed by the other key, but they do not link to the actual entry. For example, I created a lexicon with Hebrew and Strong's keys, and the result for H1 was:

H0001 @LINK אָב

Lookup *should* be seamless; that is, the user should not have to find the entry manually. Maybe in some odd cases the user would need to scroll up or down an entry or two, but the above example would require scrolling ~8600 entries away. And certainly there should not be empty entries like the one above.
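
To be clear about what "seamless" would mean even with the current data, the front-end could at least chase these references itself, along the lines of the following rough sketch (Python for illustration only; get_entry stands in for however the raw entry text is read, so it is hypothetical):

# Sketch only: follow "@LINK <target-key>" placeholders until a real entry is
# found, so the user never sees the placeholder shown above.

MAX_HOPS = 5  # guard against circular links

def get_entry(lexicon, key):
    """Stand-in for reading a raw lexicon entry (hypothetical)."""
    raise NotImplementedError

def resolve(lexicon, key):
    for _ in range(MAX_HOPS):
        text = get_entry(lexicon, key)
        if text is None or not text.strip().startswith("@LINK"):
            return key, text      # a real entry (or nothing at all)
        parts = text.strip().split(None, 1)
        if len(parts) < 2:
            return key, text      # malformed link; give up
        key = parts[1]            # e.g. H0001's text points to the lemma key
    return key, None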

I am simply advocating a solution that will hide some of the guts of the data and just work for the user. Let Strong's and the KJV be keyed to Strong's numbers, and MorphGNT, WHM, Abbott-Smith, BDB, etc. be keyed to natural-language lemmas, but find a way to connect them seamlessly.

Ultimately, it would be very nice to write a stemmer for each of the relevant languages, index lexica by stem (or facilitate searches by stem), and thus do away with some of the need to pre-lemmatize texts. I don't know whether stemming algorithms exist for Greek & Hebrew or necessarily how reliable they would be, but it's an area worth some research.

--Chris
That task is beyond me, and as far as I know it is standard practice to pre-lemmatize texts. And we have the texts pre-lemmatized already. The real challenge for this use case at the moment is getting from those texts to the proper lexical entry. Currently, to do this reliably in SWORD, you have to stay within a lemmatization silo: working with Strong's texts you can get to a Strong's lexical entry very reliably, but move outside of that and it is inconsistent. I am just trying to find some solution. It does not need to be mine, but it needs to work. My proposal may not be the best solution, but it would save having to add foreign lexical keys (i.e., Strong's numbers) to lexica like Abbott-Smith or BDB.

Daniel

