On 10/13/2012 02:43 AM, Chris Little wrote:
On 10/12/2012 1:40 PM, Daniel Owens wrote:
The markup would look like this:

Hebrew (from Deuteronomy): <w lemma="whmlemma:Hאבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

Aramaic (from Jeremiah): <w lemma="whmlemma:Aאבד" morph="whmmorph:some_value">יֵאבַ֧דוּ</w>

The main problem I see is that other front-ends may not follow the
process of looking for G or H and then stripping the character before
looking up the entry.
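
To make the concern concrete, here is a minimal sketch (Python, with hypothetical data structures) of the prefix-stripping step a front-end would have to implement for values like Hאבד or Aאבד; a front-end that skips it would pass the prefixed form straight to the lexicon and miss the entry:

def lookup_definition(lemma_value, lexica):
    # Strip a leading H (Hebrew) or A (Aramaic) language marker,
    # then look the bare headword up in the matching lexicon.
    if lemma_value[:1] in ("H", "A"):
        lang = "he" if lemma_value[0] == "H" else "arc"
        headword = lemma_value[1:]
    else:
        lang, headword = "he", lemma_value  # assumption: unmarked = Hebrew
    return lexica.get(lang, {}).get(headword)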

Could we come to a consensus on this?

I would recommend taking a look at the markup used in the MorphGNT module, which also employs real lemmata in addition to lemmata coded as Strong's numbers:

<w morph="robinson:N-NSF" lemma="lemma.Strong:βίβλος strong:G0976">Βίβλος</w>

You should begin the workID for real lemmata with "lemma.", and follow this with some identifier indicating the lemmatization scheme. We have some code in Sword that looks for "lemma." and will treat the value as a real word rather than a Strong's number or something else. I think OSIS validation may complain about the workIDs of the form "lemma.system", but that's a schema bug and you should ignore it.
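
As a rough illustration of that behavior (a sketch of the idea, not the actual Sword code), the classification amounts to something like:

def classify_lemma(attr_value):
    # Split an OSIS lemma attribute of the form "workID:value".
    work_id, _, value = attr_value.partition(":")
    if work_id.startswith("lemma."):
        scheme = work_id[len("lemma."):]    # e.g. "Strong", "whm.he"
        return ("headword", scheme, value)  # a real word
    return ("key", work_id, value)          # e.g. a Strong's number

# classify_lemma("lemma.Strong:βίβλος") -> ("headword", "Strong", "βίβλος")
# classify_lemma("strong:G0976")        -> ("key", "strong", "G0976")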

As for the value of the lemma itself ([HA]אבד in your example above), use the form specified by the system you are employing. So, if WHM employs its own lemmatization system in which Hebrew takes the form @<word> and Aramaic %<word>, then use those forms, e.g.:

<w lemma="lemma.whm:@אבד"> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

The alternative is to distinguish the languages via the workID:

<w lemma="lemma.whm.he:אבד"> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

If you aren't creating a lexical resource that indexes on @- and %-prefixed lemmata, then I don't see how the former option is useful and would recommend the latter, which will allow lookups in word-indexed lexica.
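
A sketch of how a front-end could route the latter form to a lexicon in the right language (the Aramaic workID lemma.whm.arc is my assumption; only the Hebrew form appears above):

def route_lookup(attr_value, lexica_by_lang):
    # "lemma.whm.he:אבד"  -> look up אבד in the Hebrew ("he") lexicon;
    # "lemma.whm.arc:אבד" -> look it up in the Aramaic ("arc") one.
    work_id, _, headword = attr_value.partition(":")
    parts = work_id.split(".")              # ["lemma", "whm", "he"]
    lang = parts[2] if len(parts) > 2 else "he"
    return lexica_by_lang.get(lang, {}).get(headword)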

--Chris

Thanks, Chris. I had not thought of the latter solution, but that is what we need. This raises a fundamental question: how will front-ends find the right lexical entry?

Currently, as I understand it, a conf file may include Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest that the value Feature=AramaicDef also be allowed. Front-ends would then be able to find entries in the correct language.
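
For example, a hypothetical Aramaic lexicon's .conf might then declare (module name and path invented for illustration; AramaicDef is the proposed value):

[SomeAramaicLexicon]
DataPath=./modules/lexdict/zld/somearamaiclexicon/
ModDrv=zLD
Lang=arc
Feature=AramaicDef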

But lemmatization can vary in its details even within a language. How could we include mappings between lemmatization schemes? With such mappings, a text using Strong's numbers could look up words in a lexicon keyed to Greek, Hebrew, or Aramaic lemmata, and vice versa. Perhaps a simple mapping format could be the following:

The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.

Front-ends could use these mappings to find the correct lexical entry. A lookup from the KJV could then find the relevant entry in AbbottSmith. And with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from MorphGNT could find the correct entry in Strongs, if that is the default Greek lexicon for the front-end.
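
A minimal sketch (Python; the parsing rules are assumptions about the proposed format, and the looked-up values are illustrative) of how the engine or a front-end might consume such files:

def load_map(path):
    # Parse key=value lines, e.g. "G1=α", into a dict.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                key, _, value = line.partition("=")
                mapping[key] = value
    return mapping

strongs2as = load_map("StrongsGreek2AbbottSmith.map")
strongs2as.get("G1")  # -> "α", the Abbott-Smith headword

# Chained lookup: MorphGNT lemma -> Strong's number -> default Greek lexicon
morphgnt2strongs = load_map("MorphGNT2StrongsGreek.map")
morphgnt2strongs.get("βίβλος")  # e.g. "G0976" (illustrative)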

I use Greek because I have the data ready at hand, but this method would be even more important for Hebrew. I was testing with BibleTime and found that only some of the lemmata in WHM would find their way to the correct BDB entry, because the two lemmatizations differ. Providing for a mapping would allow us to resolve those conflicts for the user. It would also let the OSMHB module find entries in a BDB keyed to Hebrew, and WHM find entries in BDB or Strongs. I expect this mapping would need to happen at the engine level.

Is that a reasonable solution? Or does someone have a better idea?

Daniel

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
