On 10/13/2012 02:43 AM, Chris Little wrote:
On 10/12/2012 1:40 PM, Daniel Owens wrote:
The markup would look like this:

Hebrew (from Deuteronomy): <w lemma="whmlemma:Hאבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

Aramaic (from Jeremiah): <w lemma="whmlemma:Aאבד" morph="whmmorph:some_value">יֵאבַ֧דוּ</w>

The main problem I see is that other front-ends may not follow the
process of looking for G or H and then stripping the character before
looking up the entry.
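
To make the concern concrete, here is a minimal sketch (Python, with hypothetical data structures) of the prefix-stripping step a front-end would have to implement for values like Hאבד or Aאבד; a front-end that skips it would pass the prefixed form straight to the lexicon and miss the entry:

def lookup_definition(lemma_value, lexica):
    # Strip a leading H (Hebrew) or A (Aramaic) language marker,
    # then look the bare headword up in the matching lexicon.
    if lemma_value[:1] in ("H", "A"):
        lang = "he" if lemma_value[0] == "H" else "arc"
        headword = lemma_value[1:]
    else:
        lang, headword = "he", lemma_value  # assumption: unmarked = Hebrew
    return lexica.get(lang, {}).get(headword)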

Could we come to a consensus on this?

I would recommend taking a look at the markup used in the MorphGNT module, which also employs real lemmata in addition to lemmata coded as Strong's numbers:

<w morph="robinson:N-NSF" lemma="lemma.Strong:βίβλος strong:G0976">Βίβλος</w>

You should begin the workID for real lemmata with "lemma.", and follow this with some identifier indicating the lemmatization scheme. We have some code in Sword that looks for "lemma." and will treat the value as a real word rather than a Strong's number or something else. I think OSIS validation may complain about the workIDs of the form "lemma.system", but that's a schema bug and you should ignore it.
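
As a rough illustration of that behavior (a sketch of the idea, not the actual Sword code), the classification amounts to something like:

def classify_lemma(attr_value):
    # Split an OSIS lemma attribute of the form "workID:value".
    work_id, _, value = attr_value.partition(":")
    if work_id.startswith("lemma."):
        scheme = work_id[len("lemma."):]    # e.g. "Strong", "whm.he"
        return ("headword", scheme, value)  # a real word
    return ("key", work_id, value)          # e.g. a Strong's number

# classify_lemma("lemma.Strong:βίβλος") -> ("headword", "Strong", "βίβλος")
# classify_lemma("strong:G0976")        -> ("key", "strong", "G0976")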

As for the value of the lemma itself ([HA]אבד in your example above), use the form specified by the system you are employing. So, if WHM employs its own lemmatization system in which Hebrew takes the form @<word> and Aramaic %<word>, then use those forms, e.g.:

<w lemma="lemma.whm:@אבד"> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

The alternative is to distinguish the languages via the workID:

<w lemma="lemma.whm.he:אבד"> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

If you aren't creating a lexical resource that indexes on @- and %-prefixed lemmata, then I don't see how the former option is useful and would recommend the latter, which will allow lookups in word-indexed lexica.
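
A sketch of how a front-end could route the latter form to a lexicon in the right language (the Aramaic workID lemma.whm.arc is my assumption; only the Hebrew form appears above):

def route_lookup(attr_value, lexica_by_lang):
    # "lemma.whm.he:אבד"  -> look up אבד in the Hebrew ("he") lexicon;
    # "lemma.whm.arc:אבד" -> look it up in the Aramaic ("arc") one.
    work_id, _, headword = attr_value.partition(":")
    parts = work_id.split(".")              # ["lemma", "whm", "he"]
    lang = parts[2] if len(parts) > 2 else "he"
    return lexica_by_lang.get(lang, {}).get(headword)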

--Chris

Thanks, Chris. I had not thought of the latter solution, but that is what we need. This raises a fundamental question: how will front-ends find the right lexical entry?

Currently, as I understand it, a conf file may include Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest that the value Feature=AramaicDef also be allowed. Front-ends would then be able to find entries in the correct language.
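
For example, a hypothetical Aramaic lexicon's .conf might then declare (module name and path invented for illustration; AramaicDef is the proposed value):

[SomeAramaicLexicon]
DataPath=./modules/lexdict/zld/somearamaiclexicon/
ModDrv=zLD
Lang=arc
Feature=AramaicDef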

But lemmatization can vary in its details even within a language. How could we include mappings between lemmatization schemes? With such mappings, a text using Strong's numbers could look up words in a lexicon keyed to Greek, Hebrew, or Aramaic lemmata, and vice versa. Perhaps a simple mapping format could be the following:

The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.

Front-ends could use these mappings to find the correct lexical entry. A lookup from the KJV could then find the relevant entry in AbbottSmith. And with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from MorphGNT could find the correct entry in Strongs, if that is the default Greek lexicon for the front-end.
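
A minimal sketch (Python; the parsing rules are assumptions about the proposed format, and the looked-up values are illustrative) of how the engine or a front-end might consume such files:

def load_map(path):
    # Parse key=value lines, e.g. "G1=α", into a dict.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                key, _, value = line.partition("=")
                mapping[key] = value
    return mapping

strongs2as = load_map("StrongsGreek2AbbottSmith.map")
strongs2as.get("G1")  # -> "α", the Abbott-Smith headword

# Chained lookup: MorphGNT lemma -> Strong's number -> default Greek lexicon
morphgnt2strongs = load_map("MorphGNT2StrongsGreek.map")
morphgnt2strongs.get("βίβλος")  # e.g. "G0976" (illustrative)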

I use Greek because I have the data ready at hand, but this method would be even more important for Hebrew. I was testing with BibleTime and found that only some of the lemmata in WHM would find their way to the correct BDB entry, because the two lemmatizations differ. Providing for a mapping would allow us to resolve those conflicts for the user. It would also let the OSMHB module find entries in a BDB keyed to Hebrew, and WHM find entries in BDB or Strongs. I expect this mapping would need to happen at the engine level.

Is that a reasonable solution? Or does someone have a better idea?

Daniel

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
