On 10/13/2012 6:12 AM, Daniel Owens wrote:
> Thanks, Chris. I had not thought of the latter solution, but that is
> what we need. This raises a fundamental question: how will front-ends
> find the right lexical entry?
>
> Currently, according to my understanding, a conf file may include
> Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
> following value also be allowed: Feature=AramaicDef. Then front-ends
> will be able to find entries in the correct language.
HebrewDef indicates that a lexicon module is indexed by Strong's
numbers. Everything you've said so far indicates to me that you aren't
using Strong's numbers at all, so do not use Feature=HebrewDef. And
there should never be a Feature=AramaicDef, since Strong's numbers do
not distinguish Aramaic from Hebrew.
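For reference, a Strong's-keyed Hebrew lexicon advertises this today
with a conf roughly like the following (the module name and path here
are just illustrative):

[StrongsHebrew]
DataPath=./modules/lexdict/rawld/strongshebrew/
ModDrv=RawLD
Lang=he
Feature=HebrewDef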
I think it would probably be helpful if you could enumerate the set of
modules you propose to create:
- a Bible (just one? more than one?)
- a lexicon? separate Hebrew & Aramaic lexica?
- a morphology database? separate Hebrew & Aramaic databases?
My guess is that you are advocating a Feature value that indicates "this
lexicon module contains words in language X, indexed by lemma/word". I
would absolutely be supportive of adding this, but we currently have
nothing comparable in use. I would advocate
(Greek|Hebrew|Aramaic|...)WordDef for the value.
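Under that proposal, a lexicon keyed by Hebrew lemmata might declare
itself with something like this (HebrewWordDef being the proposed, not
yet standardized value, and the module name again just illustrative):

[SomeHebrewLexicon]
DataPath=./modules/lexdict/rawld/somehebrewlexicon/
ModDrv=RawLD
Lang=he
Feature=HebrewWordDef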
> But lemmatization can vary somewhat in its details within a language.
> How could we include mappings between lemmatizations? That way a text
> using Strong's numbers could look up words in a lexicon keyed to
> Greek, Hebrew, or Aramaic lemmata, and vice versa.
>
> Perhaps a simple mapping format would work. The file
> StrongsGreek2AbbottSmith.map could contain:
> G1=α
> G2=Ἀαρών
> G3=Ἀβαδδών
> etc.
> Front-ends could use these mappings to find the correct lexical
> entry. A lookup from the KJV could then find the relevant entry in
> AbbottSmith, and with a similar mapping, MorphGNT2StrongsGreek.map, a
> lookup from MorphGNT could find the correct entry in Strongs, if that
> is the front-end's default Greek lexicon.
> I use Greek because I have the data ready at hand, but this method
> would be even more important for Hebrew. I was testing with BibleTime
> and found that only some of the lemmata in WHM would find their way
> to the correct BDB entry, because the two lemmatizations differ.
> Providing for a mapping would allow us to resolve those conflicts for
> the user. Also, the OSMHB module could find entries in BDB keyed to
> Hebrew, and WHM could find entries in BDB or Strongs. I expect this
> mapping would need to happen at the engine level.
> Is that a reasonable solution? Or does someone have a better idea?
I believe that mapping to/from Strong's numbers is not one-to-one, but
many-to-many. We currently allow lemma-based lookups by keying lexica to
lemmata, and a lexicon can have multiple keys pointing to a single entry.
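To make the many-to-many point concrete, here is a minimal sketch (not
engine code, just an illustration in C++) of loading the proposed
key=value map format into a multimap, so that one Strong's number can
yield several target lemmata:

#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Illustrative only: load a key=value map file such as
// StrongsGreek2AbbottSmith.map into a multimap, since one Strong's
// number may map to several lemmata (and vice versa).
std::multimap<std::string, std::string> loadMap(const std::string &path) {
    std::multimap<std::string, std::string> mapping;
    std::ifstream in(path.c_str());
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq != std::string::npos)
            mapping.insert(std::make_pair(line.substr(0, eq),
                                          line.substr(eq + 1)));
    }
    return mapping;
}

int main() {
    typedef std::multimap<std::string, std::string> Map;
    Map m = loadMap("StrongsGreek2AbbottSmith.map");
    // Print every target lemma mapped from G1 -- possibly more than one.
    std::pair<Map::iterator, Map::iterator> range = m.equal_range("G1");
    for (Map::iterator it = range.first; it != range.second; ++it)
        std::cout << it->second << std::endl;
    return 0;
}

A reverse lookup (lemma back to Strong's number) would just use a
second multimap built the other way around.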
Ultimately, it would be very nice to write a stemmer for each of the
relevant languages, index lexica by stem (or facilitate searches by
stem), and thus do away with some of the need to pre-lemmatize texts. I
don't know whether stemming algorithms exist for Greek & Hebrew, or how
reliable they would be, but it's an area worth some research.
--Chris