On 10/13/2012 05:23 PM, Chris Little wrote:
On 10/13/2012 6:12 AM, Daniel Owens wrote:
Thanks, Chris. I had not thought of the latter solution, but that is
what we need. This raises a fundamental question: how will front-ends
find the right lexical entry?

Currently, according to my understanding, a conf file may include
Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
following value also be allowed: Feature=AramaicDef. Then front-ends
will be able to find entries in the correct language.

HebrewDef indicates that a lexicon module is indexed by Strong's numbers. Everything you've said so far indicates to me that you aren't using Strong's numbers at all, so do not use Feature=HebrewDef. Also, there should not ever be a Feature=AramaicDef since Aramaic Strong's numbers are not distinguished from Hebrew.

Yes, I am not using Strong's numbers at all. I am hoping to help SWORD move away from its dependence upon Strong's, both the module and the numbers. It never occurred to me when someone told me to use Feature=HebrewDef that it was reserved only for Strong's numbers. But if that is what it does, then I understand why my suggestion to add AramaicDef should be discarded. No problem, though in my defense the nomenclature is misleading (perhaps it should be called StrongsHebrewDef?).

I think it would probably be helpful if you could enumerate the set of modules you propose to create:

a Bible (just one? more than one?)
a lexicon? separate Hebrew & Aramaic lexica?
a morphology database? separate Hebrew & Aramaic databases?

I am trying to ensure that there are respectable free or low-cost options for study of the Bible in Greek, Hebrew, and Aramaic. I am trying to envision the big picture, some of which is already filled in, and then work toward filling in the rest. In the end I would like to see the following modules:

For Greek:
- Bible Texts: MorphGNT (Greek lemma, not Strong's numbers); other future texts with Greek lemmas; other current and future texts with Strong's numbers (Tischendorf, WH, KJV, etc.)
- Lexica: Strong's Greek; Abbott-Smith (Greek lemma)

For Hebrew:
- Bible Texts: WHM (Hebrew lemma); OSMHB (currently has Strong's numbers, but eventually I hope will have some other more up-to-date lemmatization)
- Lexica: Strong's Hebrew; BDB Hebrew (Hebrew lemma); BDB Aramaic (Aramaic lemma)

My guess is that you are advocating a Feature value that indicates "this lexicon module contains words in language X, indexed by lemma/word". I would absolutely be supportive of adding this, but we currently have nothing comparable in use. I would advocate (Greek|Hebrew|Aramaic|...)WordDef for the value.

That makes sense to me. That's what I thought I was advocating. :) Just to make sure we are communicating, though, you mean Feature=GreekWordDef, etc., right?
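
For concreteness, I imagine a lemma-keyed lexicon's .conf would then look something like this (the module name, path, and other values here are only illustrative; the Feature line is the point, and GreekWordDef is of course still only a proposal):

[AbbottSmith]
DataPath=./modules/lexdict/zld/abbottsmith/
ModDrv=zLD
Lang=grc
Description=Abbott-Smith, A Manual Greek Lexicon of the New Testament
Feature=GreekWordDef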

But lemmatization can vary somewhat in its details within a language. How could we include mappings between lemmatizations, so that a text using Strong's numbers could look up words in a lexicon keyed to Greek, Hebrew, or Aramaic lemmas, and vice versa? Perhaps a simple mapping format could be the following:

The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.

Front-ends could use these mappings to find the correct lexical entry. A lookup from the KJV could then find the relevant entry in AbbottSmith, and with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from MorphGNT could find the correct entry in Strongs, if that is the default Greek lexicon for the front-end.
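
To illustrate, a front-end (or the engine) could consume such a file with something as simple as the following rough sketch (Python just for illustration; lookup_in_abbott_smith stands in for whatever the front-end already does to open a lexicon entry, so it is purely hypothetical):

# Sketch only: parse the proposed StrongsGreek2AbbottSmith.map format and use
# it to redirect a Strong's-number lookup to a lemma-keyed lexicon.

def load_map(path):
    """Read lines of the form 'G1=<lemma>' into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            mapping[key] = value
    return mapping

def lookup_in_abbott_smith(lemma):
    """Stand-in for the front-end's normal lemma-keyed lookup (hypothetical)."""
    raise NotImplementedError

strongs_to_lemma = load_map("StrongsGreek2AbbottSmith.map")

def lookup(strongs_number):
    # e.g. a click on a word in the KJV yields "G2"
    lemma = strongs_to_lemma.get(strongs_number)
    if lemma is None:
        return None  # fall back to the Strong's lexicon itself
    return lookup_in_abbott_smith(lemma)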

I use Greek because I have the data ready at hand, but this method would
be even more important for Hebrew. I was testing with BibleTime and
found that only some of the lemmas in WHM would find their way to the
correct BDB entry. This is because their lemmatizations are different.
Providing for a mapping would allow us to resolve those conflicts for
the user. Also, the OSMHB module could find entries in BDB keyed to
Hebrew, and the WHM could find entries in BDB or Strongs. I expect this
mapping would need to happen at the engine level.

Is that a reasonable solution? Or does someone have a better idea?

I believe that mapping to/from Strong's numbers is not one-to-one, but many-to-many. We currently allow lookups based on lemmata by keying lexica to lemmata. A lexicon can have multiple keys point to a single entry.

Yes, mapping between them is complicated and not all cases will work exactly right. Yes, multiple lexical keys *sort of* point to a single entry. In practice they point to text that says "@LINK" followed by the other key, but they do not link to the actual entry. For example, I created a lexicon with Hebrew and Strong's keys, and the result for H1 was:

H0001 @LINK אָב

Lookup *should* be seamless; that is, the user should not have to find the entry manually. Maybe in some odd cases the user would need to scroll up or down an entry or two, but the above example would require scrolling ~8600 entries away. And certainly there should not be empty entries like the one above.
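
To be clear about what "seamless" would mean even with the current data, the front-end could at least chase these references itself, along the lines of the following rough sketch (Python for illustration only; get_entry stands in for however the raw entry text is read, so it is hypothetical):

# Sketch only: follow "@LINK <target-key>" placeholders until a real entry is
# found, so the user never sees the placeholder shown above.

MAX_HOPS = 5  # guard against circular links

def get_entry(lexicon, key):
    """Stand-in for reading a raw lexicon entry (hypothetical)."""
    raise NotImplementedError

def resolve(lexicon, key):
    for _ in range(MAX_HOPS):
        text = get_entry(lexicon, key)
        if text is None or not text.strip().startswith("@LINK"):
            return key, text      # a real entry (or nothing at all)
        parts = text.strip().split(None, 1)
        if len(parts) < 2:
            return key, text      # malformed link; give up
        key = parts[1]            # e.g. H0001's text points to the lemma key
    return key, None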

I am simply advocating a solution that will hide some of the guts of the data and just work for the user. Let Strong's and the KJV be keyed to Strong's numbers, and MorphGNT, WHM, Abbott-Smith, BDB, etc. be keyed to natural-language lemmas, but find a way to connect them seamlessly.

Ultimately, it would be very nice to write a stemmer for each of the relevant languages, index lexica by stem (or facilitate searches by stem), and thus do away with some of the need to pre-lemmatize texts. I don't know whether stemming algorithms exist for Greek & Hebrew or necessarily how reliable they would be, but it's an area worth some research.

--Chris
That task is beyond me, and as far as I know it is standard practice to pre-lemmatize texts. And we have the texts pre-lemmatized already. The real challenge for this use case at the moment is getting from those texts to the proper lexical entry. Currently, to do this reliably in SWORD, you have to stay within a lemmatization silo: working with Strong's texts you can get to a Strong's lexical entry very reliably, but move outside of that and it is inconsistent. I am just trying to find some solution. It does not need to be mine, but it needs to work. My proposal may not be the best solution, but it would save having to add foreign lexical keys (i.e., Strong's numbers) to lexica like Abbott-Smith or BDB.

Daniel

