On 10/13/2012 05:23 PM, Chris Little wrote:
On 10/13/2012 6:12 AM, Daniel Owens wrote:
Thanks, Chris. I had not thought of the latter solution, but that is
what we need. This raises a fundamental question: how will front-ends
find the right lexical entry?
Currently, according to my understanding, a conf file may include
Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
following value also be allowed: Feature=AramaicDef. Then front-ends
will be able to find entries in the correct language.
HebrewDef indicates that a lexicon module is indexed by Strong's
numbers. Everything you've said so far indicates to me that you aren't
using Strong's numbers at all, so do not use Feature=HebrewDef. Also,
there should never be a Feature=AramaicDef, since Aramaic Strong's
numbers are not distinguished from Hebrew ones.
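For reference, a Strong's-keyed lexicon advertises this in its .conf
file, roughly like the sketch below. The module name and paths here are
only illustrative, not a real module's conf:

    [StrongsHebrew]
    DataPath=./modules/lexdict/rawld/strongshebrew/
    ModDrv=RawLD
    Lang=en
    Feature=HebrewDef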
Yes, I am not using Strong's numbers at all. I am hoping to help SWORD
move away from its dependence upon Strong's, both the module and the
numbers. It never occurred to me when someone told me to use
Feature=HebrewDef that it was reserved only for Strong's numbers. But if
that is what it does, then I understand why my suggestion to add
AramaicDef should be discarded. No problem, though in my defense the
nomenclature is misleading (perhaps it should be called StrongsHebrewDef?).
I think it would probably be helpful if you could enumerate the set of
modules you propose to create:
- a Bible (just one? more than one?)
- a lexicon? separate Hebrew & Aramaic lexica?
- a morphology database? separate Hebrew & Aramaic databases?
I am trying to see to it that there are respectable free or low-cost
options for studying the Bible in Greek, Hebrew, and Aramaic. I am
trying to envision the big picture, some of which is already filled in,
and then work toward filling in the rest. In the end I would like to see
the following modules.
For Greek:
- Bible Texts: MorphGNT (Greek lemma, not Strong's numbers); other
future texts with Greek lemma, other current and future texts with
Strong's numbers (Tischendorf, WH, KJV, etc.)
- Lexica: Strong's Greek; Abbott-Smith (Greek lemma)
For Hebrew:
- Bible Texts: WHM (Hebrew lemma); OSMHB (currently has Strong's
numbers, but eventually I hope will have some other more up-to-date
lemmatization)
- Lexica: Strong's Hebrew; BDB Hebrew (Hebrew lemma); BDB Aramaic
(Aramaic lemma)
My guess is that you are advocating a Feature value that indicates
"this lexicon module contains words in language X, indexed by
lemma/word". I would absolutely be supportive of adding this, but we
currently have nothing comparable in use. I would advocate
(Greek|Hebrew|Aramaic|...)WordDef for the value.
That makes sense to me. That's what I thought I was advocating. :) Just
to make sure we are communicating, though, you mean
Feature=GreekWordDef, etc., right?
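So, just to picture it, an Abbott-Smith .conf might one day carry
something like the sketch below. GreekWordDef is only the value proposed
in this thread, not anything SWORD recognizes today, and the name and
path are invented:

    [AbbottSmith]
    DataPath=./modules/lexdict/rawld/abbottsmith/
    ModDrv=RawLD
    Lang=en
    Feature=GreekWordDef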
But lemmatization can vary somewhat in its details within a language.
How could we include mappings between lemmatizations? That way a text
using Strong's numbers could look up words in a lexicon keyed to Greek,
Hebrew, or Aramaic lemmata, and vice versa.
Perhaps a simple mapping format could be the following:
The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.
Frontends could use these mappings to find the correct lexical entry. A
lookup from the KJV could then find the relevant entry in AbbottSmith.
And with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from
MorphGNT could find the correct entry in Strong's, if that is the
default Greek lexicon for the front-end.
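To make the idea concrete, here is a rough C++ sketch of what a
front-end could do with such a file. None of this is existing SWORD or
BibleTime API; the file name and keys just follow the example above:

    // Sketch only: load "G1=α"-style lines from a mapping file and
    // translate a Strong's key into a lemma key before lookup.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    std::map<std::string, std::string> loadMap(const std::string &path) {
        std::map<std::string, std::string> mapping;
        std::ifstream in(path.c_str());
        std::string line;
        while (std::getline(in, line)) {
            std::string::size_type eq = line.find('=');
            if (eq == std::string::npos)
                continue;                          // skip malformed lines
            mapping[line.substr(0, eq)] = line.substr(eq + 1);
        }
        return mapping;
    }

    int main() {
        std::map<std::string, std::string> m =
            loadMap("StrongsGreek2AbbottSmith.map");
        std::map<std::string, std::string>::const_iterator it = m.find("G2");
        if (it != m.end())
            std::cout << "G2 -> " << it->second << "\n";   // Ἀαρών
        return 0;
    }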
I use Greek because I have the data ready at hand, but this method would
be even more important for Hebrew. I was testing with BibleTime and
found that only some of the lemmata in WHM would find their way to the
correct BDB entry. This is because their lemmatizations are different.
Providing for a mapping would allow us to resolve those conflicts for
the user. Also, the OSMHB module could find entries in BDB keyed to
Hebrew, and the WHM could find entries in BDB or Strong's. I expect this
mapping would need to happen at the engine level.
Is that a reasonable solution? Or does someone have a better idea?
I believe that mapping to/from Strong's numbers is not one-to-one, but
many-to-many. We currently allow lookups based on lemmata by keying
lexica to lemmata. A lexicon can have multiple keys point to a single
entry.
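If the relation really is many-to-many, the flat KEY=VALUE format would
still work, with repeated keys gathered into a multimap on load. A small
sketch; the second G1 lemma is invented purely for illustration:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::multimap<std::string, std::string> m;
        m.insert(std::make_pair("G1", "α"));
        m.insert(std::make_pair("G1", "ἄλφα"));   // invented duplicate key
        typedef std::multimap<std::string, std::string>::iterator Iter;
        std::pair<Iter, Iter> range = m.equal_range("G1");
        for (Iter it = range.first; it != range.second; ++it)
            std::cout << it->first << " -> " << it->second << "\n";
        return 0;
    }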
Yes, mapping between them is complicated, and not all cases will work
exactly right. Yes, multiple lexical keys *sort of* point to a single
entry. In practice they point to text that says "@LINK" plus the other
key but do not link to the actual entry. For example, I created a
lexicon with Hebrew and Strong's keys, and the result for H1 was:
H0001 @LINK אָב
Lookup *should* be seamless; that is, the user should not have to find
the entry manually. Maybe in some odd cases the user would need to
scroll up or down an entry or two, but the above example would require
scrolling ~8600 entries away. And certainly there should not be empty
entries like the one above.
I am simply advocating a solution that will hide some of the guts of the
data and just work for the user. Let Strong's and the KJV be keyed to
Strong's numbers, and MorphGNT, WHM, Abbott-Smith, BDB, etc. be keyed to
natural-language lemmata. But find a way to connect them seamlessly.
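For instance, just resolving the @LINK indirection automatically would
go a long way. A rough sketch of the idea; the in-memory layout is only
assumed from my H0001 example above, not how SWORD actually stores
entries:

    #include <iostream>
    #include <map>
    #include <string>

    // Follow "@LINK <key>" stubs until a real entry is reached.
    std::string resolve(const std::map<std::string, std::string> &lex,
                        std::string key) {
        const std::string tag = "@LINK ";
        for (int hops = 0; hops < 5; ++hops) {     // guard against cycles
            std::map<std::string, std::string>::const_iterator it =
                lex.find(key);
            if (it == lex.end())
                return "";
            if (it->second.compare(0, tag.size(), tag) != 0)
                return it->second;                 // a real entry
            key = it->second.substr(tag.size());   // follow the link
        }
        return "";
    }

    int main() {
        std::map<std::string, std::string> lex;
        lex["H0001"] = "@LINK אָב";                // the stub from above
        lex["אָב"] = "father, ancestor ...";        // actual entry (abridged)
        std::cout << resolve(lex, "H0001") << "\n";
        return 0;
    }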
Ultimately, it would be very nice to write a stemmer for each of the
relevant languages, index lexica by stem (or facilitate searches by
stem), and thus do away with some of the need to pre-lemmatize texts.
I don't know whether stemming algorithms exist for Greek & Hebrew or
necessarily how reliable they would be, but it's an area worth some
research.
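To show the general shape such a thing might take, here is a
deliberately crude sketch; the endings are invented for illustration and
nowhere near adequate for real Greek:

    #include <iostream>
    #include <string>

    // Strip one of a few common Greek endings (longest first). UTF-8
    // suffixes compare correctly as raw bytes.
    std::string stem(const std::string &word) {
        const char *endings[] = { "ουσιν", "ομεν", "εται", "ος", "ον", "ου" };
        for (size_t i = 0; i < sizeof(endings) / sizeof(endings[0]); ++i) {
            std::string e(endings[i]);
            if (word.size() > e.size() &&
                word.compare(word.size() - e.size(), e.size(), e) == 0)
                return word.substr(0, word.size() - e.size());
        }
        return word;
    }

    int main() {
        std::cout << stem("λογος") << "\n";   // -> λογ
        return 0;
    }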
--Chris
That task is beyond me, and as far as I know it is standard practice to
pre-lemmatize texts. And we have the texts pre-lemmatized already. The
real use-case challenge at the moment is getting from those texts to the
proper lexical entry. Currently, to do this reliably in SWORD you have
to stay within a lemmatization silo. In other words, working with
Strong's texts you can get to a Strong's lexical entry very reliably.
But move outside of that and it is inconsistent. I am just trying to
find some solution. It does not need to be mine, but it needs to work.
My proposal may not be the best solution, but it would save having to
add foreign lexical keys (i.e., Strong's numbers) to lexica like
Abbott-Smith or BDB.
Daniel
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page