Thanks: your detailed instructions and example (plus a little puzzling over how Java works, since i'm not a Java guy) produced some useful results, as well as (of course!) a few more questions about running this with Thayer's.

1) there are various complaints: i'm not sure if they're significant
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring unexpected entry in orthodoxy of sMinimumVersion
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightHolder=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightDate=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: DistributionNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightContactEmail=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: DistributionNotes=
org.crosswire.jsword.book.filter.thml.THMLFilter(INFO): Could not fix it by cleaning tags: Illegal character or entity reference syntax.

2) the results from Thayer's seem to have lost the Greek characters. What's in the .imp file looks like some 8-bit chars
ωφελιμος
which i assume is some kind of representation of the Greek characters (haven't quite figured out what: doesn't seem to be UTF-8). But this winds up in the output as a string of '?'s.
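One way to narrow down the encoding is to take the raw bytes of one headword from the .imp file and try a handful of candidate codecs. A minimal Python sketch follows. The candidate list (UTF-8, ISO-8859-7, Windows-1253) is only a guess at likely Greek encodings, not something confirmed for the Thayer's module, and try_decodings is a hypothetical helper name.

```python
def try_decodings(raw, candidates=("utf-8", "iso-8859-7", "cp1253")):
    """Decode raw bytes with each candidate codec; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # Illustrative bytes only: the word from this message, encoded as
    # ISO-8859-7. Real input would be bytes read from the .imp file.
    sample = "ωφελιμος".encode("iso-8859-7")
    for codec, text in try_decodings(sample).items():
        print(codec, "->", text)
```

If one codec yields readable Greek, that is a reasonable candidate for the file's encoding. Strings of '?' are what many converters emit for characters they cannot map, so a mismatch between the assumed and actual input encoding is one plausible (unconfirmed) cause of the lost characters.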

3) entry 5207 (huios) produces bad XML: it looks like a TDNT reference attribute in a sync tag doesn't get its terminating quote (after "8:400"?), and the slash + angle bracket ending the sync are also missing:
AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 (<sync type="TDNT" value="8:400, 1210), Son of God
The fault seems to exist in the .imp file as well (which has these <sync> tags embedded).
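To check whether other entries have the same defect, the .imp text could be scanned for sync tags that are not properly closed. This is a rough Python sketch, assuming every good tag has the shape <sync name="value" ... /> as in the samples here; find_broken_sync_tags is a hypothetical helper, not part of any SWORD tool.

```python
import re

# Start of any sync tag.
SYNC_START = re.compile(r"<sync\b")
# A complete tag: zero or more name="value" attributes, then "/>".
# Assumes double-quoted attribute values, as in the samples above.
WELL_FORMED = re.compile(r'<sync(?:\s+[A-Za-z]+="[^"<>]*")*\s*/>')


def find_broken_sync_tags(text):
    """Return the offsets of <sync tags that are not well formed."""
    broken = []
    for match in SYNC_START.finditer(text):
        if not WELL_FORMED.match(text, match.start()):
            broken.append(match.start())
    return broken


if __name__ == "__main__":
    line = ('AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 '
            '(<sync type="TDNT" value="8:400, 1210), Son of God')
    for offset in find_broken_sync_tags(line):
        print("broken sync tag at offset", offset, ":", line[offset:offset + 30])
```

Run over each line of the file, this only reports where a tag looks damaged; deciding how to repair it (for example, where the missing quote belongs) still needs a human eye.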

4) there are a number of bare "&" characters in the original which seem to get dropped in the output instead of replaced with &amp; (except for one in #5207, one might suppose because of the unterminated attribute/tag issue)
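As a possible workaround, the bare ampersands could be escaped in the .imp text before it is handed to the converter. A minimal Python sketch, assuming anything of the form &name; or &#123; or &#x1F; is an intentional entity reference and should be left alone (escape_bare_ampersands is a hypothetical name):

```python
import re

# An "&" that does not begin a named, decimal, or hexadecimal reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace each bare '&' with '&amp;', leaving existing references intact."""
    return BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    sample = ('For Synonyms see entry <sync type="Strongs" value="G5811" /> '
              '& <sync type="Strongs" value="G5889" />')
    print(escape_bare_ampersands(sample))
```

The pattern is deliberately conservative: text that merely looks like a reference (a word followed by a semicolon right after the "&") is left untouched, so a few bare ampersands could still slip through.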

5) There are some issues with the synonym references around ampersands (whether related to #4 i can't tell): the .imp file has
For Synonyms see entry <sync type="Strongs" value="G5811" /> & <sync type="Strongs" value="G5889" />
but the OSISified output has
<w lemma='strong:G5811'>
For Synonyms see entry </w><w lemma='strong:G5889'> </w>

Hope this feedback is helpful, and thanks again for the pointers. Unless there's a solution to the problem with the Greek characters, i'll have to fall back to parsing the .imp file by hand, since getting these out is important to me. By the way, what displays in the Sword Project for Thayer's lacks accents and breathing marks, though by comparison i see them in e-Sword's version: anyone happen to know why?

His,
Sean

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
