So, as a side note to this thread, the Sahidic Bible is maintained at coptot.manuscriptroom.com:

http://coptot.manuscriptroom.com/transcribing?docID=1620025&userName=PUBLISHED

and we regularly export from there and import into swordweb, which is used for their browser plugin (first link on Christian Askeland's wonderful resource list for Coptic):

https://sites.google.com/site/askelandchristian/copticlinks

We don't index the text. They typically search with regex (and yes, they know about the {byte_count} anomaly with our regex search).

-Troy

On 04/26/2017 03:21 PM, DM Smith wrote:
> Consider using Luke to analyze the constructed Lucene index.
> See: https://code.google.com/archive/p/luke/
> I think you’ll need one that matches Lucene 1.9.1. Maybe 1.4.x.
>
> DM
>
>> On Apr 26, 2017, at 3:48 PM, David Haslam <dfh...@googlemail.com> wrote:
>>
>> If you examine the result preview pane in the Xiphos Advanced Search dialog,
>> the problem becomes apparent.
>>
>> Most Coptic Unicode characters are not displayed correctly.
>>
>> The remainder seem to have been converted to U+FFFD REPLACEMENT CHARACTER.
>>
>> i.e.
>> All these Coptic letters are basically not handled aright by this part
>> of the software:
>>
>> U+2C81 ⲁ COPTIC SMALL LETTER ALFA
>> U+2C83 ⲃ COPTIC SMALL LETTER VIDA
>> U+2C85 ⲅ COPTIC SMALL LETTER GAMMA
>> U+2C87 ⲇ COPTIC SMALL LETTER DALDA
>> U+2C89 ⲉ COPTIC SMALL LETTER EIE
>> U+2C8B ⲋ COPTIC SMALL LETTER SOU
>> U+2C8D ⲍ COPTIC SMALL LETTER ZATA
>> U+2C8F ⲏ COPTIC SMALL LETTER HATE
>> U+2C91 ⲑ COPTIC SMALL LETTER THETHE
>> U+2C93 ⲓ COPTIC SMALL LETTER IAUDA
>> U+2C95 ⲕ COPTIC SMALL LETTER KAPA
>> U+2C97 ⲗ COPTIC SMALL LETTER LAULA
>> U+2C99 ⲙ COPTIC SMALL LETTER MI
>> U+2C9B ⲛ COPTIC SMALL LETTER NI
>> U+2C9D ⲝ COPTIC SMALL LETTER KSI
>> U+2C9F ⲟ COPTIC SMALL LETTER O
>> U+2CA1 ⲡ COPTIC SMALL LETTER PI
>> U+2CA3 ⲣ COPTIC SMALL LETTER RO
>> U+2CA5 ⲥ COPTIC SMALL LETTER SIMA
>> U+2CA7 ⲧ COPTIC SMALL LETTER TAU
>> U+2CA9 ⲩ COPTIC SMALL LETTER UA
>> U+2CAB ⲫ COPTIC SMALL LETTER FI
>> U+2CAD ⲭ COPTIC SMALL LETTER KHI
>> U+2CAF ⲯ COPTIC SMALL LETTER PSI
>> U+2CB1 ⲱ COPTIC SMALL LETTER OOU
>> U+2CC1 ⳁ COPTIC SMALL LETTER SAMPI
>> U+2CE8 ⳨ COPTIC SYMBOL TAU RO
>>
>> Only the few Coptic letters in the block U+03E2 to U+03EF are displayed
>> aright.
>>
>> It's no wonder that a search has so many spurious results if most of the
>> search space has been squashed into Unicode replacement characters.
>>
>> I'm a Windows user, as most of you know already.
>> Does the same thing happen in Xiphos under Linux?
>>
>> Is this an issue common to all SWORD based front-ends?
>> The fact that we see similar results in PocketSword strongly suggests
>> it is.
>>
>> Best regards,
>>
>> David
>>
>> --
>> View this message in context:
>> http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657106.html
>> Sent from the SWORD Dev mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel@crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
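For anyone who wants to check a search preview string for the symptom David describes without going through a front-end, here is a minimal Python sketch. It is only a diagnostic under stated assumptions: the names `describe` and `replacement_ratio` are made up for illustration and are not part of SWORD, Xiphos, or Lucene. The UTF-8 length is reported only because the two Coptic ranges happen to differ there (3 bytes for U+2C80..U+2CFF, 2 bytes for U+03E2..U+03EF); nothing in this thread establishes that as the cause.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def describe(ch: str) -> dict:
    """Report code point, Unicode name, block, and UTF-8 length for one character."""
    cp = ord(ch)
    if 0x2C80 <= cp <= 0x2CFF:
        block = "Coptic"            # the block holding the letters listed above
    elif 0x0370 <= cp <= 0x03FF:
        block = "Greek and Coptic"  # contains U+03E2..U+03EF
    else:
        block = "other"
    return {
        "codepoint": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "block": block,
        "utf8_len": len(ch.encode("utf-8")),
    }


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT for c in chars) / len(chars)


if __name__ == "__main__":
    # One letter from each range mentioned in the thread.
    for ch in ("\u2c81", "\u03e3"):
        print(describe(ch))
    # Hypothetical preview string: one intact letter, three replacement characters.
    print(replacement_ratio("\u2c81\ufffd \ufffd\ufffd"))
```

Pasting text copied from the preview pane into `replacement_ratio` gives a rough measure of how much of the result has been squashed into U+FFFD, which can then be compared across front-ends and platforms.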