Adrian Korten wrote:
g'day,
I've been wondering whether Thai would benefit from Lucene. Even if it does support utf-8, I doubt that Lucene supports Thai when no word breaks are provided. Even if it had smarts to handle Thai word-breaking like ICU, it would stumble over the Biblical words. Soooo, I haven't tried it.
Hopefully someone who actually knows what Lucene indexes will answer this better (and especially correct me if I'm wrong), but I expect Lucene would benefit Thai searching somewhat because it can search within words, not just on full words. (By 'words' here, I mean the definition that applies in French: anything with whitespace on both sides.)
We could probably also pass the text through the ICU Thai word-break iterator to add surrounding whitespace before we hand it to the Lucene indexer. Does anyone more knowledgeable know whether that would work (on the Lucene side)?
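To make the idea concrete, here is a rough sketch of the pre-processing step. It is not tested against SWORD or Lucene. It assumes java.text.BreakIterator from the JDK as a stand-in for ICU's word-break iterator (ICU4J's com.ibm.icu.text.BreakIterator has a very similar interface). The class name ThaiWordSpacer, the method insertWordBreaks, and the sample phrase are made up for illustration. How well the dictionary copes with Biblical names is exactly the open question Adrian raised, and this sketch does nothing to address it.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: not part of SWORD, ICU, or Lucene.
public class ThaiWordSpacer {

    /**
     * Returns the text with a single space between each segment that the
     * locale's word-break iterator reports. Whitespace-only segments in the
     * input are dropped so that spaces are not doubled.
     */
    static String insertWordBreaks(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.trim().isEmpty()) {
                continue; // skip whitespace that was already in the input
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(segment);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample phrase chosen only for illustration.
        String thai = "สวัสดีชาวโลก";
        // Where the breaks land depends on the iterator's Thai dictionary.
        System.out.println(insertWordBreaks(thai, Locale.forLanguageTag("th")));
    }
}
```

The output of something like this would be what gets handed to the indexer in place of the raw module text. Punctuation comes back as its own segment, so it ends up space-separated as well.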
Is Lucene indexing primarily aimed at speeding up access to OSIS coded text files? Or would it also work with the other formats? I've kept the Thai modules in 'gbf' format to keep the file sizes down and search speeds slightly faster.
Indexing works on Bible modules, regardless of format. Commentaries should work too. GenBooks didn't work the last time I tried, and I haven't tried any dictionaries.
--Chris