The analyzer used to index a module must also be the analyzer used to parse the search request. The analyzer Sword currently uses is for English. The Lucene distribution ships analyzers for Russian and German, and Lucene's beta sandbox has analyzers for a few other languages. If Sword uses different analyzers for different modules, then the analyzer will need to be stored against the module (kinda like defining a font for a particular module). If indexes are prebuilt and downloadable, then adding the analyzer to the conf is a consideration.
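In Java Lucene terms the point looks roughly like this (a minimal sketch; Sword itself goes through CLucene, so the class names, the "content" field, and the index path are illustrative): the same Analyzer instance gets handed to both the IndexWriter and the QueryParser.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.store.FSDirectory;

    public class SameAnalyzerSketch {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();   // whatever the module was indexed with

            // Index time: the analyzer determines which terms end up on disk.
            IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("kjv-index")), cfg)) {
                // ... add one document per verse here ...
            }

            // Search time: the query must be tokenized/filtered the same way,
            // or stemmed/lowercased index terms will never match.
            QueryParser parser = new QueryParser("content", analyzer);
            System.out.println(parser.parse("In the beginning"));
        }
    }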
An analyzer consists of a tokenizer plus various filters (e.g. lowercase filter, stop word filter, stemming filter, punctuation filter). These do differ by locale, sometimes in subtle ways. One obvious difference is that the "stop" word list (words that are not indexed) differs by language. So pre-filtering the query would not work.
So the "lucene" way of doing things is to write analyzers and not pre-filters. The analyzers could be written using ICU.
Chris Little wrote:
Adrian Korten wrote:
g'day,
I've been wondering whether Thai would benefit from Lucene. Even if it does support UTF-8, I doubt that Lucene supports Thai when no word breaks are provided. Even if it had the smarts to handle Thai word-breaking like ICU does, it would stumble over the Biblical words. Soooo, I haven't tried it.
Hopefully someone who actually knows what Lucene indexes will answer this better (and especially correct me if I'm wrong), but I expect Lucene would benefit Thai searching somewhat because it can search within words, not just on full words. (By 'words' here, I'm using the definition of "words" in French: anything with whitespace on both sides.)
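As a hedged illustration of "searching within words": if a whitespace-delimited Thai run gets indexed as one long token, a wildcard query can still match a substring of it (leading wildcards are slow, but they do work). The field name and index path below are made up:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.FSDirectory;

    public class SubstringSearchSketch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher =
                new IndexSearcher(DirectoryReader.open(FSDirectory.open(Paths.get("thai-index"))));
            // Match "God" (พระเจ้า) buried inside a longer unsegmented token.
            Query q = new WildcardQuery(new Term("content", "*พระเจ้า*"));
            System.out.println(searcher.search(q, 10).totalHits);
        }
    }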
We also probably could pass the text through the ICU Thai word-break iterator to add surrounding whitespace before we hand it to the Lucene indexer. Does anyone more knowledgeable know whether that would work (on the Lucene side)?
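A sketch of that pre-segmentation idea using ICU4J's word-break iterator (purely illustrative; nothing like this exists in Sword today): run the Thai text through the break iterator and re-join the pieces with spaces before handing the result to the indexer.

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class ThaiPreSegmenter {
        // Insert spaces at the word boundaries ICU finds in an unsegmented Thai run.
        public static String addWordBreaks(String text) {
            BreakIterator wb = BreakIterator.getWordInstance(new ULocale("th"));
            wb.setText(text);
            StringBuilder out = new StringBuilder();
            int start = wb.first();
            for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
                String piece = text.substring(start, end).trim();
                if (piece.isEmpty()) continue;
                if (out.length() > 0) out.append(' ');
                out.append(piece);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // An unsegmented run of Thai; the output has spaces between the words ICU recognizes.
            System.out.println(addWordBreaks("ในปฐมกาลพระเจ้าทรงสร้างฟ้าและแผ่นดิน"));
        }
    }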
Is Lucene indexing primarily aimed at speeding up access to OSIS-encoded text files? Or would it also work with the other formats? I've kept the Thai modules in GBF format to keep the file sizes down and the search speeds slightly faster.
Indexing works on Bible modules, regardless of format. Commentaries should work too. GenBooks didn't work last I tried and I haven't tried any dictionaries.