Re: [sword-devel] indexed search discrepancy

DM Smith Sun, 30 Aug 2009 07:19:41 -0700


On Aug 29, 2009, at 10:42 PM, Matthew Talbert wrote:

If backward compatibility is ok to be broken, I suggest changing from
StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to begin with
and will index the text without the silly transformations that the
StandardAnalyzer does.


Just out of curiosity, what are the silly transformations?


See: http://www.gossamer-threads.com/lists/lucene/java-user/80838

Basically, the StandardAnalyzer has a tokenizer that recognizes complex patterns to determine word boundaries. By and large, the patterns it transforms (e-mail addresses, host names, ...) won't be found in the Bible. Maybe in commentaries and gen books. But there is a cost to running an expensive analyzer that generally does nothing and occasionally does something unexpected.

The SimpleAnalyzer merely looks for word boundaries that are appropriate for English. It is not appropriate for languages that have different punctuation or word boundaries. There are a bunch of contributed analyzers for different languages (e.g. Thai, Chinese) that are more appropriate for them. In the upcoming Lucene 3.0 release there will be analyzers for more languages, including Farsi. These could be ported from Java to C++ if they are valuable to SWORD.
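
To make the word-boundary behavior concrete, here is a rough sketch in plain Python. It is not Lucene or CLucene code; the function name simple_tokens is made up for illustration, and it assumes the SimpleAnalyzer behavior as I understand it: break the text at every non-letter character and lowercase what is left, with no stopword removal.

```python
def simple_tokens(text):
    """Imitate a letter-based tokenizer: split at non-letters, lowercase."""
    tokens = []
    current = []
    for ch in text:
        if ch.isalpha():
            # Letters accumulate into the current token, lowercased.
            current.append(ch.lower())
        elif current:
            # Any non-letter ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(simple_tokens("In the beginning God created"))
# ['in', 'the', 'beginning', 'god', 'created']
print(simple_tokens("sword-devel@crosswire.org"))
# ['sword', 'devel', 'crosswire', 'org']
```

Note that common words such as "the" stay in the token stream, and that an e-mail address is simply cut into its letter runs rather than being recognized as a unit.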

Another area that contributors to JSword have found useful: stemming. This is something that is an option on the JSword analyzers. There are a number of languages for which there are stemmers.
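
For anyone unfamiliar with stemming, the idea is to reduce inflected forms to a common stem so that a search for one form finds the others. The following is a deliberately naive suffix stripper in plain Python, only to show the idea. It is not what JSword or Lucene does; the names naive_stem and SUFFIXES are invented here, and real stemmers use far more careful, language-specific rules.

```python
# Hypothetical suffix list for illustration; checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("walk", "walks", "walked", "walking"):
    print(form, "->", naive_stem(form))
# walk -> walk
# walks -> walk
# walked -> walk
# walking -> walk
```

With stems like these in the index, a query for "walked" would also match verses containing "walking". A toy like this gets many English words wrong, which is why per-language stemmers exist.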


In Him,
        DM


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
