On Aug 29, 2009, at 10:42 PM, Matthew Talbert wrote:



>> If backward compatibility is ok to be broken, I suggest changing from
>> StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to begin with
>> and will index the text without the silly transformations that the
>> StandardAnalyzer does.
>
> Just out of curiosity, what are the silly transformations?

See: http://www.gossamer-threads.com/lists/lucene/java-user/80838

Basically, the StandardAnalyzer has a tokenizer that recognizes complex patterns to determine word boundaries. By and large, the things those patterns target (e-mail addresses, host names, ...) won't be found in the Bible, though they may turn up in commentaries and gen books. So there is a cost to running an expensive analyzer that generally does nothing and occasionally does something unexpected.

The SimpleAnalyzer merely looks for word boundaries that are appropriate for English. It is not appropriate for languages that have different punctuation or word boundaries. There are a bunch of contributed analyzers for different languages (e.g. Thai, Chinese) that are more appropriate for them. In the upcoming Lucene 3.0 release there will be analyzers for more languages, including Farsi. These could be ported from Java to C++ if they are valuable to SWORD.
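
To make the difference concrete, here is a rough sketch in plain Python (not Lucene code). simple_tokens approximates what SimpleAnalyzer does: break at anything that is not a letter, then lowercase. pattern_tokens is a hypothetical stand-in for a pattern-aware tokenizer; its regular expression is my own simplification and not the grammar StandardAnalyzer actually uses.

```python
import re


def simple_tokens(text):
    """Approximate SimpleAnalyzer: split at non-letters, lowercase the rest."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():              # Unicode-aware letter test
            current.append(ch.lower())
        elif current:                 # a non-letter closes the current token
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Hypothetical patterns, for illustration only (an assumption, not Lucene's).
PATTERN = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+          # e-mail address, kept whole
  | [A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+     # host name, kept whole
  | \w+                                   # ordinary word
""", re.VERBOSE)


def pattern_tokens(text):
    """Toy pattern-aware tokenizer: tries the complex patterns before words."""
    return [m.group(0).lower() for m in PATTERN.finditer(text)]


verse = "In the beginning God created the heaven and the earth."
footer = "Write to sword-devel@crosswire.org or see www.crosswire.org."

for sample in (verse, footer):
    print(simple_tokens(sample))
    print(pattern_tokens(sample))
```

On the verse both functions return the same tokens, so the extra pattern matching buys nothing. On the footer line the simple tokenizer breaks the address into sword, devel, crosswire, org, while the pattern-aware one keeps the address and the host name as single tokens.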

Another area that contributors to JSword have found useful is stemming, which is an option on the JSword analyzers. Stemmers are available for a number of languages.

In Him,
        DM


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
