On Aug 30, 2009, at 4:15 PM, Matthew Talbert <ransom1...@gmail.com> wrote:

Just out of curiosity, what are the silly transformations?

See: http://www.gossamer-threads.com/lists/lucene/java-user/80838

Basically, the StandardAnalyzer has a tokenizer that recognizes complex patterns to determine word boundaries. By and large, the things those patterns target (e-mail addresses, host names, ...) won't be found in the Bible, though they may turn up in commentaries and gen books. So there is a cost to running an expensive analyzer that generally does nothing and occasionally does something unexpected.

The SimpleAnalyzer merely looks for word boundaries that are appropriate for English. It is not appropriate for languages that have different punctuation or word boundaries. There are a bunch of contributed analyzers for different languages (e.g. Thai, Chinese) that are more appropriate for them. In the upcoming Lucene 3.0 release there will be analyzers for more languages, including Farsi. These could be ported from Java to C++ if they are valuable to SWORD.
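
To make the difference concrete, here is a rough sketch in Python. The two regular expressions are made up for illustration and are not the actual Lucene grammars; the function names are hypothetical as well. The first tokenizer takes lowercased runs of ASCII letters (a simplification of what a simple analyzer does), the second tries an e-mail pattern before falling back to letters.

```python
import re

# Hypothetical patterns for illustration only; Lucene's real grammars differ.
LETTERS = re.compile(r"[A-Za-z]+")
EMAIL_OR_LETTERS = re.compile(
    r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]+|[A-Za-z]+"
)

def simple_tokens(text):
    """Lowercased runs of ASCII letters; everything else is a boundary."""
    return [m.group(0).lower() for m in LETTERS.finditer(text)]

def pattern_tokens(text):
    """Same idea, but an e-mail-like pattern is tried before plain letters."""
    return [m.group(0).lower() for m in EMAIL_OR_LETTERS.finditer(text)]
```

On ordinary verse text both functions return the same tokens, so the extra pattern only adds work. They differ only on input such as "Mail dm@example.org", where the first splits the address into pieces and the second keeps it whole.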

But the StandardAnalyzer is no more appropriate for non-English,
correct?

It is no more appropriate. But it may be less.

So unless we have the non-English analyzers, then there is no
value in using the StandardAnalyzer over the simple?

Even with the non-English analyzers there is no value in the StandardAnalyzer over the SimpleAnalyzer.

clucene is still
trying to become compatible with Lucene 2 (I think it's largely done,
but not released yet). If these analyzers are for Lucene 3.0

Most are part of 2.x.

is it
possible that it would take substantial work to port them to clucene
which is still stuck in Lucene 1 compatibility?

I don't think the effort is much harder than doing an initial port to the same level. A tokenizer merely takes an input stream, breaks it up into tokens, and returns one token each time next(...) is called. What differs between the releases is how next(...) is implemented. The algorithm is the same. (BTW, I am a Lucene contributor wrt tokenizers, so my point is not merely academic ;)
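
The shape of such a tokenizer can be sketched in a few lines of Python. This is only an illustration of the "one token per call to next" idea under my own assumptions: the class name, the helper, and the letter-run rule are hypothetical and are not the Lucene or CLucene API.

```python
import io

class LetterTokenizer:
    """Hypothetical tokenizer: wraps a text stream, one token per next() call."""

    def __init__(self, stream):
        self.stream = stream

    def next(self):
        """Return the next lowercased run of letters, or None at end of input."""
        chars = []
        while True:
            ch = self.stream.read(1)
            if ch == "":          # end of stream
                break
            if ch.isalpha():
                chars.append(ch.lower())
            elif chars:           # a non-letter ends the current token
                break
        return "".join(chars) if chars else None

def all_tokens(text):
    """Drain a tokenizer over the given text into a list."""
    tok = LetterTokenizer(io.StringIO(text))
    out = []
    t = tok.next()
    while t is not None:
        out.append(t)
        t = tok.next()
    return out
```

Porting to a different release would change how next hands back its token, while the loop that finds the boundaries stays as it is.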

In His Service,
DM
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
