On Aug 30, 2009, at 4:15 PM, Matthew Talbert <ransom1...@gmail.com> wrote:

> Just out of curiosity, what are the silly transformations?

See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
Basically, the StandardAnalyzer has a tokenizer that recognizes complex
patterns to determine word boundaries. By and large, these transformations
(e-mail addresses, host names, ...) won't be found in the Bible. Maybe in
commentaries and gen books. But there is a cost of running an expensive
analyzer that generally does nothing and occasionally does something
unexpected.
The SimpleAnalyzer merely looks for word boundaries that are appropriate for
English. It is not appropriate for languages that have different punctuation
or word boundaries. There are a bunch of contributed analyzers for different
languages (e.g. Thai, Chinese) that are more appropriate for them. In the
upcoming Lucene 3.0 release there will be analyzers for more languages,
including Farsi. These could be ported from Java to C++ if they are valuable
to SWORD.
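
To make the "simple" side of that contrast concrete, here is a rough sketch in Python. It is not Lucene or CLucene code; the function name simple_tokens is made up for illustration. It only shows the general idea of breaking text at anything that is not a letter and lowercasing what is left, which works for English and falls apart for languages that mark word boundaries differently.

```python
def simple_tokens(text):
    """Illustrative only: split at every non-letter and lowercase each token."""
    tokens = []
    current = []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            # Any non-letter (space, punctuation, digit) ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(simple_tokens("In the beginning God created"))
print(simple_tokens("user@example.org"))
```

With this kind of splitting an e-mail address is simply broken into its parts, whereas a pattern-aware tokenizer like the one in the StandardAnalyzer would try to treat it as a unit. For Bible text that extra recognition rarely has anything to do.
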
> But the StandardAnalyzer is no more appropriate for non-English, correct?

It is no more appropriate. But it may be less.

> So unless we have the non-English analyzers, then there is no value in
> using the StandardAnalyzer over the simple?

Even with the non-English analyzers there is no value in the
StandardAnalyzer over the Simple.

> clucene is still trying to become compatible with Lucene 2 (I think it's
> largely done, but not released yet). If these analyzers are for Lucene 3.0

Most are part of 2.x.

> is it possible that it would take substantial work to port them to clucene
> which is still stuck in Lucene 1 compatibility?

I don't think the effort is much harder than doing an initial port to
the same level. A tokenizer merely takes an input stream and breaks it
up into tokens, returning one token each time next(...) is called. What
differs between the releases is how next(...) is implemented. The algorithm
is the same. (BTW, I am a Lucene contributor wrt tokenizers so my
point is not merely academic;)
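
As a rough illustration of that shape, here is a minimal Python sketch. The class name SketchTokenizer and its details are hypothetical and do not mirror the actual Lucene or CLucene classes or signatures. It just shows the pattern: wrap an input stream, and hand back one token per call to next() until the stream runs out.

```python
import io


class SketchTokenizer:
    """Hypothetical tokenizer: wraps a text stream, returns one token per next()."""

    def __init__(self, stream):
        self.stream = stream

    def next(self):
        """Return the next run of letters, or None when the stream is exhausted."""
        chars = []
        while True:
            ch = self.stream.read(1)
            if ch == "":        # end of stream
                break
            if ch.isalpha():
                chars.append(ch)
            elif chars:         # a non-letter ends the current token
                break
        return "".join(chars) if chars else None


tokenizer = SketchTokenizer(io.StringIO("In the beginning, God"))
token = tokenizer.next()
while token is not None:
    print(token)
    token = tokenizer.next()
```

The loop inside next() is the part that carries over from one port to another. How the token is handed back to the caller is the kind of detail that would change between API levels.
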
In His Service,
DM