That does sound like an issue. Can you open a JIRA issue for it?
Thanks,
Grant
On Mar 12, 2009, at 5:55 AM, iMe wrote:
I spotted an unexpected behavior when using the StandardAnalyzer.
This analyzer uses the StandardTokenizer, whose javadoc states:
Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.
But looking at my index with Luke, I saw that my product reference
AB-CD-1234 is split into 3 tokens, AB, CD and 123, while I expected the
tokenizer to keep it as a whole.
So it looks like the StandardTokenizer does not work as it should.
Am I right?
I had a deeper look and found the JFlex source used to generate
StandardTokenizerImpl here:
https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
And here is how "product numbers" are defined (P being the punctuation:
"_", "-", "/", "." and ","):
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
    | {HAS_DIGIT} {P} {ALPHANUM}
    | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
    | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
    | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
    | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
I am not a JFlex expert, but it looks like the pattern
{ALPHANUM} ({P} {ALPHANUM} {P} {HAS_DIGIT}) is missing, as are all other
patterns in which two digit segments or two alpha segments sit next to
each other, separated by a punctuation character.
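To make the point concrete, here is a rough sketch that approximates the NUM rule above with java.util.regex. It is not Lucene code: the class NumPatternSketch and the method isNum are hypothetical names, the three macros are simplified to ASCII letters and digits (the real JFlex macros cover more characters), and it only checks whether a whole string matches NUM. It does not reproduce JFlex's longest-match tokenization.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: approximates the NUM rule
// from StandardTokenizerImpl.jflex with a plain Java regex.
class NumPatternSketch {
    // Simplified ASCII-only stand-ins for the JFlex macros (assumption).
    static final String ALPHANUM  = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    static final String P         = "[_\\-/.,]";

    // The six alternatives of NUM, in the same order as in the grammar.
    static final Pattern NUM = Pattern.compile(
              ALPHANUM  + P + HAS_DIGIT
      + "|" + HAS_DIGIT + P + ALPHANUM
      + "|" + ALPHANUM  + "(?:" + P + HAS_DIGIT + P + ALPHANUM  + ")+"
      + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM  + P + HAS_DIGIT + ")+"
      + "|" + ALPHANUM  + P + HAS_DIGIT + "(?:" + P + ALPHANUM  + P + HAS_DIGIT + ")+"
      + "|" + HAS_DIGIT + P + ALPHANUM  + "(?:" + P + HAS_DIGIT + P + ALPHANUM  + ")+");

    // True only if the entire input is matched by NUM.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AB-1234", "AB-12-CD", "AB-CD-1234"}) {
            System.out.println(s + " matches NUM as a whole: " + isNum(s));
        }
    }
}
```

Under these assumptions, AB-1234 and AB-12-CD match as a whole, but AB-CD-1234 does not: every alternative requires a digit in every other segment, and here the first two segments (AB and CD) contain none.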
--
View this message in context:
http://www.nabble.com/StandardTokenizer-issue---tp22471475p22471475.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search