That does sound like an issue.  Can you open a JIRA issue for it?

Thanks,
Grant

On Mar 12, 2009, at 5:55 AM, iMe wrote:


I spotted an unexpected behavior when using the StandardAnalyzer.


This analyzer uses the StandardTokenizer, whose javadoc states:


Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.



But looking at my index with Luke, I saw that my product reference
AB-CD-1234 is split into 3 tokens, AB, CD and 123, while I expected the
tokenizer to keep it whole.


So it looks like the StandardTokenizer does not work as it should.


Am I right?


I had a deeper look and found the JFlex source used to generate the
StandardTokenizerImpl here:
https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex


And here is how "product numbers" are defined: (P being the punctuation:
"_", "-", "/", "." and ",")


// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)


I am not a JFlex expert, but it looks like a pattern such as
{ALPHANUM} ({P} {ALPHANUM} {P} {HAS_DIGIT}) is missing, as well as all
other patterns in which two digit segments or two alphabetic segments
are separated by a punctuation character.
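
To make the observation easier to check, here is a rough sketch that
transcribes the NUM rule quoted above into a java.util.regex pattern. It
does not run JFlex or Lucene. The class name NumRuleSketch and the
ASCII-only definitions of ALPHANUM, HAS_DIGIT and P are assumptions made
for illustration (the real macros in the .jflex file cover more
characters), and JFlex's longest-match behavior across rules is not
modeled. It only asks whether a whole string could be a single NUM token.

```java
import java.util.regex.Pattern;

public class NumRuleSketch {
    // Simplified, ASCII-only stand-ins for the JFlex macros (assumption:
    // the real grammar also accepts non-ASCII letters and digits).
    static final String P = "[_\\-/.,]";
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";

    // The six alternatives of NUM, in the same order as in the grammar.
    static final Pattern NUM = Pattern.compile(
          ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT
              + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM
              + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // True if the whole string matches the NUM rule under this sketch.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"AB-CD-1234", "CD-1234", "AB-12-CD", "AB-CD"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNum(s));
        }
    }
}
```

Under these assumptions, AB-CD-1234 and AB-CD do not match as a whole,
while CD-1234 and AB-12-CD do. Every alternative requires every other
segment to contain a digit, so two adjacent segments without a digit
(AB and CD) can never be part of the same NUM token.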


--
View this message in context: 
http://www.nabble.com/StandardTokenizer-issue---tp22471475p22471475.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


