I spotted an unexpected behavior when using the StandardAnalyzer.
This analyzer uses the StandardTokenizer, whose javadoc states: "Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split." But looking at my index with Luke, I saw that my product reference AB-CD-1234 is split into three tokens (AB, CD and 1234), while I was expecting the tokenizer to keep it whole. So it looks like the StandardTokenizer does not work as it should. Am I right?

I had a deeper look, and found the jflex source used to generate StandardTokenizerImpl here:
https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

And here is how "product numbers" are defined (P being the punctuation: "_", "-", "/", "." and ","):

    // floating point, serial, model numbers, ip addresses, etc.
    // every other segment must have at least one digit
    NUM = ({ALPHANUM} {P} {HAS_DIGIT}
         | {HAS_DIGIT} {P} {ALPHANUM}
         | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
         | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
         | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
         | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

I am not a jflex expert, but it looks like the {ALPHANUM} ({P} {ALPHANUM} {P} {HAS_DIGIT}) alternative is missing? As well as all the other patterns containing two digit segments or two alpha segments separated by a punctuation.
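As a quick sanity check (this is NOT the generated tokenizer itself — just my transcription of the NUM rule into a java.util.regex pattern, with the ALPHANUM and HAS_DIGIT macros simplified to ASCII), the rule does seem to reject AB-CD-1234, because the CD segment has no digit and no alternative allows two consecutive all-letter segments:

```java
import java.util.regex.Pattern;

public class NumPatternDemo {
    // Simplified ASCII approximations of the jflex macros
    static final String ALPHANUM  = "[A-Za-z0-9]+";
    static final String P         = "[_\\-/.,]";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";

    // The NUM rule from StandardTokenizerImpl.jflex, transcribed
    // alternative by alternative
    static final Pattern NUM = Pattern.compile(
        ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT
              + "(" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM
              + "(" + P + HAS_DIGIT + P + ALPHANUM + ")+"
    );

    public static void main(String[] args) {
        // CD has no digit, so no alternative matches the whole token
        System.out.println(NUM.matcher("AB-CD-1234").matches());  // false
        // CD1 has a digit: "every other segment" now does, so it matches
        System.out.println(NUM.matcher("AB-CD1-1234").matches()); // true
    }
}
```

If this transcription is faithful, it would confirm the behavior I see in Luke: AB-CD-1234 never matches NUM, falls through to the plain ALPHANUM rule, and gets split at the hyphens instead.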