"Michael D. Curtin" <[EMAIL PROTECTED]> wrote on 07/06/2007 13:30:28:
> > I think it splits by hyphens unless the no-hyphen
> > part has digits, so:
> >    np-pandock-a7
> > becomes
> >    np
> >    pandock-a7
> > This is for the indexing part.
>
> Wow! Do you know the thinking behind that, i.e. why a number in a
> hyphenated expression prevents the split?

I actually asked myself the same question before the previous post.
The javadocs for StandardAnalyzer state only the obvious ("a
grammar-based tokenizer constructed with JavaCC...."), and the wiki
page AnalysisParalysis doesn't explain much of the logic behind it
either.

From the StandardAnalyzer javacc grammar:

  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
  <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
  .... etc.
  <#P: ("_"|"-"|"/"|"."|",") >

My understanding of this: a non-whitespace sequence is broken at any
of these five characters

  _ - / . ,

unless the part that follows has a digit, in which case it is assumed
to be (part of) a serial number, model number, etc.

Seems we can improve the documentation here.
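
To make that reading concrete, here is a minimal standalone sketch of the rule as I understand it. It does not use Lucene at all: the class name HyphenSplitSketch and its methods are made up for illustration, and it models only the simplified "keep the separator if the following part has a digit" reading. The real grammar has further alternatives (elided as ".... etc." above), so the actual tokenizer may behave differently on other inputs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: models the simplified reading of the
// NUM rule described above, not Lucene's actual StandardTokenizer.
class HyphenSplitSketch {

    // The five separator characters from the <#P> production.
    private static final String SEPARATORS = "_-/.,";

    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Splits one non-whitespace sequence at a separator unless the
    // segment that follows the separator contains a digit.
    static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < word.length()) {
            char c = word.charAt(i);
            if (SEPARATORS.indexOf(c) < 0) {
                current.append(c);
                i++;
                continue;
            }
            // Find the segment that follows this separator.
            int end = i + 1;
            while (end < word.length() && SEPARATORS.indexOf(word.charAt(end)) < 0) {
                end++;
            }
            String next = word.substring(i + 1, end);
            if (hasDigit(next)) {
                // Following part has a digit: keep it attached.
                current.append(c).append(next);
            } else {
                // No digit: break here and start a new token.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                }
                current = new StringBuilder(next);
            }
            i = end;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("np-pandock-a7")); // prints [np, pandock-a7]
    }
}
```

Under this sketch the example from earlier in the thread comes out as np and pandock-a7, which matches what was reported for indexing.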