Hello,
 
I've been using Lucene in a small project for a few weeks now and just
ran into a problem. My index contains terms that contain one or more
underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately the
tokenizer splits such a term into multiple tokens at the underscores,
except at the last underscore.
 
For example, the word XYZZZY_DE_SA0001 is tokenized as follows:
 
1. Token: XYZZZY
2. Token: DE_SA0001
 
which does not match my expectations. Either the tokenizer should
split at every underscore or at none.
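 
In case it helps, here is a minimal sketch of a test that reproduces
this for me (the class name is arbitrary; I feed the term straight
through StandardTokenizer so the raw token boundaries are visible,
without the lower-casing that StandardAnalyzer adds on top):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class UnderscoreTest {
    public static void main(String[] args) throws Exception {
        // Tokenize the problematic term directly with StandardTokenizer.
        TokenStream ts =
            new StandardTokenizer(new StringReader("XYZZZY_DE_SA0001"));
        Token t;
        int i = 1;
        // Print each token the tokenizer produces.
        while ((t = ts.next()) != null) {
            System.out.println(i++ + ". Token: " + t.termText());
        }
        ts.close();
    }
}

With Lucene 1.4.3 this prints exactly the two tokens listed above.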
 
I'm using Lucene 1.4.3 with
org.apache.lucene.analysis.standard.StandardAnalyzer and Java 1.4.2_08.
 
Has anybody experienced the same behaviour, or can anybody explain it?
Could it be a bug in StandardTokenizer?
 
Many thanks in advance
 
Sebastian Seitz
