Inconsistent tokenizing of words containing underscores.

2005-08-29 Thread Is, Studcio
Hello, I've been using Lucene for a few weeks now in a small project and just ran into a problem. My index contains terms with one or more underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately the tokenizer splits such a term into multiple tokens at the underscores, except ...
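A quick way to reproduce the behavior being described, using the Lucene 1.x-era TokenStream API of the time (Token.next() / termText()); the class name UnderscoreDemo and field name "f" are illustrative only:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class UnderscoreDemo {
    public static void main(String[] args) throws Exception {
        String[] samples = { "XYZZZY_DE_SA0001", "XYZZZY_AT0001" };
        StandardAnalyzer analyzer = new StandardAnalyzer();
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + ":");
            TokenStream ts = analyzer.tokenStream("f", new StringReader(samples[i]));
            // Print each token StandardAnalyzer emits. Whether the
            // underscore-joined parts stay together depends on where the
            // digits fall, which is the inconsistency reported above.
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println("  " + t.termText());
            }
        }
    }
}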

RE: Inconsistent tokenizing of words containing underscores.

2005-08-30 Thread Is, Studcio
Hello, first of all thanks to everyone for the replies and suggestions. I solved my problem by adapting StandardTokenizer.jj and recompiling it with javacc. I replaced line 90: (<LETTER>|<DIGIT>)+ > with (<LETTER>|<DIGIT>|"_")+ > so that the underscore is treated like an alphanumeric character. In my first tests, it seems to work ...
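For reference, the ALPHANUM production being edited looks roughly like this in the Lucene 1.x StandardTokenizer.jj (the exact wording and line number vary by release; the comments here are added for context):

// basic word: a sequence of letters and digits
<ALPHANUM: (<LETTER>|<DIGIT>)+ >

// changed so that underscore is accepted as part of a word
<ALPHANUM: (<LETTER>|<DIGIT>|"_")+ >

The tokenizer is then regenerated by running javacc on the grammar (javacc StandardTokenizer.jj) and rebuilding. Note that documents already indexed were tokenized under the old rules, so the index must be rebuilt for searches against underscored terms to match consistently.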