What seems to be working for me is a punctuation filter that removes /,
-, _, etc. and emits the token without them. Then "most" of the time the
word XYZZZY_DE_SA0001 will be tokenized as XYZZZYDESA0001. You will have
to use the same punctuation filter on the strings before you search for
them.

Tom
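For reference, here is a minimal sketch of the kind of filter Tom
describes, written against the Lucene 1.4-era TokenStream API that was
current at the time of this thread. The class name PunctuationFilter and
the exact character set are assumptions for illustration, not Tom's
actual code:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Strips '/', '-' and '_' from each token, so XYZZZY_DE_SA0001 is
// indexed as XYZZZYDESA0001. Class name and character set are
// illustrative assumptions.
public class PunctuationFilter extends TokenFilter {

    public PunctuationFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;            // end of stream
        }
        // Remove the punctuation characters but keep the original offsets.
        String stripped = t.termText().replaceAll("[/_-]", "");
        if (stripped.length() == 0) {
            return next();          // token was all punctuation; skip it
        }
        return new Token(stripped, t.startOffset(), t.endOffset());
    }
}

As Tom points out, queries must go through the same filter; the usual
approach is to wrap it in an Analyzer and hand that same Analyzer to both
IndexWriter and QueryParser.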
-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 3:15 PM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.
On Monday 29 August 2005 19:21, Jeremy Meyer wrote:
> The expected behavior is to sometimes treat a character as indicating a
> new token and other times to ignore the same character?
It depends on whether there are digits in the token. It's documented in
the javacc source for the tokenizer(?).
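A quick way to observe this digit-dependent behaviour is to dump the
tokens StandardAnalyzer produces for inputs with and without digits.
This is a throwaway diagnostic sketch (the class name TokenDump is
assumed), again against the 1.4-era API:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Prints the tokens StandardAnalyzer produces, comparing an input
// containing digits against one without.
public class TokenDump {

    public static void main(String[] args) throws Exception {
        String[] samples = { "XYZZZY_DE_SA0001", "XYZZZY_DE_SA" };
        StandardAnalyzer analyzer = new StandardAnalyzer();
        for (int i = 0; i < samples.length; i++) {
            TokenStream ts =
                analyzer.tokenStream("field", new StringReader(samples[i]));
            System.out.print(samples[i] + " ->");
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.print(" [" + t.termText() + "]");
            }
            System.out.println();
        }
    }
}

In the StandardTokenizer grammar, tokens that contain digits can match a
rule that tolerates embedded punctuation, which is why otherwise similar
inputs tokenize differently.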
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 10:56 AM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.
That's StandardAnalyzer's expected behaviour. If you want
tokenization to occur only on white spaces, use WhitespaceAnalyzer. If
you want custom behaviour, you should write an Analyzer (there should
be a FAQ entry with an example).
Otis
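A custom Analyzer along the lines Otis suggests can be quite small. A
minimal sketch, assuming the Lucene 1.4-era API (the class name is
illustrative):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Splits only on whitespace, so XYZZZY_DE_SA0001 survives as a single
// token, then lowercases it for case-insensitive matching.
public class UnderscoreKeepingAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

The same analyzer must be used at index time and at query-parse time,
or the query terms will not match what was indexed.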
--- "Is, Studcio" <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I've been using Lucene for a few weeks now in a small project and just
> ran into a problem. My index contains words that contain one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately the
> tokenizer tokenizes / splits the word into multiple tokens at the
> underscores, except