Hi Valery,

I'm inclined to disagree with the idea that a token should not be split again downstream. I think that is actually a much easier way to handle it. I would have the tokenizer return the longest match, and then split it in a token filter. In fact I have dones this before and it has worked fine for me.

I think you will have to maintain some state within the token filter either way - think about how you'd do that in each case: -To split longer tokens, you will just have to split the token into a list of sub-tokens, then temporarily store these, and return them on subsequent calls to the filter. When the list is empty, you get another token from the tokenizer.

-To join up shorter tokens, you would have to basically do what he original tokenizer did - try to match sequences of characters to patterns.

The second way sounds harder to me, and it's the job of the original tokenizer anyway. Do the simplest thing that could possibly work!

Regards,
-John

Valery wrote:
Hi Robert,
so, would you expect a Tokenizer to consider '/' in both cases as a separate
Token?

Personally, I see no problem if Tokenzer would do the following job:

"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"} and come up with "C" and "C++" tokens after processing through the
downstream tokenfilters.

Similarly:

"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"} and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.

I try to follow a spirit that a token (or its lexem) usually should never be
parsed again. One can build  more complex (compound) things from the tokens.
However, usually one never chops a lexem into smaller pieces.

What do you think, Robert?

regards,
Valery

------------------------------------------------------------------------


No virus found in this incoming message.
Checked by AVG - www.avg.com Version: 8.5.392 / Virus Database: 270.13.62/2315 - Release Date: 08/20/09 06:05:00



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to