Hi Valery,
I'm inclined to disagree with the idea that a token should not be split
again downstream. I think that is actually a much easier way to handle
it. I would have the tokenizer return the longest match, and then split
it in a token filter. In fact I have dones this before and it has worked
fine for me.
I think you will have to maintain some state within the token filter
either way - think about how you'd do that in each case:
-To split longer tokens, you will just have to split the token into a
list of sub-tokens, then temporarily store these, and return them on
subsequent calls to the filter. When the list is empty, you get another
token from the tokenizer.
-To join up shorter tokens, you would have to basically do what he
original tokenizer did - try to match sequences of characters to patterns.
The second way sounds harder to me, and it's the job of the original
tokenizer anyway. Do the simplest thing that could possibly work!
Regards,
-John
Valery wrote:
Hi Robert,
so, would you expect a Tokenizer to consider '/' in both cases as a separate
Token?
Personally, I see no problem if Tokenzer would do the following job:
"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
and come up with "C" and "C++" tokens after processing through the
downstream tokenfilters.
Similarly:
"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
I try to follow a spirit that a token (or its lexem) usually should never be
parsed again. One can build more complex (compound) things from the tokens.
However, usually one never chops a lexem into smaller pieces.
What do you think, Robert?
regards,
Valery
------------------------------------------------------------------------
No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.5.392 / Virus Database: 270.13.62/2315 - Release Date: 08/20/09 06:05:00
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org