Hi John, (aren't you the same John Byrne who is a key contributor to the great OpenSSI project?)
John Byrne-3 wrote:
>
> I'm inclined to disagree with the idea that a token should not be split
> again downstream. I think that is actually a much easier way to handle
> it. I would have the tokenizer return the longest match, and then split
> it in a token filter. In fact I have done this before and it has worked
> fine for me.
>

Well, I could soften my position: if the token re-parsing is done by looking at the current lexeme value only, then it might perhaps be acceptable. In contrast, if during your re-parsing you have to look into the upstream character data "several filters backwards", then, IMHO, it is rather messy and unacceptable.

Regarding this part:

John Byrne-3 wrote:
>
> I think you will have to maintain some state within the token filter
> [...]
>

I would wait for Simon's answer to the question "What do you expect from the Tokenizer?" Then I will give my two cents on this, and perhaps I can sum up all the opinions and we can adopt a common conclusion. :)

regards,
Valery

--
View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25076151.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
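P.S. For illustration only: the "split again in a token filter, looking at the current lexeme only" idea John describes might be sketched as below. This is plain Java with no Lucene dependency; the class name, the iterator shape, and the strip-the-symbols splitting rule are all hypothetical stand-ins for a real `TokenFilter`, chosen just to show that the only state needed is a small queue of pending sub-tokens, with no reach back into upstream character data.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical sketch of a downstream re-splitting filter. It wraps an
// upstream token iterator (standing in for the tokenizer's longest-match
// output) and derives extra tokens from the current lexeme's text alone.
class ResplitFilter implements Iterator<String> {
    private final Iterator<String> upstream;
    // The per-token state John mentions: sub-tokens waiting to be emitted.
    private final Deque<String> pending = new ArrayDeque<>();

    ResplitFilter(Iterator<String> upstream) {
        this.upstream = upstream;
    }

    @Override
    public boolean hasNext() {
        if (pending.isEmpty() && upstream.hasNext()) {
            String tok = upstream.next();
            pending.add(tok); // emit the longest match first, e.g. "C++"
            // Illustrative splitting rule: also emit the token with leading
            // and trailing symbol characters stripped, e.g. "C++" -> "C",
            // ".NET" -> "NET". Only the current lexeme is inspected.
            String bare = tok.replaceAll("^\\W+|\\W+$", "");
            if (!bare.isEmpty() && !bare.equals(tok)) {
                pending.add(bare);
            }
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        return pending.poll();
    }
}
```

Wrapping an upstream stream of `"C++"`, `".NET"`, `"code"` would then yield `"C++"`, `"C"`, `".NET"`, `"NET"`, `"code"` — the longest matches survive for exact queries, and the bare words are available for ordinary ones.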