On Fri, Aug 21, 2009 at 12:51 PM, Valery<khame...@gmail.com> wrote:
>
> Hi John,
>
> (aren't you the same John Byrne who is a key contributor to the great
> OpenSSI project?)
>
>
> John Byrne-3 wrote:
>>
>> I'm inclined to disagree with the idea that a token should not be split
>> again downstream. I think that is actually a much easier way to handle
>> it. I would have the tokenizer return the longest match, and then split
>> it in a token filter. In fact I have done this before and it has worked
>> fine for me.
>>
>
> Well, I could soften my position: if the token re-parsing is done by
> looking into the current lexeme value only, then it might perhaps be
> acceptable. In contrast, if during your re-parsing you have to look into
> the upstream character data "several filters backwards", then, IMHO, it
> is rather messy and unacceptable.
>
>
> Regarding this part:
>
> John Byrne-3 wrote:
>>
>> I think you will have to maintain some state within the token filter
>> [...]
>>
>
> I would wait for Simon's answer to the question "What do you expect from
> the Tokenizer?"
>

I already responded... again...

<snip>
Well, Tokenizer and TokenFilter are both subclasses of TokenStream, but
their inputs differ. A Tokenizer gets its input from a Reader and creates
Tokens from it. A TokenFilter uses the tokens created by the Tokenizer and
does further processing. For instance, an Analyzer that uses a
WhitespaceTokenizer as the input for a LowerCaseFilter would produce the
following:

Input: C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".NET"; offset: 6->10; pos: 3]

LowerCaseFilter:
[Tokenstring: "c#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

If you want to do any further processing with those tokens, you can add
your own TokenFilter and modify the tokens as you need. You could do the
whole job in a Tokenizer, but that would not be a good separation of
concerns, right?!
</snip>

A Tokenizer splits the input stream into tokens (Token.java), and
TokenFilter subclasses operate on those. I expect from a Tokenizer that it
provides me a stream of tokens :) - how those tokens are created is the
responsibility of the Tokenizer. Lower-casing, removing stopwords, adding
payloads etc. is the job of a TokenFilter.
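In code, that chain looks roughly like this (an untested sketch; the
Analyzer class name is just an example):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Illustrative Analyzer: whitespace tokenization followed by lower-casing,
// producing exactly the token chain shown above.
public class WhitespaceLowerCaseAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader); // "C#", "or", ".NET"
    stream = new LowerCaseFilter(stream);                 // "c#", "or", ".net"
    // any further processing would go into your own TokenFilter, e.g.:
    // stream = new MyTokenFilter(stream);
    return stream;
  }
}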
simon

> Then I will give my 2 cents on this, and perhaps then I could sum up all
> opinions and adopt a common conclusion.
> :)
>
> regards
> Valery
>
> --
> View this message in context:
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25076151.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
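PS: For the "emit the longest match in the Tokenizer and split it again in
a TokenFilter" approach John described, a rough, untested sketch against
the attribute-based TokenStream API could look like the following. The
class name and the dot-splitting rule are only illustrative:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Passes the original (longest-match) token through unchanged and then
// emits its '.'-separated parts at the same position. The split parts keep
// the original token's offsets, which is a simplification for this sketch.
public final class DotSplitFilter extends TokenFilter {

  private final TermAttribute termAtt;
  private final PositionIncrementAttribute posIncrAtt;

  // state kept inside the filter: parts of the current token still to emit
  private final LinkedList<String> pending = new LinkedList<String>();

  public DotSplitFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit a buffered part; increment 0 stacks it on the original token
      termAtt.setTermBuffer(pending.removeFirst());
      posIncrAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // only looks at the current term text, never at upstream character data
    String term = termAtt.term();
    if (term.indexOf('.') >= 0) {
      for (String part : term.split("\\.")) {
        if (part.length() > 0) {
          pending.add(part);
        }
      }
    }
    return true; // the original token is returned as-is
  }

  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

All the state lives inside the filter, so it avoids the "several filters
backwards" problem Valery mentioned; it would slot into the chain above as
stream = new DotSplitFilter(stream).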