On Fri, Aug 21, 2009 at 12:51 PM, Valery<khame...@gmail.com> wrote:
>
> Hi John,
>
> (aren't you the same John Byrne who is a key contributor to the great
> OpenSSI project?)
>
>
> John Byrne-3 wrote:
>>
>> I'm inclined to disagree with the idea that a token should not be split
>> again downstream. I think that is actually a much easier way to handle
>> it. I would have the tokenizer return the longest match, and then split
>> it in a token filter. In fact I have done this before and it has worked
>> fine for me.
>>
>
> Well, I could soften my position: if the token re-parsing is done by
> looking into the current lexeme value only, then it might perhaps be
> acceptable. In contrast, if during your re-parsing you have to look into
> the upstream character data "several filters backwards", then, IMHO, it
> is rather messy and unacceptable.
>
>
> Regarding this part:
>
> John Byrne-3 wrote:
>>
>> I think you will have to maintain some state within the token filter
>> [...]
>>
>
> I would wait for Simon's answer to the question "What do you expect from
> the Tokenizer?"
>

I already responded... again...

<snip>
Well, Tokenizer and TokenFilter are both subclasses of TokenStream, but
their inputs differ. A Tokenizer gets its input from a Reader and creates
Tokens from it. A TokenFilter uses the tokens created by the Tokenizer and
does further processing. For instance, an Analyzer that uses a
WhitespaceTokenizer as the input for a LowerCaseFilter would produce the
following:

Input: C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".NET"; offset: 6->10; pos: 3]

LowerCaseFilter:
[Tokenstring: "c#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

If you want to do any further processing with those tokens, you can add
your own TokenFilter and modify the tokens as you need. You could do the
whole job in a Tokenizer, but that would not be a good separation of
concerns, right?!
</snip>

A Tokenizer splits the input stream into tokens (Token.java), and
TokenFilter subclasses operate on those. I expect from a Tokenizer that it
provides me a stream of tokens :) - how those tokens are created is the
responsibility of the Tokenizer. Lower-casing, removing stopwords, adding
payloads etc. is the job of a TokenFilter.
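In code, that chain looks roughly like this (an untested sketch; the
Analyzer class name is just an example):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Illustrative Analyzer: whitespace tokenization followed by lower-casing,
// producing exactly the token chain shown above.
public class WhitespaceLowerCaseAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader); // "C#", "or", ".NET"
    stream = new LowerCaseFilter(stream);                 // "c#", "or", ".net"
    // any further processing would go into your own TokenFilter, e.g.:
    // stream = new MyTokenFilter(stream);
    return stream;
  }
}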
simon

> Then I will give my 2 cents on this, and perhaps then I could sum up all
> opinions and adopt a common conclusion.
> :)
>
> regards
> Valery
>
> --
> View this message in context:
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25076151.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
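PS: For the "emit the longest match in the Tokenizer and split it again in
a TokenFilter" approach John described, a rough, untested sketch against
the attribute-based TokenStream API could look like the following. The
class name and the dot-splitting rule are only illustrative:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Passes the original (longest-match) token through unchanged and then
// emits its '.'-separated parts at the same position. The split parts keep
// the original token's offsets, which is a simplification for this sketch.
public final class DotSplitFilter extends TokenFilter {

  private final TermAttribute termAtt;
  private final PositionIncrementAttribute posIncrAtt;

  // state kept inside the filter: parts of the current token still to emit
  private final LinkedList<String> pending = new LinkedList<String>();

  public DotSplitFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit a buffered part; increment 0 stacks it on the original token
      termAtt.setTermBuffer(pending.removeFirst());
      posIncrAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // only looks at the current term text, never at upstream character data
    String term = termAtt.term();
    if (term.indexOf('.') >= 0) {
      for (String part : term.split("\\.")) {
        if (part.length() > 0) {
          pending.add(part);
        }
      }
    }
    return true; // the original token is returned as-is
  }

  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

All the state lives inside the filter, so it avoids the "several filters
backwards" problem Valery mentioned; it would slot into the chain above as
stream = new DotSplitFilter(stream).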