On Fri, Aug 21, 2009 at 10:26 AM, Valery <khame...@gmail.com> wrote:
>
> Hi Simon,
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
>> [...]?!
>>
>> simon
>>
>
> Yes, I did; please find the info in the initial message. Here are the
> excerpts:
>
> Valery wrote:
>>
>> 2) WhitespaceTokenizer gives me a lot of lexemes that actually should
>> have been chopped into smaller pieces. Example: "C/C++" comes out as a
>> single lexeme. If I follow this way, I end up with "tokenization of
>> tokens" -- that sounds a bit odd, doesn't it?
>>
>> 3) CharTokenizer allows me to make '/' a token-emitting char as well, but
>> then '/' gets immediately lost, just like the whitespace chars. As a
>> result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search
>> the original char stream for the '/' char to rebuild the "SAP R/3" term
>> as a whole.
>
> Regarding this part:
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to [...] and do any further processing in a
>> custom TokenFilter?!
>> simon
>>
>
> Yes, and that's why I sent the initial post "Any Tokenizator friendly
> to C++, C#, .NET, etc ?"
> Actually, I am a bit confused about doing a Tokenizer's job in filters and
> re-parsing the char stream.
>
> Simon, what do you expect from the Tokenizer?
> (In other words, what job is exclusively the Tokenizer's, and should
> rather not be done in downstream filters?)
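Point 3 above refers to CharTokenizer's isTokenChar() hook, which decides
whether a character belongs to a token or terminates it; terminating
characters are dropped. A minimal sketch of what Valery describes, assuming
the Lucene 2.9-era char-based API (later versions use int code points) and a
hypothetical class name:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical subclass illustrating point 3: treating '/' as a
// token-terminating char splits "SAP R/3" into "SAP", "R", "3",
// and the '/' itself is never emitted as a token.
public class SlashSplittingTokenizer extends CharTokenizer {

  public SlashSplittingTokenizer(Reader in) {
    super(in);
  }

  @Override
  protected boolean isTokenChar(char c) {
    // A character is part of a token unless it is whitespace or '/'.
    return !Character.isWhitespace(c) && c != '/';
  }
}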
Well, Tokenizer and TokenFilter are both subclasses of TokenStream; they
differ in where their input comes from. A Tokenizer gets its input from a
Reader and creates tokens from it. A TokenFilter consumes the tokens created
by a Tokenizer (or by another filter) and does further processing.

For instance, an Analyzer that uses a WhitespaceTokenizer as the input of a
LowerCaseFilter would produce the following:

Input: C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".NET"; offset: 6->10; pos: 3]

LowerCaseFilter:
[Tokenstring: "c#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

If you want to do any further processing with those tokens, you can add your
own TokenFilter and modify the tokens as you need. You could do the whole job
in a Tokenizer, but that would not be a good separation of concerns, right?

simon

> regards,
> Valery
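To make that chain concrete, here is a minimal sketch of such an Analyzer,
assuming the Lucene 2.9-era analysis API (constructors and attribute classes
were renamed in later releases); MyAnalyzer and PassThroughFilter are
hypothetical names used only for illustration:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MyAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenizer: reads the character stream and emits "C#", "or", ".NET"
    TokenStream stream = new WhitespaceTokenizer(reader);
    // TokenFilter: consumes those tokens and lowercases them
    stream = new LowerCaseFilter(stream);
    // Any further processing is layered on as another filter
    stream = new PassThroughFilter(stream);
    return stream;
  }

  /** Skeleton filter showing where custom token rewriting would go. */
  static final class PassThroughFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    PassThroughFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false; // upstream tokenizer/filter is exhausted
      }
      // Inspect or rewrite the current token here, e.g.:
      // termAtt.setTermBuffer(rewrite(termAtt.term()));
      return true;
    }
  }
}

The skeleton filter passes tokens through unchanged; the point is only that
the tokenizer decides token boundaries, while case folding and any
domain-specific rewriting live in filters further down the chain.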