Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Valery Thu, 20 Aug 2009 08:16:22 -0700

Hi Robert,

thanks for the hint.

Indeed, that is a natural way to go, especially if one builds a Tokenizer of
the same quality as StandardTokenizer.

OTOH, do you mean that the out-of-the-box components are indeed not
customizable for this task?

regards
Valery



Robert Muir wrote:
> 
> Valery,
> 
> One thing you could try would be to create a JFlex-based tokenizer,
> specifying a grammar with the rules you want.
> You could use the source code & grammar of StandardTokenizer as a
> starting point.
> 
> 
> On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to tune Lucene to respect tokens such as C++, C#, .NET
>>
>> The task is well known in the Lucene community, but surprisingly I can't
>> find any good information on it via Google.
>>
>> Of course, I tried to re-use Lucene's building blocks for a Tokenizer.
>> Here we go:
>>
>>  1) StandardTokenizer -- oh, this option would be just fantastic, but
>> "C++, C#, .NET" ends up as "c c net". Too bad.
>>
>>  2) WhitespaceTokenizer gives me a lot of lexemes that actually should
>> have been chopped into smaller pieces. Example: "C/C++" comes out as a
>> single lexeme. If I follow this way I end up with "Tokenization of tokens"
>> -- that sounds a bit odd, doesn't it?
>>
>>  3) CharTokenizer allows me to make '/' a token-emitting char as well,
>> but then '/' gets lost immediately, just like the whitespace chars. As a
>> result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search
>> the original char stream for the "/" char to rebuild the "SAP R/3" term as
>> a whole.
>>
>> Do you see any other relevant building blocks that I have missed?
>>
>> Also, people have suggested that this problem should be solved with a
>> synonym dictionary. However, this hint sheds no light on which
>> tokenization strategy would be appropriate *before* the synonym step.
>>
>> So, it looks like I have to take the CharTokenizer class as a starting
>> point and write my own Tokenizer from scratch. This Tokenizer should also
>> react to delimiting characters and emit a token. However, it should
>> distinguish between delimiters like whitespace and ";,?" and delimiters
>> like "./&".
>>
>> Indeed, delimiters like whitespace and ";,?" should be thrown away at the
>> lexeme level, whereas token-emitting characters like "./&" should be kept
>> at the lexeme level.
>>
>> Your comments, gurus?
>>
>> regards,
>> Valery
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
> 
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 
> 
> 
> 
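
The two delimiter classes proposed in the quoted message can be sketched in
plain Java without any Lucene dependency. This is only an illustration under
stated assumptions, not a Lucene Tokenizer: the class name
DelimiterAwareTokenizer and the method tokenize are hypothetical, the two
delimiter sets contain only the characters named in the message, and it is
assumed that a kept delimiter is emitted as a one-character token of its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of Lucene and not derived from CharTokenizer.
public class DelimiterAwareTokenizer {
    // Assumption: delimiters that end a token and are thrown away
    // (whitespace is handled separately below).
    private static final String DISCARDED = ";,?";
    // Assumption: delimiters that end a token and are kept as tokens themselves.
    private static final String KEPT = "./&";

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean discarded = Character.isWhitespace(c) || DISCARDED.indexOf(c) >= 0;
            boolean kept = KEPT.indexOf(c) >= 0;
            if (discarded || kept) {
                // Any delimiter closes the token collected so far.
                flush(current, tokens);
                if (kept) {
                    // A kept delimiter survives as a single-character token.
                    tokens.add(String.valueOf(c));
                }
            } else {
                // Everything else, including '+' and '#', stays inside the token.
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Emits the pending token, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Prints [C, /, C++, C#, ., NET, SAP, R, /, 3]
        System.out.println(tokenize("C/C++, C#, .NET; SAP R/3"));
    }
}
```

Because "/" and "." remain in the token stream, a later step (for example the
synonym step mentioned in the quoted message) could rejoin "R", "/", "3" into
"R/3" or ".", "NET" into ".NET" without going back to the original char
stream. Whether to rejoin them, and how, is left open here.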

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


