Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Valery Thu, 20 Aug 2009 10:32:44 -0700

Hi Ken, 

thanks for the comments. Well, Terrence's ANTLR was and is a good piece of
work.


Do you mean that you use ANTLR to generate a Tokenzer (lexem parser) 

or 

did you even proceed further and used ANTLR to generate higher level parsers
to overrule Lucene's TokenFilters?

or maybe even both?..

regards,
Valery



Ken Krugler wrote:
> 
> Hi Valery,
> 
>  From our experience at Krugle, we wound up having to create our own  
> tokenizers (actually kind of  specialized parser) for the different  
> languages. It didn't seem like a good option to try to twist one of  
> the existing tokenizers into something that would work well enough. We  
> wound up using ANTLR for this.
> 
> -- Ken
> 
> 
> On Aug 20, 2009, at 8:09am, Valery wrote:
> 
>>
>> Hi Robert,
>>
>> thanks for the hint.
>>
>> Indeed, a natural way to go. Especially if one builds a Tokenizer of  
>> the
>> level of quality like StandardTokenizer's.
>>
>> OTOH, you mean that the out-of-the-box stuff is indeed not  
>> customizable for
>> this task?..
>>
>> regards
>> Valery
>>
>>
>>
>> Robert Muir wrote:
>>>
>>> Valery,
>>>
>>> One thing you could try would be to create a JFlex-based tokenizer,
>>> specifying a grammar with the rules you want.
>>> You could use the source code & grammar of StandardTokenizer as a
>>> starting point.
>>>
>>>
>>> On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>>
>>>> The task is known for Lucene community, but surprisingly I can't  
>>>> google
>>>> out
>>>> somewhat good info on it.
>>>>
>>>> Of course, I tried to re-use Lucene's  building blocks for  
>>>> Tokenizer.
>>>> Here
>>>> we go:
>>>>
>>>>  1) StandardTokenizer -- oh, this option would be just fantastic,  
>>>> but
>>>> "C++,
>>>> C#, .NET" ends up with "c c net". Too bad.
>>>>
>>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually  
>>>> should
>>>> have been chopped into smaller pieces. Example: "C/C++" comes out  
>>>> like a
>>>> single lexem. If I follow this way I end-up with "Tokenization of  
>>>> tokens"
>>>> --
>>>> that sounds a bit odd, doesn't it?
>>>>
>>>>  3) CharTokenizer allows me to add the '/' to be also a token- 
>>>> emitting
>>>> char, but then '/' gets immediately lost like those whitespace  
>>>> chars. In
>>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to  
>>>> search
>>>> the
>>>> original char stream for the "/" char to re-build "SAP R/3" term  
>>>> as a
>>>> whole.
>>>>
>>>> Do you see any other relevant building blocks missed by me?
>>>>
>>>> Also, people around there have meant that such problem should be  
>>>> solved
>>>> by a
>>>> synonym dictionary. However this hint sheds no light on which
>>>> tokenization
>>>> strategy should be more appropriate *before* the synonym step.
>>>>
>>>> So, it looks like I have to take the class CharTokenizer as for the
>>>> starting
>>>> point and write anew my own Tokenizer. This Tokenizer should also  
>>>> react
>>>> on
>>>> delimiting characters and emit the token. However, it should  
>>>> distinguish
>>>> between delimiters like whitespaces along with ";,?" and the  
>>>> delimiters
>>>> like
>>>> "./&".
>>>>
>>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown  
>>>> away
>>>> from
>>>> Lexem level,
>>>> whereas the token emitting characters like "./&" should be kept in  
>>>> Lexem
>>>> level.
>>>>
>>>> Your comments, gurus?
>>>>
>>>> regards,
>>>> Valery
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>>> Sent from the Lucene - Java Users mailing list archive at  
>>>> Nabble.com.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>>
>>> -- 
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
> 
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066540.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Reply via email to