Re: Better analysis of hyphenated words

Erik Hatcher Thu, 27 Oct 2005 11:14:15 -0700


On 27 Oct 2005, at 12:13, Rob Young wrote:

I'm using StandardAnalyzer during indexing and I have noticed that it splits hyphenated words in two, ditching the hyphen. This is messing up some of my search results. I would like to keep using StandardAnalyzer because it's very good on the whole, however I would like to add an extra term in these cases. I am fine doing everything except figuring out when StandardTokenizer has split a hyphenated word. All I get is the individual tokens with a type ALPHANUM. Can anyone think of a way I can do this without having to dive into StandardTokenizer?
I have looked at the source for StandardTokenizer and I really really really don't want to have to go there :/

StandardTokenizer is a JavaCC grammar - and it's actually not that complex, though JavaCC is a whole other technology to learn if you've not done it before. Look at StandardTokenizer.jj, not .java.

You could pretty easily modify the .jj file and add the hyphen to the alphanumeric tokens, rebuild it using JavaCC (the Ant build file for Lucene can do this for you once you have JavaCC).

Using StandardTokenizer without modifying it won't be possible to achieve what you're after - the damage is already done on the output of StandardTokenizer.
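
To make the target behaviour concrete, here is a minimal standalone sketch of the token stream being asked for: the hyphenated word is kept as one term and its parts are emitted as extra terms. It is plain Java with a regular expression, not Lucene code and not the StandardTokenizer grammar; the class name HyphenTokens, the method tokenize and the pattern are all made up for illustration, and the lowercasing only roughly imitates what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Lucene.
public class HyphenTokens {
    // Runs of letters/digits, optionally joined by single internal hyphens.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            out.add(token); // the whole token, hyphen included
            if (token.indexOf('-') >= 0) {
                // Also emit the individual parts as extra terms.
                for (String part : token.split("-")) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, well-known, well, known, e-mail, e, mail]
        System.out.println(tokenize("A well-known e-mail"));
    }
}
```

The point of the sketch is only that the hyphen has to survive tokenization for the joined term to exist at all; once the tokenizer has dropped it, a later stage has nothing to rebuild the term from.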


    Erik

