On 27 Oct 2005, at 12:13, Rob Young wrote:
> I'm using StandardAnalyzer during indexing and I have noticed that it splits hyphenated words in two, ditching the hyphen. This is messing up some of my search results. I would like to keep using StandardAnalyzer because it's very good on the whole, however I would like to add an extra term in these cases. I am fine doing everything except figuring out when StandardTokenizer has split a hyphenated word. All I get is the individual tokens with a type ALPHANUM. Can anyone think of a way I can do this without having to dive into StandardTokenizer?
>
> I have looked at the source for StandardTokenizer and I really really really don't want to have to go there :/

StandardTokenizer is a JavaCC grammar - and it's actually not that complex, though JavaCC is a whole other technology to learn if you've not done it before. Look at StandardTokenizer.jj, not .java.

You could pretty easily modify the .jj file and add the hyphen to the alphanumeric tokens, rebuild it using JavaCC (the Ant build file for Lucene can do this for you once you have JavaCC).
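
To make the intended effect concrete, here is a minimal plain-Java sketch of the term stream you would be aiming for once hyphens are part of the alphanumeric token. It is not Lucene code and not the StandardTokenizer grammar: the class name HyphenAwareTokens and the regular expression are hypothetical stand-ins, and emitting the parts alongside the whole word is just one assumed way to get the "extra term" asked about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this does not use or mimic Lucene's API.
public class HyphenAwareTokens {
    // Runs of letters/digits, optionally joined by single internal hyphens.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            // Always keep the token as matched, hyphen included.
            terms.add(token);
            // For hyphenated words, also emit each part as its own term.
            if (token.indexOf('-') >= 0) {
                for (String part : token.split("-")) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Send an e-mail about the re-index job"));
    }
}
```

The point of the sketch is only that the hyphen has to survive tokenization for any later step to notice it; in Lucene that decision lives in the grammar, which is why the change belongs in the .jj file.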

It won't be possible to achieve what you're after with an unmodified StandardTokenizer: the damage is already done by the time you see its output, because the hyphen has been discarded and the pieces come through as ordinary ALPHANUM tokens.
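
A small plain-Java sketch of why the information is gone. It assumes a deliberately simplified splitter (the hypothetical HyphenLossDemo class below, which breaks on anything that is not a letter or digit); it is not the real StandardTokenizer, which has more rules, but the loss it shows is the same kind: a hyphenated word and two separate words produce identical terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a tokenizer that drops hyphens.
public class HyphenLossDemo {
    // Only runs of letters/digits count as tokens; everything else separates.
    private static final Pattern ALPHANUM = Pattern.compile("[\\p{L}\\p{N}]+");

    public static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = ALPHANUM.matcher(text);
        while (m.find()) {
            terms.add(m.group().toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> hyphenated = split("wi-fi router");
        List<String> spaced = split("wi fi router");
        // Both inputs give the same terms, so the terms alone
        // cannot tell you which input contained a hyphen.
        System.out.println(hyphenated.equals(spaced));
    }
}
```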

    Erik

