On 27 Oct 2005, at 12:13, Rob Young wrote:
> I'm using StandardAnalyzer during indexing and I have noticed that
> it splits hyphenated words in two, ditching the hyphen. This is
> messing up some of my search results. I would like to keep using
> StandardAnalyzer because it's very good on the whole, however I
> would like to add an extra term in these cases. I am fine doing
> everything except figuring out when StandardTokenizer has split a
> hyphenated word. All I get is the individual tokens with a type
> ALPHANUM. Can anyone think of a way I can do this without having to
> dive into StandardTokenizer?
>
> I have looked at the source for StandardTokenizer and I really
> really really don't want to have to go there :/

StandardTokenizer is a JavaCC grammar - and it's actually not that
complex, though JavaCC is a whole other technology to learn if you've
not done it before. Look at StandardTokenizer.jj, not .java.
You could pretty easily modify the .jj file so that the hyphen is
allowed inside alphanumeric tokens, then rebuild it using JavaCC (the
Ant build file for Lucene can do this for you once you have JavaCC).
Without modifying StandardTokenizer you won't be able to achieve what
you're after - the damage is already done by the time you see the
output of StandardTokenizer.
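
To illustrate why the hyphen can't be recovered downstream, here is a
rough sketch in plain Java. It does not use Lucene or JavaCC at all:
the class HyphenSketch, its two regex patterns and the withParts
method are made up for this example, and the patterns are only a
stand-in for the grammar's token definitions, not a copy of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Lucene.
public class HyphenSketch {
    // Letters and digits only, so a hyphen ends the token.
    static final Pattern ALPHANUM = Pattern.compile("[A-Za-z0-9]+");
    // Same runs, but single hyphens between them stay inside the token.
    static final Pattern ALPHANUM_HYPHEN =
            Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    // Collect every match of the pattern, in order, as a token.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Emit the whole hyphenated token first, then its parts as extra terms.
    static List<String> withParts(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : tokenize(ALPHANUM_HYPHEN, text)) {
            tokens.add(token);
            if (token.indexOf('-') >= 0) {
                tokens.addAll(tokenize(ALPHANUM, token));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both lines print the same list: the hyphens leave no trace.
        System.out.println(tokenize(ALPHANUM, "state-of-the-art index"));
        System.out.println(tokenize(ALPHANUM, "state of the art index"));
        // Keeping the hyphen in the token lets a later step add extra terms.
        System.out.println(withParts("well-known term"));
    }
}
```

The first two calls produce identical token lists, which is the
situation a filter sitting after the tokenizer is in. Only once the
hyphen survives tokenization, as in withParts, can the extra term be
added.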
Erik