According to "Lucene in Action" it is possible to get synonyms indexed together with a word by putting multiple words with the same position-id in the term vector.
My problem is, however, that some words need alternatives in which the word is decomposed (decompounded) into two or more words:
"FooBar Corp" or "cybercafe"
should be found when searching for
"Foo Ba*" or "cyber cafe"
First, a phrase query with wildcards is not currently a built-in capability of Lucene.
You'll need some kind of lookup to know how to split a token like "cybercafe" into two words. Once you've done that, it is easy to set the position increment of the parts to zero so that they overlay the original term.
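A minimal sketch of that idea in plain Java, without any Lucene classes. The Tok class, the SPLITS table and the expand method are all hypothetical names made up for illustration; in a real analyzer this logic would live in a token filter and the increment would be set on Lucene's own token object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class DecompoundSketch {

    /** Hypothetical stand-in for an analyzer token: text plus position increment. */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
        @Override public String toString() { return text + "(+" + posInc + ")"; }
    }

    /** Hypothetical lookup table: compound word -> its parts. */
    static final Map<String, List<String>> SPLITS = Map.of(
            "cybercafe", List.of("cyber", "cafe"),
            "foobar", List.of("foo", "bar"));

    /** Emits each word normally, then its parts (if any) with increment 0. */
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String w : words) {
            String key = w.toLowerCase(Locale.ROOT);
            out.add(new Tok(key, 1));            // original term advances the position
            List<String> parts = SPLITS.get(key);
            if (parts != null) {
                for (String p : parts) {
                    out.add(new Tok(p, 0));      // parts overlay the original term
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("FooBar", "Corp")));
    }
}
```

Note that in this sketch every part shares the position of the original term. A strict phrase query expects its terms at consecutive positions, so it is worth testing whether the later parts should advance the position instead.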
The reverse is also true: "Foo Bar Corp" should be found with "Foob* corp".
Again the caveat about wildcard phrase queries applies. You'll need to use that same lookup mechanism to combine multiple tokens coming through a token filter into a single combined one.
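The combining direction could be sketched the same way, again in plain Java rather than against the Lucene API. The Tok class, the JOINS table and the combine method are hypothetical names for illustration only; a real token filter would have to buffer one token of lookahead to do this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CompoundSketch {

    /** Hypothetical stand-in for an analyzer token: text plus position increment. */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
        @Override public String toString() { return text + "(+" + posInc + ")"; }
    }

    /** Hypothetical lookup table: adjacent word pair -> combined token. */
    static final Map<String, String> JOINS = Map.of(
            "foo bar", "foobar",
            "cyber cafe", "cybercafe");

    /** Emits each word normally; when a word and its successor form a
     *  known pair, also emits the combined token at the first word's position. */
    static List<Tok> combine(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i).toLowerCase(Locale.ROOT);
            out.add(new Tok(w, 1));
            if (i + 1 < words.size()) {
                String next = words.get(i + 1).toLowerCase(Locale.ROOT);
                String joined = JOINS.get(w + " " + next);
                if (joined != null) {
                    out.add(new Tok(joined, 0)); // overlays the first word of the pair
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("Foo", "Bar", "Corp")));
    }
}
```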
So how do I solve this problem?
The approach I've described above is a crude dictionary-based one. I'm sure other tricks could be employed, but they would be quite sophisticated (N-gram sequencing à la LingPipe comes to mind).
Erik