On Apr 11, 2005, at 9:36 AM, Peter Hotm. Nørregaard wrote:
According to "Lucene in Action" it is possible to get synonyms indexed together with a word by putting multiple words with the same position-id in the term vector.

My problem is, however, that some words needs to have alternatives where the word is decomposed / decompounded into two or more words:

"FooBar Corp" or "cybercafe"

should be found when searching for

"Foo Ba*" or "cyber cafe"

First, the phrase query with wildcards is not currently a built-in capability of Lucene.


You'll need some kind of lookup to know how to split a token like "cybercafe" into two words - once you've done that it will be easy to set the position increment of them to zero so that they overlay the original term.

The reverse is also true: The "Foo Bar Corp" should be found with "Foob* corp".

Again the caveat about wildcard phrase queries applies. You'll need to use that same lookup mechanism to combine multiple tokens coming through a token filter into a single combined one.



So how do I solve this problem?

The approach I've described above is a crude dictionary-based one, but I'm sure other tricks could be employed but they would be quite sophisticated (N-gram sequencing ala LingPipe comes to mind).


        Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to