Re: Wild card and multiple keyword search

Erik Hatcher Wed, 13 Jul 2005 06:39:45 -0700


On Jul 13, 2005, at 8:18 AM, Rahul D Thakare wrote:

We are using doc.add(Field.Text("keywords",keywords)); to add the keywords to the document, where keywords is a comma-separated keywords string.

If the text is already comma separated and that is the level at which you want things tokenized, then simply do something like this (untested pseudo-code):


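    // Add each comma-separated value as its own untokenized term.
    String[] values = keywords.split(",");
    for (int i = 0; i < values.length; i++) {
        // trim() drops any whitespace left around the commas
        doc.add(Field.Keyword("keywords", values[i].trim()));
    }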

Lucene seems to tokenize keywords with multiple words (like MAIN BOARD) as different keywords (i.e. as MAIN and BOARD). Tokenization is based on comma and space... So if we search for "MAIN BOARD", documents having keywords like "MAIN LOGIC", "MAIN PARTS", etc. also show up.
If one searches for "MAIN BOARD", we want to get only the documents that have "MAIN BOARD". How to do this?

The question back to you is: do you want searches for simply "MAIN" to find both "MAIN LOGIC" and "MAIN PARTS"? Or should it return no documents since it's not an exact match?

Using the above code, "MAIN" would find neither of those and the query would have to be exact. I see below you've clarified this requirement...
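
To make the exact-match behaviour concrete, here is a minimal plain-Java sketch that models untokenized terms without using Lucene at all. The class KeywordTerms and its methods are hypothetical names chosen for illustration, and the prefix lookup only approximates what a prefix-style query over whole terms would do; treat it as a sketch under those assumptions rather than as how Lucene is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model (not Lucene API): each comma-separated value is kept
// as one whole term, the way an untokenized keyword field would store it.
final class KeywordTerms {
    private final TreeSet<String> terms = new TreeSet<>();

    KeywordTerms(String commaSeparated) {
        for (String value : commaSeparated.split(",")) {
            String term = value.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
    }

    // Exact match: the query must equal a whole term.
    boolean matchesExact(String query) {
        return terms.contains(query);
    }

    // Prefix match over whole terms, roughly what "MAIN*" asks for.
    List<String> matchesPrefix(String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            hits.add(term);
        }
        return hits;
    }
}
```

With the terms "MAIN BOARD", "MAIN LOGIC" and "POWER SUPPLY", an exact lookup for "MAIN" finds nothing, while a prefix lookup for "MAIN" finds the first two. In this model the space is simply part of the term, which is why the query side must not split on whitespace either.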

To achieve this we used doc.add(Field.Keyword("keywords",keywords)); and while searching we cannot use the standard analyzer, as it divides the keywords if we search for keywords containing a space... so we wrote a KeywordAnalyzer (which basically returns only one single token) as given below.

There is a KeywordAnalyzer now in the contrib/analyzers codebase, and it will ship with the next version of Lucene (or you could build it yourself and use it). There are also a couple of variants of the KeywordAnalyzer in the Lucene in Action code (www.lucenebook.com).

This solves the above problem, but we are not able to do wildcard searches like MAIN*, etc.
We need both functionalities, i.e.
1. If the user searches for MAIN BOARD, they should get only documents that contain MAIN BOARD and not MAIN LOGIC, MAIN, MAIN PART, etc.
2. The user should be able to do a wildcard search like MAIN*, etc. and get the desired documents.
Please let us know how we should do the indexing, and which analyzer to use for the search.

There are many ways to go about this sort of thing, and I apologize for being short on time and not able to explain them all fully. One option is to keep the tokenization using a traditional analyzer so that it separates by whitespace, but turn the user's query into a PhraseQuery. If you really mean for wildcards to match single whole words in the field (in other words, users don't need to query on MA*), then the space-separated tokenization would work fine here as well.
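
As a rough illustration of that option, the following plain-Java sketch models a whitespace-tokenized field and the difference between a single-word query and a phrase query. It does not use Lucene; TokenizedField and its methods are hypothetical names, and it assumes one keyword value per field with simple whitespace splitting.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model (not Lucene API): a field tokenized on whitespace,
// queried either by a single word or by an exact phrase.
final class TokenizedField {
    private final List<String> tokens;

    TokenizedField(String text) {
        String trimmed = text.trim();
        this.tokens = trimmed.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(trimmed.split("\\s+"));
    }

    // Single-word query: true if the word is one of the tokens.
    boolean containsWord(String word) {
        return tokens.contains(word);
    }

    // Phrase query: the words must appear adjacent and in order.
    boolean containsPhrase(String phrase) {
        List<String> wanted = new TokenizedField(phrase).tokens;
        if (wanted.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokens, wanted) >= 0;
    }
}
```

Here the phrase "MAIN BOARD" matches a field holding "MAIN BOARD" but not one holding "MAIN LOGIC", while the single word "MAIN" matches both without any wildcard. A partial word such as "MA" matches nothing, which is the case where a real wildcard or prefix query would still be needed.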

It is important to think through the analysis process as well as the search interface issues (the interface must be given thorough consideration and treated as a first-class citizen when discussing implementations), especially when wildcard and range queries come up. How to deal with wildcards and ranges efficiently has been a hot topic recently. In your example, if by "MAIN*" you intend for the word MAIN to be a unique token, and the user would choose a full word to search upon and merely wants to find it within a larger field, then wildcards are not necessary.


    Erik

