On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com> wrote:
Hi,

> My requirement is it should have capabilities to match multiple words as
> one token. for example. When user passes String as International Business
> machine logo or IBM logo it should return International Business Machine as
> one token and logo as one token.

This is an interesting problem. I suppose that if the user enters "International Business Machines", possibly with some misspelling, you want to find all documents containing "IBM" - and that if he enters the string "IBM", you want to find documents which contain the string "International Business Machines", or even only parts of it.

So you need some kind of map relating acronyms to the words of the phrases they stand for. There really are two directions here: acronym to content, and content to acronym.

One cannot find what an acronym means without some kind of acronym dictionary. Whatever approach you intend to use, there should therefore be an external dictionary involved which maps each acronym to a list of possible phrases. After retrieving all phrases matching the entered acronym, you'd inject each word of each phrase as a token (removing possible duplicates between phrases). That's basically it for the direction "acronym to content".

The direction "content to acronym" is trickier, I believe. One way is to generate a second (reversed) map, from each word of an acronym's phrases to the list of acronyms whose phrases contain that word. You'd simply inject acronyms (and possibly other things) if one word of their content is matched (or more than one word, if you want to increase relevance). This could however require a specific hashing or approximate-matching mechanism if you want to find keys within some edit distance (e.g. "intenational", with the missing "r", would still find "IBM").

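To make the two maps a bit more concrete, here is a minimal sketch in plain Java. It is not a Lucene class: `AcronymDictionary`, `add`, `expand` and `acronymsFor` are names I made up for illustration, and the approximate lookup is a simple linear scan with Levenshtein distance rather than any special hashing, so it would only suit a small dictionary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper (not part of Lucene): a two-way acronym dictionary.
class AcronymDictionary {
    // acronym -> phrases it may stand for
    private final Map<String, List<String>> phrasesByAcronym = new TreeMap<>();
    // lower-cased phrase word -> acronyms whose phrases contain that word
    private final Map<String, Set<String>> acronymsByWord = new TreeMap<>();

    void add(String acronym, String phrase) {
        phrasesByAcronym.computeIfAbsent(acronym, k -> new ArrayList<>()).add(phrase);
        for (String word : words(phrase)) {
            acronymsByWord.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(acronym);
        }
    }

    // "Acronym to content": every distinct word of every phrase of the acronym.
    Set<String> expand(String acronym) {
        Set<String> tokens = new LinkedHashSet<>();
        List<String> phrases = phrasesByAcronym.getOrDefault(acronym, Collections.<String>emptyList());
        for (String phrase : phrases) {
            Collections.addAll(tokens, words(phrase));
        }
        return tokens;
    }

    // "Content to acronym": acronyms having a phrase word within maxDistance
    // edits of the given word (linear scan over all known words).
    Set<String> acronymsFor(String word, int maxDistance) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : acronymsByWord.entrySet()) {
            if (editDistance(needle, entry.getKey()) <= maxDistance) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    private static String[] words(String phrase) {
        return phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Plain Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        AcronymDictionary dictionary = new AcronymDictionary();
        dictionary.add("IBM", "International Business Machines");
        System.out.println(dictionary.expand("IBM"));                // [international, business, machines]
        System.out.println(dictionary.acronymsFor("intenational", 1)); // [IBM]
    }
}
```

The tokens and acronyms returned here would still have to be injected into the token stream by whatever analysis component you write; that part is left out on purpose.
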
A second way (more coupled to the concept of acronym, so less generic) could be to consider that every word starting with a capital letter is part of an acronym: buffer sequences of words starting with a capital letter, and eventually inject the resulting acronym if it is found in the acronym dictionary. This might not be safe, though - the user may not have the discipline to capitalize the words being part of an acronym (or may even misspell the first letter), or the concatenated first letters could match an irrelevant acronym (many word sequences can give the acronym "IBM").

I do not know whether there already exists some Lucene module which processes acronyms, or if someone is working on one. It's definitely worth a search though, because writing a good one from scratch could mean a few days of work, or more.

HTH.
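
P.S. The capital-letter buffering idea could be sketched roughly as below. Again this is plain Java with made-up names (`CapitalizedRunAcronyms`, `detect`), not Lucene code, and it assumes whitespace-separated words and a set of known acronyms supplied by you. It only checks the initials of a whole run of capitalized words, so a leading capitalized word such as "The" would break the match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Lucene): finds acronyms formed by the
// initials of consecutive capitalized words.
class CapitalizedRunAcronyms {

    static List<String> detect(String text, Set<String> knownAcronyms) {
        List<String> found = new ArrayList<>();
        StringBuilder initials = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                // Still inside a run of capitalized words: buffer the initial.
                initials.append(word.charAt(0));
            } else {
                // The run ended: check the buffered initials, then reset.
                flush(initials, knownAcronyms, found);
            }
        }
        flush(initials, knownAcronyms, found);
        return found;
    }

    private static void flush(StringBuilder initials, Set<String> known, List<String> found) {
        // A single capitalized word is not treated as an acronym candidate.
        if (initials.length() > 1 && known.contains(initials.toString())) {
            found.add(initials.toString());
        }
        initials.setLength(0);
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("IBM"));
        System.out.println(detect("International Business Machines logo", known)); // [IBM]
        System.out.println(detect("international business machines logo", known)); // []
        System.out.println(detect("Intel Bought Motorola", known));                // [IBM]
    }
}
```

The last two calls show the two weaknesses mentioned above: lower-cased input is missed entirely, and an unrelated capitalized sequence gives a false "IBM".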