If you already know the set of phrases you need to detect, you can use Lucene's SynonymFilter to spot them and insert a new token.
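
To illustrate the idea (not SynonymFilter itself), here is a minimal plain-Java sketch of spotting known phrases in a token stream and emitting each as a single token. It does not use Lucene; the class name PhraseTokenizer and its tokenize method are hypothetical, and matching is assumed to be case-insensitive on whitespace-separated words with the longest phrase winning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): merges known multi-word phrases into single tokens. */
final class PhraseTokenizer {

    /** Lower-cased phrase mapped to the canonical form that is emitted on a match. */
    private final Map<String, String> phrases = new HashMap<>();

    /** Word count of the longest known phrase; bounds the lookahead window. */
    private int maxWords = 1;

    PhraseTokenizer(List<String> knownPhrases) {
        for (String phrase : knownPhrases) {
            String[] words = phrase.trim().split("\\s+");
            String canonical = String.join(" ", words);
            maxWords = Math.max(maxWords, words.length);
            phrases.put(canonical.toLowerCase(Locale.ROOT), canonical);
        }
    }

    /** Splits text on whitespace, replacing each known phrase with one token. */
    List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            String canonical = null;
            // Try the longest window first so longer phrases win over shorter ones.
            for (int len = Math.min(maxWords, words.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len))
                        .toLowerCase(Locale.ROOT);
                canonical = phrases.get(candidate);
                if (canonical != null) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add(canonical);   // the whole phrase becomes one token
                i += matched;
            } else {
                out.add(words[i]);    // ordinary word, passed through unchanged
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PhraseTokenizer tokenizer = new PhraseTokenizer(
                Arrays.asList("International Business Machine", "IBM"));
        System.out.println(tokenizer.tokenize("International Business machine logo"));
        System.out.println(tokenizer.tokenize("IBM logo"));
    }
}
```

In a real Lucene analysis chain the phrase set would instead be loaded into a synonym map and applied by the filter during analysis; the sketch only shows the matching logic on plain strings.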

Mike McCandless

http://blog.mikemccandless.com

On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies <ben...@basistech.com> wrote:
> It sounds like you've been asked to implement Named Entity Recognition.
> OpenNLP has some capability here. There are also, um, commercial
> alternatives.
>
>
> On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio <ye.pe...@gmail.com> wrote:
>
>> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> > My requirement is it should have capabilities to match multiple words as
>> > one token. for example. When user passes String as International Business
>> > machine logo or IBM logo it should return International Business Machine as
>> > one token and logo as one token.
>>
>> This is an interesting problem. I suppose that if the user enters
>> "International Business Machines", possibly with some misspelling, you
>> want to find all documents containing "IBM" - and that if he enters
>> the string "IBM", you want to find documents which contain the string
>> "International Business Machines", or even only parts of it. So this
>> means you need some kind of map relating some acronyms with their
>> content parts. There really are two directions here: acronym to
>> content and content to acronym.
>>
>> One cannot find what an acronym means without some kind of acronym
>> dictionary. This means that whatever approach you intend to use, there
>> should be an external dictionary involved, which, for each acronym,
>> would map a list of possible phrases. Retrieving all phrases matching
>> the inputted acronym, you'd inject each part of each phrase as a token
>> (removing possible duplicates between phrase parts). That's basically
>> it for the direction "acronym to content".
>>
>> The direction "content to acronym" is trickier, I believe. One way is
>> to generate a second (reversed) map, matching each acronym content
>> part to a list of acronyms containing that part. You'd simply inject
>> acronyms (and possibly other things) if one part of their content is
>> matched (or more than one part, if you want to increase relevance).
>> This could however possibly require the definition of a specific
>> hashing mechanism, if you want to find approximate (distanced) keys
>> (e.g. "intenational", with the lacking "r", would still find "IBM"). A
>> second way (more coupled to the concept of acronym, so less generic)
>> could be to consider that every word starting with a capital letter is
>> part of an acronym, buffering sequences of words starting with a
>> capital letter, and eventually injecting the resulting acronym, if
>> found in the acronym dictionary. This might not be safe, though - the
>> user may not have the discipline to capitalize the words being part of
>> an acronym (or may even misspell the first letter), or concatenated
>> first letters could match an irrelevant acronym (many word sequences
>> can give the acronym "IBM").
>>
>> I do not know whether there already exists some Lucene module which
>> processes acronyms, or if someone is working on one. It's definitely
>> worth a search though, because writing a good one from scratch could
>> mean a few days of work, or more.
>>
>> HTH.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org