Re: Custom Tokenizer/Analyzer

Michael McCandless Thu, 20 Feb 2014 04:31:56 -0800

If you already know the set of phrases you need to detect then you can
use Lucene's SynonymFilter to spot them and insert a new token.
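
To illustrate the idea of spotting a known phrase and emitting it as one
token, here is a minimal plain-Java sketch. It is not Lucene's
SynonymFilter and uses no Lucene API; the class PhraseSpotter and its
methods are hypothetical names invented for this example, and it assumes
simple whitespace tokenization with case-insensitive, exact phrase
matching. In Lucene itself, the phrase list would instead go into a
SynonymMap that is handed to the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: merges known multi-word phrases into single tokens. */
final class PhraseSpotter {
    // Known phrases, lowercased and joined with single spaces.
    private final Set<String> phrases = new HashSet<>();
    // Length in words of the longest known phrase.
    private int maxWords = 1;

    PhraseSpotter(List<String> knownPhrases) {
        for (String phrase : knownPhrases) {
            String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
            phrases.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.length);
        }
    }

    /** Splits on whitespace, then greedily merges the longest known phrase at each position. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return tokens;
        }
        String[] words = text.trim().split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            // Try the longest candidate first so longer phrases win over shorter ones.
            for (int len = Math.min(maxWords, words.length - i); len >= 2; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (phrases.contains(candidate.toLowerCase(Locale.ROOT))) {
                    tokens.add(candidate); // the whole phrase becomes one token
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                tokens.add(words[i]); // no phrase starts here: emit the single word
                matched = 1;
            }
            i += matched;
        }
        return tokens;
    }

    public static void main(String[] args) {
        PhraseSpotter spotter =
                new PhraseSpotter(Arrays.asList("International Business Machines"));
        // prints [International Business Machines, logo]
        System.out.println(spotter.tokenize("International Business Machines logo"));
    }
}
```

Matching the longest candidate first is a design choice made for this
sketch; it keeps a short phrase from shadowing a longer one that starts
with the same words.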


Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies <ben...@basistech.com> wrote:
> It sounds like you've been asked to implement Named Entity Recognition.
> OpenNLP has some capability here. There are also, um, commercial
> alternatives.
>
>
> On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio <ye.pe...@gmail.com> wrote:
>
>> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> > My requirement is that it should be able to match multiple words as
>> > one token. For example, when the user passes the string "International
>> > Business Machines logo" or "IBM logo", it should return "International
>> > Business Machines" as one token and "logo" as one token.
>>
>> This is an interesting problem. I suppose that if the user enters
>> "International Business Machines", possibly with some misspelling, you
>> want to find all documents containing "IBM", and that if the user
>> enters "IBM", you want to find documents which contain the string
>> "International Business Machines", or even only parts of it. So you
>> need some kind of map relating acronyms to the words of their
>> expansions. There really are two directions here: acronym to
>> content and content to acronym.
>>
>> One cannot find what an acronym means without some kind of acronym
>> dictionary. This means that whatever approach you intend to use, an
>> external dictionary should be involved which maps each acronym to a
>> list of possible phrases. After retrieving all phrases matching the
>> input acronym, you'd inject each part of each phrase as a token
>> (removing possible duplicates between phrase parts). That's basically
>> it for the direction "acronym to content".
>>
>> The direction "content to acronym" is trickier, I believe. One way is
>> to generate a second (reversed) map, matching each acronym content
>> part to a list of acronyms containing that part. You'd simply inject
>> acronyms (and possibly other things) if one part of their content is
>> matched (or more than one part, if you want to increase relevance).
>> This could, however, require the definition of a specific hashing
>> mechanism if you want to find approximate (distanced) keys (e.g.
>> "intenational", with the "r" missing, would still find "IBM"). A
>> second way (more coupled to the concept of acronym, so less generic)
>> would be to treat every word starting with a capital letter as
>> part of an acronym, buffering sequences of such words and eventually
>> injecting the resulting acronym if it is found in the acronym
>> dictionary. This might not be safe, though: the user may not have the
>> discipline to capitalize the words that are part of an acronym (or
>> may even misspell the first letter), or the concatenated first
>> letters could match an irrelevant acronym (many word sequences can
>> give the acronym "IBM").
>>
>> I do not know whether there already exists some Lucene module which
>> processes acronyms, or if someone is working on one. It's definitely
>> worth a search though, because writing a good one from scratch could
>> mean a few days of work, or more.
>>
>> HTH.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
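
The two dictionary directions described in the quoted message above
("acronym to content" and "content to acronym") can be sketched in plain
Java as a forward map plus a reversed map. This is only an illustration
under assumptions: the class AcronymDictionary, its method names, and the
sample entries are hypothetical, nothing here is a Lucene API, and lookups
are exact, so a misspelled key such as "intenational" is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical acronym dictionary with a forward and a reversed map. */
final class AcronymDictionary {
    // Forward map: acronym -> phrases it may stand for.
    private final Map<String, List<String>> phrasesByAcronym = new HashMap<>();
    // Reversed map: phrase word -> acronyms whose phrases contain that word.
    private final Map<String, Set<String>> acronymsByWord = new HashMap<>();

    void add(String acronym, String phrase) {
        String key = acronym.toUpperCase(Locale.ROOT);
        phrasesByAcronym.computeIfAbsent(key, k -> new ArrayList<>()).add(phrase);
        for (String word : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            acronymsByWord.computeIfAbsent(word, k -> new TreeSet<>()).add(key);
        }
    }

    /** Acronym to content: the distinct words of every phrase for the acronym. */
    Set<String> expand(String acronym) {
        Set<String> words = new LinkedHashSet<>(); // drops duplicates, keeps order
        List<String> phrases = phrasesByAcronym.getOrDefault(
                acronym.toUpperCase(Locale.ROOT), Collections.<String>emptyList());
        for (String phrase : phrases) {
            words.addAll(Arrays.asList(phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")));
        }
        return words;
    }

    /** Content to acronym: acronyms matched by at least minParts distinct query words. */
    Set<String> acronymsFor(String query, int minParts) {
        Set<String> queryWords = new HashSet<>(
                Arrays.asList(query.trim().toLowerCase(Locale.ROOT).split("\\s+")));
        Map<String, Integer> hits = new TreeMap<>();
        for (String word : queryWords) {
            Set<String> acronyms =
                    acronymsByWord.getOrDefault(word, Collections.<String>emptySet());
            for (String acronym : acronyms) {
                hits.merge(acronym, 1, Integer::sum); // count matched parts per acronym
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : hits.entrySet()) {
            if (entry.getValue() >= minParts) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AcronymDictionary dictionary = new AcronymDictionary();
        dictionary.add("IBM", "International Business Machines");
        dictionary.add("IMF", "International Monetary Fund");
        System.out.println(dictionary.expand("IBM"));                 // [international, business, machines]
        System.out.println(dictionary.acronymsFor("international", 1)); // [IBM, IMF]
    }
}
```

Raising minParts corresponds to requiring more than one matched part
before an acronym is injected, which the quoted message suggests as a way
to increase relevance.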
