Re: Custom Tokenizer/Analyzer

Yann-Erwan Perio Thu, 20 Feb 2014 03:25:27 -0800

On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com> wrote:


Hi,

> My requirement is it should have capabilities to match multiple words as
> one token. for example. When user passes String as International Business
> machine logo or IBM logo it should return International Business Machine as
> one token and logo as one token.

This is an interesting problem. I suppose that if the user enters
"International Business Machines", possibly with some misspelling, you
want to find all documents containing "IBM" - and that if he enters
the string "IBM", you want to find documents which contain the string
"International Business Machines", or even only parts of it. So this
means you need some kind of map relating some acronyms with their
content parts. There really are two directions here: acronym to
content and content to acronym.

One cannot find what an acronym means without some kind of acronym
dictionary. This means that whatever approach you intend to use, there
should be an external dictionary involved, which maps each acronym
to a list of possible phrases. Retrieving all phrases matching
the entered acronym, you'd inject each part of each phrase as a token
(removing possible duplicates between phrase parts). That's basically
it for the direction "acronym to content".
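
To make the "acronym to content" direction a bit more concrete, here is
a minimal sketch in plain Java. It is not Lucene code: the class name
AcronymExpander and its methods are hypothetical, the dictionary is a
simple in-memory map, and the entries you put in it are up to you. It
only shows the lookup and the removal of duplicate phrase parts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class AcronymExpander {
    // Acronym dictionary: upper-cased acronym -> possible phrases.
    private final Map<String, List<String>> dictionary = new HashMap<>();

    public void add(String acronym, String phrase) {
        dictionary
            .computeIfAbsent(acronym.toUpperCase(Locale.ROOT), k -> new ArrayList<>())
            .add(phrase);
    }

    // Returns the lower-cased parts of every phrase mapped to the acronym,
    // in first-seen order and without duplicates. Unknown acronyms give
    // an empty set.
    public Set<String> expand(String acronym) {
        Set<String> tokens = new LinkedHashSet<>();
        List<String> phrases = dictionary.getOrDefault(
            acronym.toUpperCase(Locale.ROOT), Collections.<String>emptyList());
        for (String phrase : phrases) {
            for (String part : phrase.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    tokens.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return tokens;
    }
}
```

Each element of the returned set would then be injected as a token
next to (or instead of) the acronym, depending on what you want to
match.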

The direction "content to acronym" is trickier, I believe. One way is
to generate a second (reversed) map, matching each acronym content
part to a list of acronyms containing that part. You'd simply inject
acronyms (and possibly other things) if one part of their content is
matched (or more than one part, if you want to increase relevance).
This could, however, require a specific hashing mechanism if you
want to find approximate (distanced) keys (e.g. "intenational",
with the missing "r", would still find "IBM"). A
second way (more coupled to the concept of acronym, so less generic)
could be to consider that every word starting with a capital letter is
part of an acronym, buffering sequences of words starting with a
capital letter, and finally injecting the resulting acronym, if
found in the acronym dictionary. This might not be safe, though - the
user may not have the discipline to capitalize the words being part of
an acronym (or may even misspell the first letter), or concatenated
first letters could match an irrelevant acronym (many word sequences
can give the acronym "IBM").
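
Both ways of handling "content to acronym" can be sketched in plain
Java as follows. Again this is not Lucene code, and every name here
(ReverseAcronymIndex, acronymsFor, acronymsFromCapitalizedRuns) is
made up for illustration. For the approximate keys I use a plain
Levenshtein distance with a linear scan over the keys instead of a
specific hashing mechanism, which is simpler but would not scale to a
large dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class ReverseAcronymIndex {
    // Reversed map: lower-cased phrase part -> acronyms containing that part.
    private final Map<String, Set<String>> partToAcronyms = new HashMap<>();
    private final Set<String> knownAcronyms = new HashSet<>();

    public void add(String acronym, String phrase) {
        knownAcronyms.add(acronym);
        for (String part : phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            partToAcronyms.computeIfAbsent(part, k -> new TreeSet<>()).add(acronym);
        }
    }

    // First way: acronyms whose content has a part within maxDistance
    // edits of the given word (linear scan over all keys).
    public Set<String> acronymsFor(String word, int maxDistance) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : partToAcronyms.entrySet()) {
            if (editDistance(w, e.getKey()) <= maxDistance) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Second way: buffer the initials of consecutive capitalized words and
    // keep the result only if it is a known acronym.
    public List<String> acronymsFromCapitalizedRuns(String text) {
        List<String> found = new ArrayList<>();
        StringBuilder initials = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                initials.append(word.charAt(0));
            } else {
                flush(initials, found);
            }
        }
        flush(initials, found);
        return found;
    }

    private void flush(StringBuilder initials, List<String> found) {
        if (knownAcronyms.contains(initials.toString())) {
            found.add(initials.toString());
        }
        initials.setLength(0);
    }

    // Classic Levenshtein distance, computed row by row.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

With "IBM" registered for "International Business Machines", a call
such as acronymsFor("intenational", 1) still finds "IBM", and
acronymsFromCapitalizedRuns("International Business Machines logo")
yields "IBM". The second method shows the weakness mentioned above:
lower-case input yields nothing, and any capitalized run whose
initials happen to spell a known acronym is accepted.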

I do not know whether there already exists some Lucene module which
processes acronyms, or if someone is working on one. It's definitely
worth a search though, because writing a good one from scratch could
mean a few days of work, or more.

HTH.

