Re: AW: Part-Of Match

Chris Hostetter Sun, 15 Jan 2006 13:14:53 -0800

: >>von Willebrand<< is not the query but a document in the index.... The task
: is to detect exact matches of phrases inside a query (large document) with
: these phrases stored in the index.


Lemme see if i can restate your problem...

You want to build a data repository in which you insert a large number
of "concepts" where a concept is a short phrase consisting of a few words
(possibly just one word).  The words in any given concept phrase may
overlap with (or be a superset of) the words in other concepts.

Once this concept repository is built, you want to build a black box
around it, such that people can hand your black box a "document"
(ie: a research paper, a newspaper article, a short story, ...
some text consisting of many many sentences) and you want your black box
to then return the list of concepts that match the input document, such
that the concepts with the highest score are concepts whose phrase appears
exactly in the input document.  Concepts whose phrase doesn't appear
exactly in the document should still be returned, but with a lower score
based on how many words in the concept's phrase are found in the input
document.

        (have i adequately described your problem?)

It's an interesting idea.  Can it be done with lucene? ... i can think of
one kludgy mechanism for doing it but i'd be very surprised if there isn't
a better way (or if there is some other software library out there that
would be more suited)

Build a permanent index in which each concept is a Lucene Document.
These documents really only need one stored/tokenized/indexed field
containing the phrase (if you want other payload fields that's up to you).

Each time you are asked to analyze a text sample and return matching
phrases, run the text through your analyzer to get back a TokenStream, and
for each of those tokens, use a TermDocs iterator to find out if any
phrase in your concept index contains that term, and if so which ones.
(You could also do this by building a boolean OR query out of all the
words in your input document -- but that may run into performance
limitations if your input docs are too big, and it will try to score each
concept, which isn't necessary, so even for short input text it's less
efficient).

Now you have an (unordered) list of concepts that have something to do
with your input text.
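
To make the lookup step concrete, here is a rough sketch in plain Python
rather than Lucene.  Everything in it is an assumption for illustration:
the tokenize() helper stands in for whatever analyzer you use, and the
term -> concept-id dictionary stands in for the postings that a TermDocs
iterator would walk.  None of these names are Lucene APIs.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase word tokens only.
    return re.findall(r"\w+", text.lower())


def build_concept_index(concepts):
    # Map each term to the ids of the concepts whose phrase contains it.
    # This plays the role of the permanent concept index.
    postings = defaultdict(set)
    for concept_id, phrase in enumerate(concepts):
        for term in tokenize(phrase):
            postings[term].add(concept_id)
    return postings


def candidate_concepts(postings, text):
    # For each distinct term of the input text, collect every concept that
    # contains it.  No scoring happens here; the result is an unordered set.
    candidates = set()
    for term in set(tokenize(text)):
        candidates |= postings.get(term, set())
    return candidates


concepts = ["von Willebrand", "von Willebrand factor", "blood clotting"]
postings = build_concept_index(concepts)
hits = candidate_concepts(postings, "The patient lacks Willebrand protein.")
```

Here hits contains the ids of the two "von Willebrand" concepts, since
both share a word with the input, and nothing has been scored yet.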

Next build a RAMDirectory-based index consisting of exactly one document
which you build from the input text.  Loop over that list of concepts you
got, and build a boolean query out of each one along the lines that
Daniel described: a phrase query on the whole concept phrase along with
term queries for each individual word -- all optional.  Run each of these
boolean queries against your one-document RAMDirectory index.  The higher
the score, the better that concept applies to your input text.
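
The ranking idea can be sketched the same way in plain Python.  This is
not Lucene's scoring formula, just a simplified stand-in under stated
assumptions: each concept word found in the text contributes a share of
1.0 (the optional term clauses), and an exact contiguous phrase match adds
a flat bonus of 1.0 (the optional phrase clause).  The helper names and
the bonus value are hypothetical.

```python
import re


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase word tokens only.
    return re.findall(r"\w+", text.lower())


def contains_phrase(doc_tokens, phrase_tokens):
    # True if the phrase tokens occur contiguously, in order, in the text.
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def score_concept(phrase, doc_tokens):
    phrase_tokens = tokenize(phrase)
    if not phrase_tokens:
        return 0.0
    doc_terms = set(doc_tokens)
    distinct = set(phrase_tokens)
    # Term clauses: fraction of the concept's distinct words found in the text.
    score = sum(1 for term in distinct if term in doc_terms) / len(distinct)
    # Phrase clause: assumed flat bonus so exact matches always rank first.
    if contains_phrase(doc_tokens, phrase_tokens):
        score += 1.0
    return score


def rank_concepts(candidates, text):
    # Score every candidate against the single input text, best first.
    doc_tokens = tokenize(text)
    scored = [(score_concept(phrase, doc_tokens), phrase) for phrase in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_concepts(
    ["von Willebrand factor", "von Willebrand", "blood clotting"],
    "Patients with von Willebrand disease have blood that clots slowly.",
)
```

With this input the exact match "von Willebrand" comes out on top, and
the partial matches follow in order of how many of their words appear.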


-Hoss

