Can I use Lucene to solve this problem?

Josh Rehman Tue, 16 Aug 2011 20:05:03 -0700

My organization is trying to solve a difficult problem, and I believe that
Lucene is a close fit (although perhaps it is not). However, I'm not sure
exactly how to approach it.


The problem is this: given a small set of fixed noun phrases and a much
larger set of short, human-generated sentences, determine which noun
phrases (if any) each sentence refers to. For example, perhaps I have these
noun phrases:

   1. Bright yellow book
   2. Large bulbous balloon
   3. Green plaid shirt with stripes
   4. Dark yellow book

And these sentences:

   1. Yesterday I put on my green plaid shirt.
   2. Next week I'll sell my balloon.
   3. Just finished my bright book.
   4. Wondering at how lovely my baloon is [Note the misspelling]

Given that list of sentences, I want to generate (sentence, noun phrase)
ordered pairs like this:

   1,3
   2,2
   3,1
   4,2

Or even pairs of the form (sentence, [noun phrases]), e.g. 3,[1,4], because
"book" on its own is an ambiguous reference to noun phrases 1 and 4.
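
To make the desired output concrete, here is a minimal sketch in plain Java
(deliberately not Lucene) of the kind of matching described above. Everything
in it is an assumption made for illustration: the class name PhraseMatcher,
the tiny stop-word list, and the rule that words of five or more letters
match when they are within one edit of each other. It scores each noun
phrase by how many of its words appear in the sentence and reports the
best-scoring phrase numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PhraseMatcher {
    // Hypothetical stop list; a real one would be much larger.
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("a", "the", "my", "with", "i"));

    // Lowercase, split on anything that is not a letter, drop stop words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP.contains(t)) out.add(t);
        }
        return out;
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j], cur[j - 1]) + 1);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Exact match, or one edit apart for longer words (assumed threshold).
    static boolean similar(String a, String b) {
        return a.equals(b)
                || (a.length() >= 5 && b.length() >= 5 && editDistance(a, b) <= 1);
    }

    // Returns the 1-based numbers of the phrases sharing the best nonzero score.
    static List<Integer> match(String sentence, List<String> phrases) {
        List<String> words = tokens(sentence);
        List<Integer> best = new ArrayList<>();
        int bestScore = 0;
        for (int p = 0; p < phrases.size(); p++) {
            int score = 0;
            for (String term : tokens(phrases.get(p))) {
                for (String w : words) {
                    if (similar(term, w)) { score++; break; }
                }
            }
            if (score > bestScore) { bestScore = score; best.clear(); }
            if (score > 0 && score == bestScore) best.add(p + 1);
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "Bright yellow book", "Large bulbous balloon",
                "Green plaid shirt with stripes", "Dark yellow book");
        List<String> sentences = Arrays.asList(
                "Yesterday I put on my green plaid shirt.",
                "Next week I'll sell my balloon.",
                "Just finished my bright book.",
                "Wondering at how lovely my baloon is");
        for (int s = 0; s < sentences.size(); s++) {
            System.out.println((s + 1) + "," + match(sentences.get(s), phrases));
        }
    }
}
```

On the four example sentences this picks phrases 3, 2, 1 and 2 respectively,
and a bare "my book" matches both 1 and 4, which is the ambiguous case. It
compares every sentence against every phrase, so it only illustrates the
expected results and is not meant as a scalable approach.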

The "shape" of this problem looks a lot like what Lucene does, but frankly I
don't have a lot of experience with textual indexing and search. I've
installed Lucene and managed to index and search my data structures.
However, with the StandardAnalyzer I'm getting a lot of false positives.

Here is the code I have so far (I've elided the parsing code, which is not
very interesting):
  https://gist.github.com/1150723

Really appreciate any and all guidance. Thanks.
