My organization is trying to solve a difficult problem, and I believe Lucene is a close fit (although perhaps it is not). However, I'm not sure exactly how to approach it.
The problem is this: given a small set of fixed noun phrases and a much larger set of human-generated short sentences, determine which noun phrases, if any, each sentence refers to.

For example, perhaps I have these noun phrases:

1. Bright yellow book
2. Large bulbous balloon
3. Green plaid shirt with stripes
4. Dark yellow book

And these sentences:

1. Yesterday I put on my green plaid shirt.
2. Next week I'll sell my balloon.
3. Just finished my bright book.
4. Wondering at how lovely my baloon is [Note the misspelling]

Given that list of sentences, I want to generate (sentence, noun phrase) ordered pairs like this:

1,3
2,2
3,1
4,2

Or even ordered pairs of (sentence, [noun phrases]), e.g.

3,[1,4]

(because there might be an ambiguous reference to "book").

The "shape" of this problem looks a lot like what Lucene does, but frankly I don't have much experience with text indexing and search. I've installed Lucene and managed to index and search my data structures; however, with the StandardAnalyzer I'm getting a lot of false positives.

Here is the code I have so far (I've elided the parsing code, which is not very interesting): https://gist.github.com/1150723

I'd really appreciate any and all guidance. Thanks.
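To pin down the behaviour I'm after, here is a rough sketch in plain Java, deliberately without Lucene. It is only an illustration of the desired output, not the code from the gist and not a proposed solution: the class name `PhraseMatcher`, the tiny stopword list, and the "edit distance of at most 1 for words of 4+ letters" rule are all assumptions I made up for this example. Each noun phrase is scored by how many of its words appear (approximately) in the sentence, and the best-scoring phrases are returned, so ties produce the ambiguous (sentence, [noun phrases]) form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PhraseMatcher {
    // Hypothetical stopword list, just enough for the examples above.
    static final Set<String> STOPWORDS = Set.of("with", "the", "a", "an", "of");

    // Lowercase, split on anything that is not a letter, drop stopwords.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // Classic Levenshtein distance, keeping only two rows of the table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j], cur[j - 1]) + 1);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Assumption: short words must match exactly; longer ones may differ by one edit.
    static boolean similar(String a, String b) {
        if (a.equals(b)) return true;
        return a.length() >= 4 && b.length() >= 4 && editDistance(a, b) <= 1;
    }

    // Number of phrase words that have a similar word in the sentence.
    static int score(List<String> phrase, List<String> sentence) {
        int hits = 0;
        for (String p : phrase) {
            for (String s : sentence) {
                if (similar(p, s)) { hits++; break; }
            }
        }
        return hits;
    }

    // 1-based indexes of the best-scoring phrases; empty if nothing matches.
    static List<Integer> match(String sentence, List<String> phrases) {
        List<String> sentenceTokens = tokens(sentence);
        int best = 0;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < phrases.size(); i++) {
            int s = score(tokens(phrases.get(i)), sentenceTokens);
            if (s > best) { best = s; result.clear(); }
            if (s > 0 && s == best) result.add(i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of("Bright yellow book", "Large bulbous balloon",
                "Green plaid shirt with stripes", "Dark yellow book");
        List<String> sentences = List.of("Yesterday I put on my green plaid shirt.",
                "Next week I'll sell my balloon.", "Just finished my bright book.",
                "Wondering at how lovely my baloon is");
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println((i + 1) + "," + match(sentences.get(i), phrases));
        }
    }
}
```

This brute-force comparison obviously won't scale to the real sentence set, which is why I'm hoping Lucene (or something like it) can do the equivalent against an index while keeping the false positives down.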