Thanks for the response. I don't necessarily know ahead of time what entities 
will be searched on, or even whether the things being searched for are truly 
entities. For the many cases where I do know what users want to search on, 
though, this could help...

-Michael

From: Tri Cao [mailto:tm...@me.com]
Sent: Friday, July 11, 2014 11:25 AM
To: java-user@lucene.apache.org
Cc: java-user@
Subject: Finding words not followed by other words

This is actually a tough problem in general: word sense disambiguation. In 
your case, I think you'll probably need to do some named entity resolution to 
differentiate "George Washington" from "George Washington Carver", as they are 
two different entities.

Do you have a list of all the entity names in your corpus (either manually 
curated or built by some pattern matching)? If you do, one thing you can do is 
write a tokenizer that emits one token for each entity. So, for example, the 
string "George Washington" emits a token like _George_Washington_, "George 
Washington Carver" emits _George_Washington_Carver_, etc.
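
As a rough illustration of that idea, here is a minimal sketch in plain Java. 
It is not a Lucene Tokenizer; the class EntityTokenizer and its methods are 
hypothetical names, and it assumes the entity list is already available and 
that splitting on whitespace is good enough. The longest entity name is tried 
first, so "George Washington Carver" never yields a _George_Washington_ token.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of Lucene): merges known entity names into single tokens. */
final class EntityTokenizer {

    /**
     * Splits text on whitespace and, wherever a known entity name starts,
     * emits one token such as _George_Washington_ instead of its words.
     * Longer entity names are tried first, so the longest match wins.
     */
    static List<String> tokenize(String text, List<String> entityNames) {
        // Pre-split each entity name into words, skipping blank names.
        List<String[]> entities = new ArrayList<>();
        for (String name : entityNames) {
            if (!name.trim().isEmpty()) {
                entities.add(name.trim().split("\\s+"));
            }
        }
        // Sort so that entities with more words come first.
        entities.sort(Comparator.comparingInt((String[] e) -> e.length).reversed());

        String[] words = text.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            String[] match = null;
            for (String[] entity : entities) {
                if (matchesAt(words, i, entity)) {
                    match = entity;
                    break;
                }
            }
            if (match == null) {
                // Ordinary word: emit it without punctuation, if anything is left.
                String word = stripPunctuation(words[i]);
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
                i++;
            } else {
                // Entity: emit a single underscore-joined token and skip its words.
                tokens.add("_" + String.join("_", match) + "_");
                i += match.length;
            }
        }
        return tokens;
    }

    /** True if the words starting at position start spell out the entity exactly. */
    private static boolean matchesAt(String[] words, int start, String[] entity) {
        if (start + entity.length > words.length) {
            return false;
        }
        for (int k = 0; k < entity.length; k++) {
            if (!stripPunctuation(words[start + k]).equals(entity[k])) {
                return false;
            }
        }
        return true;
    }

    /** Removes every character that is not a letter or a digit. */
    private static String stripPunctuation(String word) {
        return word.replaceAll("[^\\p{L}\\p{N}]", "");
    }
}
```

With tokens like these in the index, a plain term query for 
_George_Washington_ would match only documents that mention that entity on its 
own, assuming the same tokenization is applied at index time.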

There are some open source NLP libraries that do this, but the quality 
varies, as it will most likely depend on your domain and training data set.

Hope this helps,
Tri

On Jul 11, 2014, at 07:20 AM, Michael Ryan <mr...@moreover.com> wrote:
I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I 
want to find documents that mention "George Washington". It's okay if they also 
mention "George Washington Carver", but I don't want documents that only 
mention "George Washington Carver". So simply doing something like this does 
not solve it:
"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various 
types of span queries, but none of them seem to do this. I think it should be 
theoretically possible given the position data that Lucene stores...
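
To make the position idea concrete, here is a minimal sketch in plain Java. It 
does not use any Lucene API: the class PhraseNotFollowedBy and its methods are 
hypothetical names, and the crude lower-case split is only an assumed stand-in 
for an analyzer. It just shows that token positions are enough to separate the 
three documents above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration (not Lucene API) of a position-based phrase check. */
final class PhraseNotFollowedBy {

    /** Crude stand-in for an analyzer: lower-cases and splits on non-alphanumerics. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /**
     * Returns every token position where phrase starts and the token right
     * after the phrase is not excludedNext.
     */
    static List<Integer> find(String[] tokens, String[] phrase, String excludedNext) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean phraseHere = true;
            for (int k = 0; k < phrase.length; k++) {
                if (!tokens[i + k].equals(phrase[k])) {
                    phraseHere = false;
                    break;
                }
            }
            if (!phraseHere) {
                continue;
            }
            int next = i + phrase.length;
            boolean followedByExcluded = next < tokens.length && tokens[next].equals(excludedNext);
            if (!followedByExcluded) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** A document matches if the phrase occurs at least once without the excluded word after it. */
    static boolean matches(String text, String phrase, String excludedNext) {
        return !find(tokenize(text), tokenize(phrase), excludedNext.toLowerCase(Locale.ROOT)).isEmpty();
    }
}
```

Calling matches(text, "George Washington", "Carver") on the three documents 
gives false for document 1 and true for documents 2 and 3, which is the 
behavior I'm after.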

-Michael
