Thanks for the response. I don't necessarily know ahead of time what entities will be searched on, or even whether the things being searched for are truly entities. Still, for the many cases where I do know what users want to search on, this could help...
-Michael

From: Tri Cao [mailto:tm...@me.com]
Sent: Friday, July 11, 2014 11:25 AM
To: java-user@lucene.apache.org
Cc: java-user@
Subject: Finding words not followed by other words

This is actually a tough problem in general: polysemy sense disambiguation. In your case, I think you'll probably need to do some named entity resolution to differentiate "George Washington" from "George Washington Carver", as they are two different entities.

Do you have a list of all the entity names in your corpus (either manually curated or derived by some pattern matching)? If you do, one thing you can do is write a tokenizer that emits one token for each entity. So, for example, the string "George Washington" emits a token like _George_Washington_, "George Washington Carver" emits _George_Washington_Carver_, etc.

There are some open source NLP libraries that do this, but the quality varies, as it will most likely depend on your domain and training data set.

Hope this helps,
Tri

On Jul 11, 2014, at 07:20 AM, Michael Ryan <mr...@moreover.com<mailto:mr...@moreover.com>> wrote:

I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find documents that mention "George Washington". It's okay if they also mention "George Washington Carver", but I don't want documents that only mention "George Washington Carver". So simply doing something like this does not solve it:

"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various types of span queries, but none of them seem to do this. I think it should be theoretically possible given the position data that Lucene stores...

-Michael
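
The one-token-per-entity idea from Tri's reply can be illustrated with a minimal sketch in plain Java. This is not Lucene code and not anyone's actual implementation: the class name EntityTokenizerSketch, the ENTITIES list, and the helper methods are all hypothetical, and the sketch assumes a small hand-curated entity list with exact, case-sensitive matching. At each word position it picks the longest known entity, so "George Washington Carver" never also produces a _George_Washington_ token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; a real Lucene setup would need a custom
// Tokenizer or TokenFilter, which is not shown here.
class EntityTokenizerSketch {

    // Assumed, hand-curated entity list; a real one would come from your corpus.
    static final List<String> ENTITIES = Arrays.asList(
            "George Washington Carver",
            "George Washington");

    // Splits the text into words, then replaces the longest known entity that
    // starts at each position with a single token such as _George_Washington_.
    // Words that are not part of an entity are emitted lowercased.
    static List<String> tokenize(String text, List<String> entities) {
        String[] words = text.replaceAll("[^\\p{L}\\p{N}\\s]", " ").trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (words[i].isEmpty()) { // happens only for blank input
                i++;
                continue;
            }
            String[] best = null;
            for (String entity : entities) {
                String[] parts = entity.split("\\s+");
                if (matchesAt(words, i, parts) && (best == null || parts.length > best.length)) {
                    best = parts;
                }
            }
            if (best != null) {
                tokens.add("_" + String.join("_", best) + "_");
                i += best.length; // skip past the whole entity
            } else {
                tokens.add(words[i].toLowerCase(Locale.ROOT));
                i++;
            }
        }
        return tokens;
    }

    // True if the words starting at pos are exactly the given entity parts.
    static boolean matchesAt(String[] words, int pos, String[] parts) {
        if (pos + parts.length > words.length) {
            return false;
        }
        for (int k = 0; k < parts.length; k++) {
            if (!words[pos + k].equals(parts[k])) {
                return false;
            }
        }
        return true;
    }

    // True if the text contains the given entity token after tokenization.
    static boolean mentions(String text, String entityToken) {
        return tokenize(text, ENTITIES).contains(entityToken);
    }

    public static void main(String[] args) {
        String[] docs = {
            "George Washington Carver blah blah blah.",
            "George Washington blah blah blah.",
            "George Washington Carver blah blah blah. George Washington blah blah blah."
        };
        for (int d = 0; d < docs.length; d++) {
            System.out.println((d + 1) + ": " + mentions(docs[d], "_George_Washington_"));
        }
    }
}
```

With tokens like these in the index, a plain term query on _George_Washington_ would match documents 2 and 3 but not 1. As noted at the top of the thread, this only helps when the entities are known ahead of time.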