: >>von Willebrand<< is not the query but a document in the index.... The task
: is to detect exact matches of phrases inside a query (large document) with
: these phrases stored in the index.

Lemme see if I can restate your problem...

You want to build a data repository into which you insert a large number of "concepts", where a concept is a short phrase consisting of a few words (possibly just one word). The words in any given concept phrase may overlap with (or be a superset of) the words in other concepts. Once this concept repository is built, you want to build a black box around it, such that people can hand your black box a "document" (ie: a research paper, a newspaper article, a short story, ... some text consisting of many, many sentences) and have it return the list of concepts that match the input document, such that the concepts with the highest scores are the ones whose phrase appears exactly in the input document. Concepts whose phrase doesn't appear exactly in the document should still be returned, but with a lower score based on how many words in the concept's phrase are found in the input document.

(Have I adequately described your problem?)

It's an interesting idea. Can it be done with Lucene? ... I can think of one kludgy mechanism for doing it, but I'd be very surprised if there isn't a better way (or if there isn't some other software library out there that would be better suited).

Build a permanent index in which each concept is a Lucene Document. These documents really only need one stored/tokenized/indexed field containing the phrase (if you want other payload fields, that's up to you).

Each time you are asked to analyze a text sample and return matching phrases, run the text through your analyzer to get back a TokenStream, and for each of those tokens, use a TermDocs iterator to find out if any phrase in your concept index contains that term, and if so which ones.
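
To make the per-token lookup concrete, here is a rough sketch in plain Java. It is not Lucene code: a HashMap of posting lists stands in for the concept index and for the TermDocs iteration, and the names ConceptIndex, tokenize, add, phrase and candidates are all made up for illustration. The tokenizer (lowercase, split on non-alphanumerics) is likewise only an assumed stand-in for whatever analyzer you actually use.

```java
import java.util.*;

// Plain-Java stand-in for the concept index (NOT the Lucene API): each token
// maps to the ids of the concepts whose phrase contains it.
class ConceptIndex {
    private final List<String> phrases = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical analyzer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Store one concept phrase and record it in the posting list of each of its tokens.
    int add(String phrase) {
        int id = phrases.size();
        phrases.add(phrase);
        for (String token : tokenize(phrase)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    String phrase(int id) {
        return phrases.get(id);
    }

    // Union of the posting lists for every distinct token in the input text.
    // No scoring happens here; the result is just the set of candidate concepts.
    Set<Integer> candidates(String text) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : new HashSet<>(tokenize(text))) {
            Set<Integer> ids = postings.get(token);
            if (ids != null) hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        ConceptIndex index = new ConceptIndex();
        index.add("von Willebrand factor");
        index.add("blood clotting");
        index.add("heart attack");
        for (int id : index.candidates("von Willebrand disease impairs clotting")) {
            System.out.println(index.phrase(id));
        }
    }
}
```

The cost of this step grows with the number of distinct tokens in the input text and the length of their posting lists, not with the total number of concepts.
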
(You could also do this by building a boolean OR query out of all the words in your input document -- but that may run into performance limitations if your input docs are too big, and it will try to score each concept, which isn't necessary, so even for short input text it's less efficient.)

Now you have an (unordered) list of concepts that have something to do with your input text. Next, build a RAMDirectory based index consisting of exactly one document, which you build from the input text. Loop over that list of concepts you got, and build a boolean query out of each one along the lines that Daniel described: a phrase query on the whole concept phrase along with term queries for each individual word -- all optional. Run each of these boolean queries against your one-document RAMDirectory. The higher the score, the better that concept applies to your input text.

-Hoss
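
For illustration, the ranking that the per-concept boolean query is meant to produce can be mimicked in plain Java. This is a deliberately simplified stand-in, not Lucene's scoring formula: each concept word found in the text contributes an equal share of 1.0 (the optional term clauses), and an exact contiguous match of the whole phrase adds a bonus of 1.0 (the optional phrase clause). The class ConceptScorer, its methods, and both weights are assumptions made up for this sketch.

```java
import java.util.*;

// Simplified stand-in for "optional phrase query + optional term queries"
// run against a single document. NOT Lucene's actual scoring.
class ConceptScorer {

    // Hypothetical analyzer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // True if the needle tokens occur as one contiguous run inside the haystack tokens.
    static boolean containsPhrase(List<String> haystack, List<String> needle) {
        return !needle.isEmpty() && Collections.indexOfSubList(haystack, needle) >= 0;
    }

    // Fraction of concept words present in the text, plus 1.0 for an exact phrase match.
    static double score(String concept, String text) {
        List<String> doc = tokenize(text);
        List<String> words = tokenize(concept);
        if (words.isEmpty()) return 0.0;
        Set<String> docTerms = new HashSet<>(doc);
        int matched = 0;
        for (String w : words) {
            if (docTerms.contains(w)) matched++;
        }
        double score = (double) matched / words.size(); // term clauses
        if (containsPhrase(doc, words)) score += 1.0;   // phrase clause
        return score;
    }

    public static void main(String[] args) {
        String text = "Deficiency of von Willebrand factor causes bleeding.";
        for (String concept : new String[] {
                "von Willebrand factor", "von Willebrand disease", "heart attack"}) {
            System.out.println(concept + " -> " + score(concept, text));
        }
    }
}
```

With this weighting, an exact phrase match always outranks a concept that merely shares words with the text, which is the ordering asked for. Real Lucene scores would also reflect term frequency and length normalization, so the absolute numbers will differ.
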