Hi,

The code here ignores the positions of the PhraseQuery (PQ):

    int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed. To account for this, the overall extra gap can be added to the slop:

    int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1); // (+/- boundary cases)
    slop += gap;

I think this is less accurate than PQ, because it does not specify the exact position of the stop word. For example, assume the original text is "A B S D", where S is a stop word:

    PQ:
        "A B S D" would match
        "A S B D" would not
    Span Near query:
        both would match

Perhaps there's a way around this too that I am not aware of.

Also, this suggestion oversimplifies for the case where the analyzer in effect may emit more than one term at the same position, for example when expanding the query with synonyms, or when keeping both originals and stemmed forms. In that case just comparing pp[0] and pp[pp.length - 1] is insufficient, and the positions should be examined while looping over the phrase terms, something like this:

    int dpos = pp[i + 1] - pp[i]; // for 0 <= i < pp.length - 1
    if (dpos > 1) slop += (dpos - 1);

Haven't tested this, it's just to give you an idea of what to try next.

Doron

On Tue, Jan 31, 2012 at 10:48 PM, Paul Allan Hill <p...@metajure.com> wrote:

> In Lucene 3.4, I recently implemented "Translating PhraseQuery to
> SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to
> matter.
>
> Here is my exact code called from getFieldsQuery once I know I'm looking
> at a PhraseQuery, but I think it is exactly from the book.
>
> static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
>     Term[] terms = phraseQ.getTerms();
>     SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
>     for (int i = 0; i < terms.length; i++) {
>         clauses[i] = new SpanTermQuery(terms[i]);
>     }
>     SpanNearQuery query = new SpanNearQuery(clauses, slop,
>             PHRASE_ORDER_MATTERS);
>     return query;
> }
>
> I put in my own QueryParser and things looked good until I tried a phrase
> with stop words.
> Using the old PhraseQuery I got results on a phrase with stop words
> without extending the slop, but with SpanNearQuery, unless the query
> includes some slop, nothing is found.
> This conflicts with the typical use case of a user taking a phrase,
> pasting it into the search bar with quotes and expecting to find his document.
> I can't just add some more slop, because it depends on how many stop words
> are in any sequence in the phrase.
>
> Any suggestions on how to solve the problem of combining the idea of
> SpanNear (so that words in order in a phrase are better) with text that has
> stop words removed, so that I can support the simple use of quotes for
> exact quoted text matching?
>
> Any ideas?
>
> -Paul
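
P.S. A minimal sketch of the per-term gap loop suggested in the reply, written as plain Java with no Lucene dependency so it can be tried on its own. The class and method names (PhraseSlop, extraSlop) are hypothetical. The input array stands in for what PhraseQuery.getPositions() returns, and the result is meant to be added to the slop passed to SpanNearQuery. Like the suggestion itself, this is an idea to try and has not been checked against a real index.

```java
/** Hypothetical helper: derives extra slop from phrase term positions. */
final class PhraseSlop {

    private PhraseSlop() {}

    /**
     * Sums the position gaps left between consecutive phrase terms,
     * e.g. by removed stop words.
     *
     * @param positions term positions in phrase order (assumed to be the
     *                  array returned by PhraseQuery.getPositions())
     * @return the number of skipped positions, to be added to the slop
     */
    static int extraSlop(int[] positions) {
        int slop = 0;
        for (int i = 0; i + 1 < positions.length; i++) {
            int dpos = positions[i + 1] - positions[i];
            // dpos == 1: adjacent terms, no gap.
            // dpos == 0: terms stacked at one position (e.g. synonyms), no gap.
            if (dpos > 1) {
                slop += dpos - 1;
            }
        }
        return slop;
    }
}
```

For the phrase "A B S D" with S removed, the positions are {0, 1, 3} and the helper returns 1. Note that the same slop still lets "A S B D" match, so the loss of precision described in the reply remains.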