Boolean and SpanQuery: different results

Carsten Schnober Thu, 13 Dec 2012 07:49:39 -0800

Hi,
I'm following Grant's advice on how to combine BooleanQuery and
SpanQuery
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E).


The strategy is to run a BooleanQuery, collect the IDs of the matching
documents, and then run a SpanQuery restricted to those documents. The
purpose is that I need to retrieve Spans for different terms in order to
extract their respective payloads separately, but only from documents in
which all of the (possibly multiple) terms occur. My code looks like this:

/* reader and terms are class variables and have been declared final
earlier */
IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new org.apache.lucene.search.RegexpQuery(
      new Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);  // run the query so the collector receives the hits
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

String retrieveSpans(String term, BitSet bits) {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(
      new RegexpQuery(new Term("text", term))).rewrite(reader);

  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}


This construction seemed to work fine at first, but I noticed a
disturbing behaviour: for many terms, a BooleanQuery containing only one
RegexpQuery matches more documents than the SpanQuery constructed from
the same RegexpQuery.
With only one RegexpQuery in the BooleanQuery, the two numbers should be
identical; with multiple queries added to the BooleanQuery, the
SpanQuery should match an equal or larger number of documents. The
behaviour is reliably reproducible, even after re-indexing, but it does
not occur for all tokens.
Does anyone have an explanation for that?
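
To make the expectation concrete, here is a minimal sketch using plain
java.util.BitSet, with no Lucene involved. The class and method names
(DocSetExpectation, intersectAll, missingFrom, bitsOf) and the document
IDs are made up for illustration; they only model the set logic I would
expect, not what Lucene does internally.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class DocSetExpectation {

    /**
     * Intersects per-term document sets, modelling a BooleanQuery whose
     * clauses are all Occur.MUST. With an empty list, every document matches.
     */
    public static BitSet intersectAll(List<BitSet> perTermDocs, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);              // start with all documents
        for (BitSet docs : perTermDocs) {
            result.and(docs);               // keep only documents matching this term too
        }
        return result;
    }

    /** Returns the doc IDs that are in booleanDocs but not in spanDocs. */
    public static BitSet missingFrom(BitSet booleanDocs, BitSet spanDocs) {
        BitSet missing = (BitSet) booleanDocs.clone();
        missing.andNot(spanDocs);
        return missing;
    }

    /** Convenience helper: builds a BitSet from a list of doc IDs. */
    public static BitSet bitsOf(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        int maxDoc = 8;
        BitSet termA = bitsOf(0, 2, 3, 5);  // hypothetical matches for term A
        BitSet termB = bitsOf(2, 3, 4, 7);  // hypothetical matches for term B

        BitSet both = intersectAll(Arrays.asList(termA, termB), maxDoc);
        System.out.println("MUST(A, B):        " + both);

        // Expected to be empty: every document matching all terms must
        // also match each single term.
        System.out.println("missing from A:    " + missingFrom(both, termA));
        System.out.println("missing from B:    " + missingFrom(both, termB));
    }
}
```

In my case, the equivalent of missingFrom() applied to the real document
sets is not empty for many terms, which is what I cannot explain.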

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

