Hi, I'm following Grant's advice on how to combine BooleanQuery and SpanQuery (http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E).
The strategy is to run a BooleanQuery, collect the set of matching document IDs, and then run a SpanQuery restricted to those documents. I need the Spans for each term separately so that I can extract their respective payloads, but only in documents that contain all of the given terms (there may be several).

My code looks like this:

    /* reader and terms are class variables and have been declared final earlier */
    IndexReader reader = ...;
    List<String> terms = ...;

    /* perform the BooleanQuery and store the document IDs in a BitSet */
    BitSet bits = new BitSet(reader.maxDoc());
    AllDocCollector collector = new AllDocCollector();
    BooleanQuery bq = new BooleanQuery();
    for (String term : terms)
        bq.add(new org.apache.lucene.search.RegexpQuery(new Term(config.getFieldname(), term)), Occur.MUST);
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.search(bq, collector);  // run the query so the collector receives the hits
    for (ScoreDoc doc : collector.getHits())
        bits.set(doc.doc);

    /* get the spans for each term separately */
    for (String term : terms) {
        String payloads = retrieveSpans(term, bits);
        // process and print payloads for term ...
    }

    String retrieveSpans(String term, BitSet bits) throws IOException {
        StringBuilder payloads = new StringBuilder();
        Map<Term, TermContext> termContexts = new HashMap<>();
        Spans spans;
        SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(new RegexpQuery(new Term("text", term))).rewrite(reader);
        for (AtomicReaderContext atomic : reader.leaves()) {
            spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
            while (spans.next()) {
                // extract and store payloads in the 'payloads' StringBuilder
            }
        }
        return payloads.toString();
    }

This construction seemed to work fine at first, but I noticed a disturbing behaviour: for many terms, the BooleanQuery, when fed with only one RegexpQuery, matches a larger number of documents than the SpanQuery constructed from the same RegexpQuery.
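To spell out the relation I would expect between the two result sets, here is a minimal sketch that uses plain java.util.BitSet instead of Lucene. The class name, the helper methods, and the document IDs are all made up for illustration. The conjunction stands in for a BooleanQuery whose clauses are all Occur.MUST, and each per-term set stands in for the documents the corresponding SpanQuery matches on its own.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only: no Lucene classes are involved.
class DocSetExpectation {

    /** Builds a document-ID set from the given (made-up) IDs. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    /** Intersects the per-term sets, like a BooleanQuery with only MUST clauses.
     *  Assumes the list contains at least one set. */
    static BitSet conjunction(List<BitSet> perTermDocs) {
        BitSet result = (BitSet) perTermDocs.get(0).clone();
        for (BitSet termDocs : perTermDocs) {
            result.and(termDocs);
        }
        return result;
    }

    /** Returns true if every ID in 'subset' is also contained in 'superset'. */
    static boolean isSubset(BitSet subset, BitSet superset) {
        BitSet rest = (BitSet) subset.clone();
        rest.andNot(superset);
        return rest.isEmpty();
    }

    public static void main(String[] args) {
        BitSet termA = docs(1, 3, 5, 8);
        BitSet termB = docs(3, 4, 8, 9);

        // One clause: the conjunction is exactly the term's own document set.
        System.out.println(conjunction(List.of(termA)).equals(termA)); // true

        // Several clauses: the conjunction can only shrink.
        BitSet both = conjunction(List.of(termA, termB));
        System.out.println(both);                  // {3, 8}
        System.out.println(isSubset(both, termA)); // true
        System.out.println(isSubset(both, termB)); // true
    }
}
```

In this picture the BooleanQuery side can never hold more documents than any single term's set, which is the opposite of what I observe.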
When the BooleanQuery contains only one RegexpQuery, the two document counts should be identical; when several queries are added to the BooleanQuery, the SpanQuery should return an equal or larger number of results. The behaviour is reliably reproducible, even after re-indexing, but it does not affect all tokens.

Does anyone have an explanation for that?

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform