Re: Statically store sub-collections for search (faceted search?)

Carsten Schnober Mon, 15 Apr 2013 02:19:33 -0700

On 15.04.2013 10:42, Uwe Schindler wrote:

> Not every DocIdSet supports bits(). If it returns null, then bits are not 
> supported. To ensure a bitset is available, use CachingWrapperFilter (which 
> internally uses a BitSet to cache).
> It might also happen that Filter.getDocIdSet() returns null, which means that 
> no document matches the filter.


I've been using a ChainedFilter so far. I think this should also support
bits(), right?

> AcceptDocs in Lucene are generally all non-deleted documents. For your call 
> to Filter.getDocIdSet you should therefore pass AtomicReader.getLiveDocs() and 
> not Bits.MatchAllBits.

I see. As far as I understand the documentation, getLiveDocs() returns
null if there are no deleted documents and returns the Bits matching all
available (not deleted) documents otherwise:
"Returns the Bits representing live (not deleted) docs. A set bit
indicates the doc ID has not been deleted. If this method returns null
it means there are no deleted documents."
I understand that if there are no deleted documents, I need to replace
the result (null) with a Bits.MatchAllBits instance, right? If there are
deleted documents, however, I can pass on the result, which has the bits
of all available (not deleted) documents set.

> You are somehow "misusing" acceptDocs and DocIdSet here, so you have to take 
> care, semantics are different:
> - For acceptDocs "null" means "all documents allowed" -> no deleted documents
> - For DocIdSet "null" means "no documents matched"

Okay, as described above, I would now pass either the result of
getLiveDocs() or a Bits.MatchAllBits instance as the acceptDocs argument
to getDocIdSet():

Map<Term, TermContext> termContexts = new HashMap<>();
AtomicReaderContext atomic = ...
ChainedFilter filter = ...

Bits allDocs = atomic.reader().getLiveDocs();
if (allDocs == null) {
  // no deleted documents
  allDocs = new Bits.MatchAllBits(atomic.reader().maxDoc());
}
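DocIdSet docIdSet = filter.getDocIdSet(atomic, allDocs);
if (docIdSet == null) {
  // no documents matching filter
  continue; // skip this iteration
}
Bits bits = docIdSet.bits();
if (bits == null) {
  // bits() is not supported by this DocIdSet (it does not mean "no match")
  throw new IllegalStateException("DocIdSet does not support bits()");
}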
Spans spans = sq.getSpans(atomic, bits, termContexts);
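
To make the two null conventions concrete, here is a minimal sketch that
does not use Lucene at all; it only models the idea with java.util.BitSet.
The class name AcceptDocsSketch and the method combine() are hypothetical
and not part of the Lucene API. The assumption is that a null "live docs"
set means "no deletions" and a null filter result means "nothing matched",
as described in the quoted text.

```java
import java.util.BitSet;

public class AcceptDocsSketch {

    /**
     * Intersects the live documents with the filter's matches.
     *
     * liveDocs == null   : no deleted documents, so every doc in [0, maxDoc) is allowed
     * filterBits == null : the filter matched no document at all
     *
     * Returns null if nothing can match, otherwise the set of documents
     * that are both live and accepted by the filter.
     */
    public static BitSet combine(BitSet liveDocs, BitSet filterBits, int maxDoc) {
        if (filterBits == null) {
            return null; // DocIdSet-style null: no documents matched
        }
        BitSet result = new BitSet(maxDoc);
        if (liveDocs == null) {
            result.set(0, maxDoc); // acceptDocs-style null: all documents allowed
        } else {
            result.or(liveDocs);
        }
        result.and(filterBits); // documents rejected by the filter look "deleted"
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 5;
        BitSet filter = new BitSet(maxDoc);
        filter.set(1);
        filter.set(3);
        filter.set(4);

        // No deletions in the segment: the filter alone decides.
        System.out.println(combine(null, filter, maxDoc)); // prints {1, 3, 4}

        // Document 4 has been deleted: it must not be searched even though the filter accepts it.
        BitSet live = new BitSet(maxDoc);
        live.set(0, 4);
        System.out.println(combine(live, filter, maxDoc)); // prints {1, 3}

        // Filter matched nothing: the caller should skip this segment.
        System.out.println(combine(live, null, maxDoc)); // prints null
    }
}
```

In this model the combined set plays the role of the bits passed to
getSpans(): a document whose bit is cleared is treated like a deleted one.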


> Finally: The trick here is to make Spans think that there are more deleted 
> docs than AtomicReader returns as deleted docs (if you would directly pass 
> getLiveDocs() to getSpans()). The filter is applied to the deleted docs 
> BitSet.

Yep, I think I've tried to simulate that now. It is pretty hard to test
this systematically, so please let me know if you see an obvious flaw in
my code. Thanks!
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

