Hi,

in general you cannot cache a Filter itself; you can cache the DocIdSets it 
produces (CachingWrapperFilter, for example, does exactly this). Lucene queries 
are executed per segment: when you index new documents or update existing 
ones, Lucene creates new index segments, and older segments *never* change. A 
DocIdSet (e.g. implemented by FixedBitSet) can therefore be tied to a specific 
segment of the index that never changes. Only deletions may be added to a 
segment, but that is transparent to the filter: the deletions (passed as 
acceptDocs to getDocIdSet) and the cached BitSet just need to be ANDed 
together (by the way, deletions in Lucene are just a Filter, too).
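The ANDing of a cached per-segment bitset with that segment's deletions can be 
illustrated with plain java.util.BitSet (Lucene itself uses FixedBitSet and the 
Bits interface for acceptDocs, but the logic is the same):

```java
import java.util.BitSet;

public class AndDeletions {
    public static void main(String[] args) {
        int maxDoc = 8;

        // Cached filter result for one (immutable) segment: docs 1, 3 and 5 match.
        BitSet cachedFilterBits = new BitSet(maxDoc);
        cachedFilterBits.set(1);
        cachedFilterBits.set(3);
        cachedFilterBits.set(5);

        // Live docs for the same segment: every doc except 3, which was
        // deleted after the filter was cached. Deletions never invalidate
        // the cached bitset.
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc);
        liveDocs.clear(3);

        // Apply deletions by ANDing; Lucene's BitsFilteredDocIdSet.wrap()
        // does the equivalent lazily at iteration time.
        BitSet result = (BitSet) cachedFilterBits.clone();
        result.and(liveDocs);

        System.out.println(result); // {1, 5}
    }
}
```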

Of course, after a while Lucene merges segments using its MergePolicy, because 
otherwise there would be too many of them. In that case several smaller 
segments (preferably those with many deletions) get merged into larger ones by 
the indexer. This is the only case in which some *new* DocIdSets need to be 
created. Large segments are unlikely to be merged unless they accumulate many 
deletions (caused by updates or explicit deletes). This approach is used by 
Solr and Elasticsearch; CachingWrapperFilter is an example of how to do it in 
your own code.

To implement this:
- Don't cache a bitset for the whole index; that would indeed force you to 
recalculate the bitsets over and over.
- In YourFilter.getDocIdSet(), look up whether the coreCacheKey of the given 
AtomicReaderContext.reader() is in your cache. If yes, reuse the cached 
DocIdSet (deletions are not relevant, you just have to apply them with 
BitsFilteredDocIdSet.wrap(cachedDocIdSet, acceptDocs)). If it's not in the 
cache, recalculate the bitset for the given AtomicReaderContext (not the whole 
index), cache it, and return it as a DocIdSet instance.
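A simplified, Lucene-free sketch of that caching pattern (the Object key stands 
in for what AtomicReader.getCoreCacheKey() returns, and java.util.BitSet stands 
in for FixedBitSet; a real implementation would extend Lucene's Filter class):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

public class PerSegmentFilterCache {
    // Weak keys: when a segment's reader is closed and its core key is
    // garbage-collected, the cached bitset goes away with it
    // (CachingWrapperFilter uses the same trick).
    private final Map<Object, BitSet> cache = new WeakHashMap<>();
    int recomputations = 0; // counter, just to show the cache working

    BitSet getDocIdSet(Object coreCacheKey, BitSet liveDocs,
                       Function<Object, BitSet> computeBits) {
        // Recalculate the bitset for this one segment only on a cache miss.
        BitSet cached = cache.computeIfAbsent(coreCacheKey, key -> {
            recomputations++;
            return computeBits.apply(key);
        });
        BitSet result = (BitSet) cached.clone(); // never mutate the cached copy
        result.and(liveDocs);                    // apply the current deletions
        return result;
    }

    public static void main(String[] args) {
        PerSegmentFilterCache filter = new PerSegmentFilterCache();
        Object segmentKey = new Object(); // stands in for getCoreCacheKey()

        // Hypothetical filter logic: docs 0 and 2 of a 4-doc segment match.
        Function<Object, BitSet> compute = key -> {
            BitSet bits = new BitSet(4);
            bits.set(0);
            bits.set(2);
            return bits;
        };

        // Doc 2 was deleted after the bitset was cached.
        BitSet liveDocs = new BitSet(4);
        liveDocs.set(0, 4);
        liveDocs.clear(2);

        filter.getDocIdSet(segmentKey, liveDocs, compute);
        BitSet second = filter.getDocIdSet(segmentKey, liveDocs, compute);
        System.out.println(second + " recomputed "
                + filter.recomputations + " time(s)"); // {0} recomputed 1 time(s)
    }
}
```

The second lookup is a cache hit, so the bitset is computed once; only the AND 
with the current live docs happens per request.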

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Sandeep Khanzode [mailto:sandeep_khanz...@yahoo.com.INVALID]
> Sent: Tuesday, August 12, 2014 8:53 AM
> To: Lucene Users
> Subject: BitSet in Filters
> 
> Hi,
> 
> The current usage of BitSets in filters in Lucene is limited to applying only 
> on
> docIDs i.e. I can only construct a filter out of a BitSet if I have the
> DocumentIDs handy.
> 
> However, with every update/delete i.e. CRUD modification, these will
> change, and I have to again redo the whole process to fetch the latest
> docIDs.
> 
> Assume a scenario where I need to tag millions of documents with a tag like
> "Finance", "IT", "Legal", etc.
> 
> Unless, I can cache these filters in memory, the cost of constructing this 
> filter
> at run time per query is not practical. If I could map the documents to a
> numeric long identifier and put them in a BitMap, I could then cache them
> because the size reduces drastically. However, I cannot use this numeric long
> identifier in Lucene filters because it is not a docID but another regular 
> field.
> 
> Please help with this scenario. Thanks,
> 
> -----------------------
> Thanks n Regards,
> Sandeep Ramesh Khanzode

