[
https://issues.apache.org/jira/browse/LUCENE-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6077:
---------------------------------
Attachment: LUCENE-6077.patch
Here is a patch. It divides the work into two pieces:
- FilterCache, whose responsibility is to act as a per-segment cache for
filters, but which does not make any decision about which filters should be
cached
- FilterCachingPolicy, whose responsibility is to decide whether a filter is
worth caching given the filter itself, the current segment and the produced
(uncached) DocIdSet.
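To make the split concrete, here is a rough sketch of how the two abstractions
could fit together. The method names and signatures below are my illustration
of the description above, not necessarily what the patch uses:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;

    // Sketch only: the names and signatures below are guesses.
    interface FilterCachingPolicy {
      // Called every time a filter is used, so the policy can track usage.
      void onUse(Filter filter);
      // Decides whether this filter is worth caching on this segment, given
      // the DocIdSet that was just produced without the cache.
      boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set)
          throws IOException;
    }

    interface FilterCache {
      // Returns a wrapper around the given filter that caches DocIdSets on a
      // per-segment basis, consulting the policy before adding new entries.
      Filter doCache(Filter filter, FilterCachingPolicy policy);
    }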
FilterCache has an implementation called LRUFilterCache that accepts a maximum
size (in number of cached filters) and a maximum RAM usage, and evicts
least-recently-used filters first. It has protected methods that make it
possible to configure which impl should be used to cache DocIdSets
(RoaringDocIdSet by default) and how to measure the RAM usage of filters: the
default impl uses Accountable#ramBytesUsed if the filter implements
Accountable, and falls back to an arbitrary constant (1024) otherwise.
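For instance, assuming the protected hooks look something like the following
(the hook names and signatures are guesses based on the description above),
one could cache dense bit sets instead of the default RoaringDocIdSet while
keeping the described RAM estimation:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Accountable;
    import org.apache.lucene.util.BitDocIdSet;
    import org.apache.lucene.util.FixedBitSet;

    // Hypothetical subclass: the hook signatures are assumptions, not
    // necessarily what the patch exposes.
    public class BitSetFilterCache extends LRUFilterCache {

      public BitSetFilterCache() {
        super(256, 32 * 1024 * 1024); // 256 filters / 32MB, illustrative
      }

      @Override
      protected DocIdSet cacheImpl(DocIdSetIterator iterator, LeafReader reader)
          throws IOException {
        // cache as a dense bit set instead of the default RoaringDocIdSet
        final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        bits.or(iterator);
        return new BitDocIdSet(bits);
      }

      @Override
      protected long ramBytesUsed(Filter filter) {
        // mirrors the described default: use Accountable when available,
        // otherwise fall back to an arbitrary constant
        if (filter instanceof Accountable) {
          return ((Accountable) filter).ramBytesUsed();
        }
        return 1024;
      }
    }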
FilterCachingPolicy has an implementation called
UsageTrackingFilterCachingPolicy that tries to provide sensible defaults:
- it tracks the 256 most recently used filters (through their hash codes)
globally (not per segment)
- it only caches on segments whose source is a merge or addIndexes (not
flushes)
- it uses some heuristics to decide how many times a filter should appear in
the history of 256 filters in order to be cached.
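To make the heuristic concrete, here is a toy version of such a policy. It is
deliberately simplified; the real UsageTrackingFilterCachingPolicy is more
nuanced, e.g. it also checks the segment's source, which is elided here:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;

    // Toy frequency-based policy: remember the hash codes of the last 256
    // filters that were used, and only cache filters that were seen at least
    // minFrequency times in that window.
    public class ToyUsageTrackingPolicy implements FilterCachingPolicy {

      private final int[] history = new int[256]; // ring buffer of hash codes
      private int cursor = 0;
      private int used = 0; // number of valid entries in the ring buffer
      private final int minFrequency;

      public ToyUsageTrackingPolicy(int minFrequency) {
        this.minFrequency = minFrequency;
      }

      @Override
      public synchronized void onUse(Filter filter) {
        history[cursor] = filter.hashCode();
        cursor = (cursor + 1) % history.length;
        used = Math.min(used + 1, history.length);
      }

      @Override
      public synchronized boolean shouldCache(Filter filter,
          LeafReaderContext context, DocIdSet set) throws IOException {
        final int hash = filter.hashCode();
        int count = 0;
        for (int i = 0; i < used; ++i) {
          if (history[i] == hash) {
            count++;
          }
        }
        return count >= minFrequency;
      }
    }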
The caching policy can be configured on a per-filter basis, so even if some
filters should be cached more aggressively than others, they can all be cached
in a single FilterCache instance.
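For example, reusing the hypothetical names from the sketches above, two
filters with different caching thresholds could share one cache:

    // Fragment: categoryFilter and dateRangeFilter are pre-existing filters.
    LRUFilterCache cache = new LRUFilterCache(256, 32 * 1024 * 1024);
    // frequently-used filter: cache after 2 recent uses
    Filter cachedCategory =
        cache.doCache(categoryFilter, new ToyUsageTrackingPolicy(2));
    // rarer filter: require 10 recent uses before caching
    Filter cachedDates =
        cache.doCache(dateRangeFilter, new ToyUsageTrackingPolicy(10));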
> Add a filter cache
> ------------------
>
> Key: LUCENE-6077
> URL: https://issues.apache.org/jira/browse/LUCENE-6077
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-6077.patch
>
>
> Lucene already has filter caching abilities through CachingWrapperFilter, but
> CachingWrapperFilter requires you to know which filters you want to cache
> up-front.
> Caching filters is not trivial. If you cache too aggressively, then you slow
> things down since you need to iterate over all documents that match the
> filter in order to load it into an in-memory cacheable DocIdSet. On the other
> hand, if you don't cache at all, you are potentially missing interesting
> speed-ups on frequently-used filters.
> It would be nice to have a generic filter cache that tracks usage of
> individual filters and decides whether or not to cache a filter on a given
> segment based on usage statistics and various heuristics, such as:
> - the overhead to cache the filter (for instance some filters produce
> DocIdSets that are already cacheable)
> - the cost to build the DocIdSet (the getDocIdSet method is very expensive
> on some filters such as MultiTermQueryWrapperFilter that potentially need to
> merge lots of postings lists)
> - the segment we are searching on (flush segments will likely be merged
> right away so it's probably not worth building a cache on such segments)