[
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665018#comment-13665018
]
Gilad Barkai commented on LUCENE-5015:
--------------------------------------
Sampling, with its defaults, has its toll.
In its defaults, Sampling aims to produce the exact top-K results for each
request, as if a {{StandardFacetAccumulator}} would have been used. Meaning it
aims at producing the same top-K with the same counts.
The process begins with sampling the result set and computers the top-*cK*
candidates for each of the *M* facet requests, producing amortized results.
That part is faster than {{StandardFacetAccumulator}} because less documents'
facets information gets processed.
The next part is the "fixing", using a {{SampleFixer}} retrieved from a
{{Sampler}}, in which "fixed" counts are produced which correlate better with
the original document result set, rather than the sampled one. The default (and
currently only implementation) for such fixer is {{TakmiSampleFixer}} which
produced _exact_ counts for each of the *cK* candidates for each of the *M*
facet requests. The counts are not computed against the facet information of
each document, but rather matching the skiplist of the drill-down term, of each
such candidate category with the bitset of the (actual) document results. The
amount of matches is the count.
This is equivalent to total-hit collector with a drilldown query for the
candidate category over original query.
There's tipping point in which not sampling is faster than sampling and fixing
using *c* x *K* x *M* skiplists matches against the bitset representing the
document results. *c* defaults to 2 (see overSampleFactor in SamplingParams);
Over-sampling (a.k.a *c*) is important for exact counts, as it is conceivable
that the accuracy of a sampled top-k is not 100%, but according to some
measures we once ran it is very likely that the true top-K results are within
the sampled *2K* results. Fixing those 2K with their actual counts and
re-sorting them accordingly yields much more accurate top-K.
E.g Requesting 5 count requests for top-10 with overSampleFactor of 2, results
in 100 skiplist matching against the document results bitset.
If amortized results suffice, a different {{SampleFixer}} could be coded -
which E.g amortize the true count from the sampling ration. E.g if category C
got count of 3, and the sample was of 1,000 results out of a 1,000,000 than the
"AmortizedSampleFixer" would fix the count of C to be 3,000.
Such fixing is very fast, and the overSampleFactor should be set to 1.0.
Edit:
I now see that it is not that easy to code a different SampleFixer, nor get it
the information needed for the amortized result fixing as suggested above.
I'll try to open the API some and make it more convenient.
> Unexpected performance difference between SamplingAccumulator and
> StandardFacetAccumulator
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-5015
> URL: https://issues.apache.org/jira/browse/LUCENE-5015
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/facet
> Affects Versions: 4.3
> Reporter: Rob Audenaerde
> Priority: Minor
>
> I have an unexpected performance difference between the SamplingAccumulator
> and the StandardFacetAccumulator.
> The case is an index with about 5M documents and each document containing
> about 10 fields. I created a facet on each of those fields. When searching to
> retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is
> about twice as fast as the StandardFacetAccumulator. This is expected and a
> nice speed-up.
> However, when I use more CountFacetRequests to retrieve facet-counts for more
> than one field, the speeds of the SampingAccumulator decreases, to the point
> where the StandardFacetAccumulator is faster.
> {noformat}
> FacetRequests Sampling Standard
> 1 391 ms 1100 ms
> 2 531 ms 1095 ms
> 3 948 ms 1108 ms
> 4 1400 ms 1110 ms
> 5 1901 ms 1102 ms
> {noformat}
> Is this behaviour normal? I did not expect it, as the SamplingAccumulator
> needs to do less work?
> Some code to show what I do:
> {code}
> searcher.search( facetsQuery, facetsCollector );
> final List<FacetResult> collectedFacets =
> facetsCollector.getFacetResults();
> {code}
> {code}
> final FacetSearchParams facetSearchParams = new FacetSearchParams(
> facetRequests );
> FacetsCollector facetsCollector;
> if ( isSampled )
> {
> facetsCollector =
> FacetsCollector.create( new SamplingAccumulator( new
> RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
> }
> else
> {
> facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
> facetSearchParams, searcher.getIndexReader(), taxo ) );
> {code}
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]