[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920822#comment-13920822
]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------
{quote}
How do you count the number of hits before you execute the search?
{quote}
Well, I do a search of course, but collect the hits using a
{{TotalHitCountCollector}} and do not retrieve any stored values. I did not find
any other way to determine for sure whether I needed to sample or not. I know
this takes time. When I first implemented it, however, it was faster to do a
count first and then decide whether to provide exact facets or to sample. Not
optimal, but it worked. And because I could use sampling in the facets, the
total time (1 pass counting, 1 pass sampling facets) was still much less than
the time it would take to compute exact facets and the count in one pass.

When only considering the {{samplingThreshold}}, faceting should still be
doable without counting first. It can be done by storing the first
{{samplingThreshold}} documents (in the addDoc) in a separate array (in the
collector) without sampling. This way the count is not needed to decide
whether to sample or not, as sampling always takes place. The sampled
result is only discarded if the total number of hits <= minSampleSize. I agree
that this is not the nicest way to get a sample (but it can reduce the time to
retrieve estimated facet results by a factor of 5 when using a sampling rate of
1 in 1000).
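
To make the buffering idea concrete, here is a rough sketch in plain Java. It is not the Lucene collector API and not the code from the patch: the class name, {{docsForFaceting}}, {{isSampled}} and the fixed {{sampleRate}} are made up for illustration, and it assumes a single threshold is used both for buffering and for deciding whether to discard the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Lucene's collector API: single-pass threshold sampling.
class ThresholdSamplingCollector {
    private final int samplingThreshold;
    private final double sampleRate;
    private final Random random;
    // First samplingThreshold hits, kept unsampled for the exact case.
    private final List<Integer> firstDocs = new ArrayList<>();
    // Random sample drawn from every hit, used when there are many hits.
    private final List<Integer> sampledDocs = new ArrayList<>();
    private int totalHits = 0;

    ThresholdSamplingCollector(int samplingThreshold, double sampleRate, long seed) {
        this.samplingThreshold = samplingThreshold;
        this.sampleRate = sampleRate;
        this.random = new Random(seed);
    }

    // Called once per matching document; no up-front count is needed.
    void addDoc(int docId) {
        totalHits++;
        if (firstDocs.size() < samplingThreshold) {
            firstDocs.add(docId);
        }
        if (random.nextDouble() < sampleRate) {
            sampledDocs.add(docId);
        }
    }

    int getTotalHits() {
        return totalHits;
    }

    // True when the hit count exceeded the threshold, so the sample is used.
    boolean isSampled() {
        return totalHits > samplingThreshold;
    }

    // Few hits: discard the sample and use the complete buffer (exact facets).
    // Many hits: use the sample (estimated facets).
    List<Integer> docsForFaceting() {
        return isSampled() ? sampledDocs : firstDocs;
    }

    public static void main(String[] args) {
        ThresholdSamplingCollector c = new ThresholdSamplingCollector(1000, 0.001, 42L);
        for (int doc = 0; doc < 2_000_000; doc++) {
            c.addDoc(doc);
        }
        System.out.println(c.getTotalHits() + " hits, faceting over "
                + c.docsForFaceting().size() + " docs, sampled=" + c.isSampled());
    }
}
```

The price is that the sampling work is done even when the result ends up being thrown away, and the buffer costs up to {{samplingThreshold}} extra entries of memory.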

The alternative is to do it more like in your snippet (and more like the first
approach): collect all documents and sample afterwards. This way you know the
number of hits, and adjusting the sample rate based on parameters is more
straightforward.
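
Again only as a sketch in plain Java (not the snippet referred to above and not Lucene API): once all hits are collected, the sample rate can be derived from the known hit count. The {{targetSampleSize}} parameter and the class name are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: collect everything first, then sample with a rate
// derived from the now-known number of hits.
class PostCollectionSampler {

    static List<Integer> sample(List<Integer> hits, int samplingThreshold,
                                int targetSampleSize, long seed) {
        // Few hits: keep them all, so the facets are exact.
        if (hits.size() <= samplingThreshold) {
            return new ArrayList<>(hits);
        }
        // Many hits: pick the rate so that roughly targetSampleSize docs remain.
        double rate = (double) targetSampleSize / hits.size();
        Random random = new Random(seed);
        List<Integer> sampled = new ArrayList<>();
        for (int doc : hits) {
            if (random.nextDouble() < rate) {
                sampled.add(doc);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < 1_000_000; doc++) {
            hits.add(doc);
        }
        List<Integer> sampled = sample(hits, 10_000, 1_000, 42L);
        System.out.println("sampled " + sampled.size() + " of " + hits.size() + " hits");
    }
}
```

Compared with the buffering variant, this holds all hits in memory until collection is done, but the rate can depend on the actual hit count.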
Either way is faster than using exact facets, so both ways are a win.
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?