[jira] [Commented] (LUCENE-5476) Facet sampling

Rob Audenaerde (JIRA) Wed, 05 Mar 2014 05:20:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920822#comment-13920822
 ]


Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

{quote}
How do you count the number of hits before you execute the search?
{quote}
Well, I do a search of course, but I collect the hits using a 
{{TotalHitCountCollector}} and do not retrieve any stored values. I did not find 
any other way to determine for sure whether I needed to sample or not. I know 
this takes time. When I first implemented it, however, it was faster to do a 
count first and then decide whether to provide exact facets or to sample. Not 
optimal, but it worked. And because I could use sampling in the facets, the 
total time (1 pass counting, 1 pass sampling facets) was still much less than 
the time it would take to compute exact facets and the count in one pass.

When only considering the {{samplingThreshold}}, faceting should still be 
doable without counting first. It can be done by storing the first 
{{samplingThreshold}} documents (in the addDoc) in a separate array (in the 
collector) without sampling. This way the count is not needed to decide 
whether to sample or not, because sampling always takes place. The sampled 
result is simply discarded if the total number of hits <= minSampleSize. I agree 
that this is not the nicest way to get a sample (but it can reduce the time to 
retrieve estimated facet results by a factor of 5 when using a sampling rate of 
1 in 1000).
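
A minimal sketch of the single-pass idea described above, in plain Java rather 
than against the Lucene collector API. The class {{BufferingSampler}} and its 
methods are hypothetical names used for illustration only, and a single 
{{samplingThreshold}} stands in for both the buffer size and the minimum sample 
size:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for a collector; this is NOT the Lucene Collector API. */
final class BufferingSampler {
    private final int samplingThreshold;   // buffer size and sample/no-sample cutoff (assumption)
    private final int sampleRate;          // keep roughly 1 in sampleRate documents
    private final Random random;
    private final List<Integer> firstDocs = new ArrayList<>();    // unsampled prefix
    private final List<Integer> sampledDocs = new ArrayList<>();  // always-on sample
    private int totalHits;

    BufferingSampler(int samplingThreshold, int sampleRate, long seed) {
        this.samplingThreshold = samplingThreshold;
        this.sampleRate = sampleRate;
        this.random = new Random(seed);
    }

    /** Called once per matching document. */
    void addDoc(int docId) {
        totalHits++;
        // Keep the first samplingThreshold documents untouched.
        if (firstDocs.size() < samplingThreshold) {
            firstDocs.add(docId);
        }
        // Sample unconditionally; the result is only used for large result sets.
        if (random.nextInt(sampleRate) == 0) {
            sampledDocs.add(docId);
        }
    }

    /** True when the result set turned out to be large enough to need sampling. */
    boolean isSampled() {
        return totalHits > samplingThreshold;
    }

    /** Exact documents for small result sets, the sample otherwise. */
    List<Integer> docsForFaceting() {
        return isSampled() ? sampledDocs : firstDocs;
    }

    int totalHits() {
        return totalHits;
    }
}
```

When the search ends with {{totalHits <= samplingThreshold}}, the buffer holds 
every hit, so the sample can be thrown away and exact facets computed from the 
buffer. Otherwise the buffer is dropped and the facets are estimated from the 
sample.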

The alternative is to do it more like in your snippet (and more like the first 
approach); collect all documents and sample afterwards. This way you know the 
number of hits and adjusting the sample rate based on parameters is more 
straightforward. 
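
A rough sketch of the collect-then-sample alternative, again in plain Java and 
not tied to the Lucene API. {{PostCollectionSampler}}, {{targetSampleSize}} and 
the way the rate is derived from the hit count are assumptions made for 
illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper; names and parameters are illustrative assumptions. */
final class PostCollectionSampler {

    /**
     * Returns all hits when there are few of them, otherwise a random subset
     * whose expected size is about targetSampleSize.
     */
    static List<Integer> sample(List<Integer> hits, int samplingThreshold,
                                int targetSampleSize, long seed) {
        // The hit count is known here, so the decision needs no extra pass.
        if (hits.size() <= samplingThreshold) {
            return new ArrayList<>(hits);
        }
        // Derive the sampling rate from the actual number of hits.
        double rate = Math.min(1.0, (double) targetSampleSize / hits.size());
        Random random = new Random(seed);
        List<Integer> sampled = new ArrayList<>();
        for (int docId : hits) {
            if (random.nextDouble() < rate) {
                sampled.add(docId);
            }
        }
        return sampled;
    }
}
```

The trade-off is that every matching document is held in memory before any 
sampling happens, in exchange for a rate that can be tuned to the real hit 
count.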

Either way is faster than using exact facets, so both ways are a win. 


> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

