[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919559#comment-13919559
 ] 

Shai Erera commented on LUCENE-5476:
------------------------------------

That's a good point, Gilad. I think once this gets into Lucene, other people 
will use it, and we should offer a good sampling collector that works in more 
than one extreme case (always tons of results), even if that restriction is 
well documented. One of the problems is that when you have a query Q, you don't 
know in advance how many documents it's going to match.

That's where min/maxDocsToEvaluate came in handy in the previous solution 
-- they made SamplingFC smart and adaptive. If the query matched very few 
documents, not only did it skip sampling altogether, saving CPU, it also didn't 
come up w/ a crappy sample (as Gilad says, 10 docs). The previous sampling 
worked on the entire query; the new collector can apply these thresholds 
per-segment.

But I feel that this has to give a qualitative solution -- the sample has to be 
meaningful in order to be considered representative at all, and we should let 
the app specify what "meaningful" means to it, in the form of 
minDocsToEvaluate(PerSegment).

And since sampling is about improving speed, we should also let the app specify 
a maxDocsToEvaluate(PerSegment), so a 1% sample still doesn't end up evaluating 
millions of documents.
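
To make the two thresholds concrete, here is a minimal sketch of how a sample 
size could be clamped between them. Only the names minDocsToEvaluate and 
maxDocsToEvaluate come from this discussion; the class, the method and its 
signature are hypothetical and are not taken from any of the attached patches. 
It assumes minDocsToEvaluate <= maxDocsToEvaluate.

```java
// Hypothetical helper, not part of Lucene or of the LUCENE-5476 patches.
final class SampleSizeSketch {

    /**
     * Decides how many of the matched documents to evaluate.
     *
     * @param matchedDocs       documents matched (whole query or one segment)
     * @param sampleRatio       requested sample ratio, e.g. 0.01 for 1%
     * @param minDocsToEvaluate smallest sample the app considers meaningful
     * @param maxDocsToEvaluate largest sample the app is willing to pay for
     */
    static int docsToEvaluate(int matchedDocs, double sampleRatio,
                              int minDocsToEvaluate, int maxDocsToEvaluate) {
        // Too few matches to sample meaningfully: evaluate all of them.
        if (matchedDocs <= minDocsToEvaluate) {
            return matchedDocs;
        }
        long target = Math.round(matchedDocs * sampleRatio);
        // Never go below the "meaningful" floor ...
        target = Math.max(target, minDocsToEvaluate);
        // ... and never above the cost ceiling.
        target = Math.min(target, maxDocsToEvaluate);
        // The sample can never be larger than the match set itself.
        return (int) Math.min(target, matchedDocs);
    }
}
```

With a 1% ratio, a floor of 100 and a ceiling of 100,000, a query matching 50 
documents evaluates all 50, one matching 5,000 evaluates 100 rather than 50, 
and one matching a billion evaluates 100,000 rather than 10 million.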

Robert, I agree w/ your comment on XORShiftRandom - it was a mistake to suggest 
moving it under core.

Rob, I feel like I've thrown you back and forth with the patch. If you want, I 
can take a stab at making the changes to SFC.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, 
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
