[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919559#comment-13919559
]
Shai Erera commented on LUCENE-5476:
------------------------------------
That's a good point, Gilad. I think once this gets into Lucene, other people
will use it, and we should offer a good sampling collector that works in more
than one extreme case (i.e., not only when there are always tons of results),
even if it's well documented. One of the problems is that when you have a query
Q, you don't know in advance how many documents it's going to match.
That's where min/maxDocsToEvaluate came in handy in the previous solution
-- it made SamplingFC smart and adaptive. If the query matched very few
documents, not only did it skip sampling and save CPU, it also didn't come up
w/ a crappy sample (as Gilad says, 10 docs). The previous sampling worked on
the entire query; the new collector can apply these thresholds per segment.
But I feel that this has to give a qualitative solution -- the sample has to be
meaningful in order to be considered representative at all, and we should let
the app specify what "meaningful" means to it, in the form of
minDocsToEvaluate(PerSegment).
And since sampling is about improving speed, we should also let the app specify
a maxDocsToEvaluate(PerSegment), so a 1% sample still doesn't end up evaluating
millions of documents.
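To illustrate the clamping described above, here is a minimal sketch of how
min/maxDocsToEvaluate could bound the number of documents a sampler actually
evaluates. This is not the real SamplingFacetsCollector API -- the class and
method names here are hypothetical, just to show the adaptive behavior:

```java
// Hypothetical sketch of the min/max clamping logic discussed above;
// not the actual SamplingFacetsCollector API.
public class SamplingPolicy {

    /**
     * Decide how many documents to actually evaluate for facet counting.
     *
     * @param totalHits  number of documents the query matched
     * @param sampleRate requested sampling ratio, e.g. 0.01 for a 1% sample
     * @param minDocs    below this, sampling is pointless: evaluate everything
     * @param maxDocs    above this, sampling stops saving time: cap the work
     */
    public static int docsToEvaluate(int totalHits, double sampleRate,
                                     int minDocs, int maxDocs) {
        if (totalHits <= minDocs) {
            // Too few hits for a meaningful sample -- just count them all.
            return totalHits;
        }
        int sampled = (int) (totalHits * sampleRate);
        // Clamp so the sample is both meaningful (>= minDocs)
        // and cheap (<= maxDocs).
        return Math.max(minDocs, Math.min(maxDocs, sampled));
    }

    public static void main(String[] args) {
        // Small result set: no sampling, evaluate all 100 hits.
        System.out.println(docsToEvaluate(100, 0.01, 1_000, 100_000));
        // Huge result set: a 1% sample of 50M would be 500K docs; cap at 100K.
        System.out.println(docsToEvaluate(50_000_000, 0.01, 1_000, 100_000));
        // Mid-size set: 1% of 20K is only 200 docs; raise to the 1K minimum.
        System.out.println(docsToEvaluate(20_000, 0.01, 1_000, 100_000));
    }
}
```

The same check could be applied per segment (using the segment's hit count
instead of the total) to get the per-segment variant mentioned above.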
Robert, I agree w/ your comment on XORShiftRandom - it was a mistake to suggest
moving it under core.
Rob, I feel like I've thrown you back and forth with the patch. If you want, I
can take a stab at making the changes to SFC.
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java,
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339, facet sampling disappeared.
> When displaying facet counts on large datasets (>10M documents), counting
> facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this cost and thus provided a nice speedup. Could
> it be brought back?
--
This message was sent by Atlassian JIRA
(v6.2#6252)