[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920897#comment-13920897
]
Shai Erera commented on LUCENE-5476:
------------------------------------
bq. Well, I do a search of course, but collect the hits using a
TotalHitCountCollector and not retrieve any stored values
That means that you evaluate the query twice, which is expensive ... but also,
this doesn't guarantee to provide a "correct" sample. So say you found out the
query matches 10M documents and you decide that 100K docs are a good sample,
you'll set the sampling ratio to 0.01 but then you apply this ratio per-segment
(as in this patch), and could easily end up with fewer than 100K docs (e.g. if
randomness didn't really pick 0.01 of documents in a certain segment).
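
A minimal sketch of that effect, assuming the per-segment sampler keeps each matching document independently with probability equal to the ratio (the patch itself may work differently). {{FixedRatioSampling}} and {{sampledCount}} are hypothetical names used only for illustration:

```java
import java.util.Random;

final class FixedRatioSampling {
    // Keeps each hit independently with probability `ratio`, one segment at a
    // time, and returns how many hits survive across all segments.
    static int sampledCount(int[] hitsPerSegment, double ratio, Random random) {
        int kept = 0;
        for (int hits : hitsPerSegment) {
            for (int i = 0; i < hits; i++) {
                if (random.nextDouble() < ratio) {
                    kept++;
                }
            }
        }
        return kept;
    }
}
```

Calling {{sampledCount(new int[] {400_000, 350_000, 250_000}, 0.01, new Random())}} averages 10,000 kept documents, but any single run can land above or below that figure, so a target sample size is not guaranteed.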
I don't think we should store the first minSampleSize docs in an int[] and move to a
bitset once we have collected more docs. First, the collector works per-segment and I
think sampling should work on the entire result set. So the int[] wouldn't be
part of MatchingDocs, it'd need to be held inside the collector and then you'll
need to know where to "cut" it for each MatchingDocs instance (per-segment).
I really think a simple solution is what we should start with. RandomSamplingFC
only overrides {{.getMatchingDocs()}} and it can determine if sampling is
needed or not, given {{minSampleSize}} and the sum of totalHits from all
MatchingDocs. Then you do sampling per-segment, but with the "global picture"
in mind, and you're able to correct the sample ratio so that we come as close
to {{minSampleSize}} as possible.
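
One possible way to sketch this idea in plain Java, without touching the Lucene API: {{SegmentHits}} is a hypothetical stand-in for a per-segment hit list (it is not Lucene's MatchingDocs), and the quota-correction scheme below is an assumption about how the ratio could be adjusted, not what the patch does. Each segment's quota is re-derived from what is still needed globally, so rounding in earlier segments is compensated in later ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for one segment's hits; this is not Lucene's MatchingDocs.
final class SegmentHits {
    final int[] docIds; // matching doc IDs in increasing order
    SegmentHits(int[] docIds) { this.docIds = docIds; }
}

final class GlobalRatioSampling {
    // Ratio derived from the global hit count; 1.0 means "no sampling needed".
    static double samplingRatio(List<SegmentHits> segments, int minSampleSize) {
        long totalHits = 0;
        for (SegmentHits s : segments) totalHits += s.docIds.length;
        return totalHits <= minSampleSize ? 1.0 : (double) minSampleSize / totalHits;
    }

    static List<SegmentHits> sample(List<SegmentHits> segments, int minSampleSize, Random random) {
        if (samplingRatio(segments, minSampleSize) >= 1.0) {
            return segments; // few enough hits: keep everything
        }
        long remainingHits = 0;
        for (SegmentHits s : segments) remainingHits += s.docIds.length;
        long remainingNeeded = minSampleSize;

        List<SegmentHits> sampled = new ArrayList<>();
        for (SegmentHits s : segments) {
            int hits = s.docIds.length;
            // Quota for this segment, based on what is still needed overall.
            int quota = hits == 0 ? 0
                    : (int) Math.round((double) remainingNeeded * hits / remainingHits);
            sampled.add(new SegmentHits(pick(s.docIds, quota, random)));
            remainingNeeded -= quota;
            remainingHits -= hits;
        }
        return sampled;
    }

    // Selection sampling: picks exactly `quota` entries uniformly, preserving order.
    private static int[] pick(int[] docIds, int quota, Random random) {
        int[] out = new int[quota];
        int chosen = 0;
        for (int i = 0; i < docIds.length && chosen < quota; i++) {
            if (random.nextDouble() * (docIds.length - i) < quota - chosen) {
                out[chosen++] = docIds[i];
            }
        }
        return out;
    }
}
```

In this sketch the sampled total lands on {{minSampleSize}} whenever sampling kicks in, while each segment still contributes roughly in proportion to its share of the hits. A real collector would also need to carry over scores and the original ratio for later count correction, which is left out here.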
To me, factoring a {{maxSampleSize}} into this issue would be a bonus, but I
can definitely see that happening in a separate issue, since it is a performance
concern. We should focus on giving our users a collector that produces a good
sample; otherwise the sampling isn't valuable.
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?