[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922411#comment-13922411 ]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------
Thanks,
{quote}
when !sampleNeeded() there's a call to super.getMatchingDocs(); this may be
a redundant method call, as we already call it 5 lines above, and the code always
computes the totalHits first. Perhaps the original matching docs could be stored
as a member? This would also help for some implementations of correcting the
sampled facet results.
totalHits is redundantly computed again in lines 147-152
{quote}
How could I have missed this... Must take a break I think.
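A minimal sketch of the "store the original matching docs as a member" idea, assuming nothing about the actual patch: {{SegmentHits}}, {{BaseCollector}} and {{SamplingCollector}} are hypothetical stand-ins rather than Lucene classes, and {{createSample}} is reduced to a placeholder. It only illustrates fetching the parent's matching docs once and computing totalHits once.

```java
import java.util.ArrayList;
import java.util.List;

class CachedMatchingDocsSketch {

    /** Hypothetical stand-in for per-segment matching docs; only the hit count matters here. */
    static final class SegmentHits {
        final int totalHits;
        SegmentHits(int totalHits) { this.totalHits = totalHits; }
    }

    /** Hypothetical stand-in for the parent collector. */
    static class BaseCollector {
        int calls = 0; // counts how often the (potentially expensive) lookup runs

        List<SegmentHits> getMatchingDocs() {
            calls++;
            List<SegmentHits> docs = new ArrayList<>();
            docs.add(new SegmentHits(40));
            docs.add(new SegmentHits(60));
            return docs;
        }
    }

    static class SamplingCollector extends BaseCollector {
        private final int sampleThreshold;
        private List<SegmentHits> originalDocs; // cached result of super.getMatchingDocs()
        private int totalHits = -1;             // computed once, together with the cache

        SamplingCollector(int sampleThreshold) { this.sampleThreshold = sampleThreshold; }

        private void ensureLoaded() {
            if (originalDocs == null) {
                originalDocs = super.getMatchingDocs();
                int sum = 0;
                for (SegmentHits hits : originalDocs) {
                    sum += hits.totalHits;
                }
                totalHits = sum;
            }
        }

        int getTotalHits() { ensureLoaded(); return totalHits; }

        boolean sampleNeeded() { return getTotalHits() > sampleThreshold; }

        /** The unsampled docs stay available, e.g. for correcting sampled counts later. */
        List<SegmentHits> getOriginalMatchingDocs() { ensureLoaded(); return originalDocs; }

        @Override
        List<SegmentHits> getMatchingDocs() {
            ensureLoaded();
            return sampleNeeded() ? createSample(originalDocs) : originalDocs;
        }

        /** Placeholder: real sampling is out of scope for this sketch. */
        private List<SegmentHits> createSample(List<SegmentHits> docs) { return docs; }
    }

    public static void main(String[] args) {
        SamplingCollector collector = new SamplingCollector(1000);
        collector.getMatchingDocs();
        collector.getMatchingDocs();
        System.out.println("parent lookups: " + collector.calls);
    }
}
```

With the cache in place, repeated calls to {{getMatchingDocs()}} or {{sampleNeeded()}} reuse the stored list and hit count instead of asking the parent again.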
{{createSample}}
I always take the first document, because I did not implement carry-over
across segments. If I picked a random index and that index were greater than
the number of documents in the segment, the segment would not be sampled at
all. This results in 'too few' sampled documents. Always taking the first
document might result in 'too many', but it gave a better overall distribution
and average.
Given your argument about not-so-random documents, and the fact that
carry-over should not be that hard, I think I should implement carry-over anyway.
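One way carry-over could look, as a rough sketch rather than the code in the attached patch: bins of {{binSize}} consecutive documents are laid out over the concatenation of all segments, one random document is picked per bin, and a pick that falls beyond the current segment is kept for the next segment instead of being dropped. The class name, the {{sample}} method and its parameters are assumptions made for illustration; no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class CarryOverSampling {

    /**
     * Picks one random document per bin of binSize consecutive documents,
     * where bins span the concatenation of all segments.
     * Returns, per segment, the segment-local indices of the picked documents.
     */
    static List<List<Integer>> sample(int[] segmentSizes, int binSize, long seed) {
        Random random = new Random(seed);
        List<List<Integer>> picksPerSegment = new ArrayList<>();
        long binStart = 0;                                // global index where the current bin starts
        long target = binStart + random.nextInt(binSize); // global index of the next pick
        long segmentStart = 0;                            // global index of this segment's first doc
        for (int size : segmentSizes) {
            List<Integer> picks = new ArrayList<>();
            // Take every pending pick that falls inside this segment.
            while (target < segmentStart + size) {
                picks.add((int) (target - segmentStart));
                binStart += binSize;
                target = binStart + random.nextInt(binSize);
            }
            // A pick beyond this segment is not dropped: it carries over to the next one.
            picksPerSegment.add(picks);
            segmentStart += size;
        }
        return picksPerSegment;
    }

    public static void main(String[] args) {
        // The middle segment (3 docs) is smaller than the bin size (5).
        List<List<Integer>> picks = sample(new int[] {10, 3, 7}, 5, 42L);
        for (int i = 0; i < picks.size(); i++) {
            System.out.println("segment " + i + " -> " + picks.get(i));
        }
    }
}
```

Because the bins ignore segment boundaries, a small segment neither forces a pick (the 'too many' case) nor silently loses one (the 'too few' case): every complete bin contributes exactly one document overall.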
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?