[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915796#comment-13915796
]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------
Thanks guys for the feedback (also on my language skills, I need to improve my
English ;))
{quote}
It might be good to allow passing the random seed, for repeatable results?
{quote}
Yes! This is very sensible for testing and for more 'stable' on-screen
results, so I will add this.
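
A minimal sketch of what passing a seed could look like. SeededSampler and
its sample() method are hypothetical names used for illustration only; they
are not part of the attached SamplingFacetsCollector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of the attached patch.
final class SeededSampler {
    /** Keeps each doc ID with probability `ratio`, using a caller-supplied seed. */
    static List<Integer> sample(int[] docIds, double ratio, long seed) {
        // The same seed yields the same sequence of draws, so the same docs are kept.
        Random random = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (int docId : docIds) {
            if (random.nextDouble() < ratio) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

Two calls with the same doc IDs, ratio and seed return the same sample, which
is what makes tests and displayed counts repeatable.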
{quote}
Another option, which would save the 2nd pass, would be to do the sampling
during Docs.addDoc.
{quote}
I considered sampling in 'addDocument', but I figured it would be more
expensive, because we would then need a random() calculation for each hit.
{quote}
I think SamplingFC.createDocs should return a declared SampledDocs (see later)
instead of anonymous class
{quote}
I also considered this. It is far better for clarity's sake, but it also costs
a copy of the original. I will try some approaches and make sure the sampling
is done only once.
{quote}
I like that this impl samples per-segment as it allows to tune the sample on a
per-segment basis. E.g. small segments (as in NRT) probably don't need to be
sampled at all. If we allow passing different parameters such as sampleRatio,
min/maxSampleSize, we could tune sampling per-segment.
{quote}
This was more or less by accident, but it does indeed seem useful. All
segments need the same sampling ratio though; otherwise it would be really
hard to correct the counts afterwards. (Or am I missing something here?)
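
To make "correcting the counts" concrete, here is a rough sketch. SampledCounts
and both methods are hypothetical names, not part of the patch. With one ratio
for all segments a single division is enough; with different ratios the
sampled counts would have to be kept per segment, which is an assumption the
current collector does not satisfy.

```java
// Hypothetical helper for illustration; not part of the attached patch.
final class SampledCounts {
    /** Scales a count observed in the sample up to an estimate over all hits. */
    static long estimate(long sampledCount, double samplingRatio) {
        if (samplingRatio <= 0.0 || samplingRatio > 1.0) {
            throw new IllegalArgumentException("samplingRatio must be in (0, 1]");
        }
        return Math.round(sampledCount / samplingRatio);
    }

    /**
     * Variant for per-segment ratios. It only works if the sampled count of
     * each segment is tracked separately (an assumption).
     */
    static long estimatePerSegment(long[] sampledCounts, double[] ratios) {
        if (sampledCounts.length != ratios.length) {
            throw new IllegalArgumentException("one ratio per segment is required");
        }
        double total = 0.0;
        for (int i = 0; i < sampledCounts.length; i++) {
            if (ratios[i] <= 0.0 || ratios[i] > 1.0) {
                throw new IllegalArgumentException("ratio must be in (0, 1]");
            }
            total += sampledCounts[i] / ratios[i];
        }
        return Math.round(total);
    }
}
```

The results are estimates, not exact counts; the error grows as the ratio
shrinks.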
{quote}
Maybe wrap all the parameters in a SamplingConfig?
{quote}
Yes. That is very useful and makes the API more stable.
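
A possible shape for such a class, as a sketch only. The field names follow
the parameters mentioned in this thread (sampleRatio, min/maxSampleSize, a
random seed); the class itself and its validation rules are assumptions, not
an existing Lucene API.

```java
// Hypothetical parameter holder for illustration; not an existing Lucene class.
final class SamplingConfig {
    final double sampleRatio;   // fraction of hits to keep, in (0, 1]
    final int minSampleSize;    // lower bound on the number of sampled docs
    final int maxSampleSize;    // upper bound on the number of sampled docs
    final long randomSeed;      // seed for repeatable sampling

    SamplingConfig(double sampleRatio, int minSampleSize, int maxSampleSize, long randomSeed) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        if (minSampleSize < 0 || maxSampleSize < minSampleSize) {
            throw new IllegalArgumentException("need 0 <= minSampleSize <= maxSampleSize");
        }
        this.sampleRatio = sampleRatio;
        this.minSampleSize = minSampleSize;
        this.maxSampleSize = maxSampleSize;
        this.randomSeed = randomSeed;
    }
}
```

Bundling the parameters this way means new options can be added later without
changing the collector's constructor signature.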
{quote}
The old implementation let you specify different parameters such as sample
size, minimum number of documents to evaluate, maximum number of documents to
evaluate etc
{quote}
The old-style sampling indeed had a fixed sample size, which I found very
useful. However, I have not yet found a way to implement this, as I do not
know the total number of results when I start faceting, so I cannot determine
the samplingRatio. I could of course count all results first, but that also
impacts performance, as I would need two passes. I will give it some more
thought, but maybe you have an idea on how to accomplish this in a better way?
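
One well-known technique for drawing a fixed-size sample in a single pass
without knowing the total up front is reservoir sampling (Algorithm R). The
sketch below only illustrates that technique; ReservoirSampler and its methods
are hypothetical and not part of the patch.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of reservoir sampling (Algorithm R); not part of the patch.
final class ReservoirSampler {
    private final int[] reservoir;
    private final Random random;
    private long seen = 0; // number of doc IDs offered so far

    ReservoirSampler(int sampleSize, long seed) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        this.reservoir = new int[sampleSize];
        this.random = new Random(seed);
    }

    /** Offers one hit; every hit seen so far is equally likely to be in the sample. */
    void offer(int docId) {
        if (seen < reservoir.length) {
            reservoir[(int) seen] = docId; // fill phase: keep everything
        } else {
            // Pick a slot uniformly from [0, seen]; replace only if it falls in the reservoir.
            long slot = (long) (random.nextDouble() * (seen + 1));
            if (slot < reservoir.length) {
                reservoir[(int) slot] = docId;
            }
        }
        seen++;
    }

    /** Returns the sampled doc IDs (not sorted by doc ID). */
    int[] sample() {
        return Arrays.copyOf(reservoir, (int) Math.min(seen, (long) reservoir.length));
    }

    /** The ratio that results once all hits are seen; usable for correcting counts. */
    double effectiveRatio() {
        return seen <= reservoir.length ? 1.0 : (double) reservoir.length / seen;
    }
}
```

Two caveats: this still draws one random number per hit once the reservoir is
full, and the sample is not in doc ID order, so it would need sorting before
the facet counts are computed.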
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)