[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919240#comment-13919240 ]
Shai Erera commented on LUCENE-5476:
------------------------------------
Looks good, Rob. I apologize for not mentioning this earlier, but now that
XORShift64Random is a public class, it has to have javadocs on all methods and
ctors, otherwise documentation linting will fail. Can you please add some in
your next patch?
About XORShift64Random.nextInt() -- modulo is a bit expensive, right? I wonder
if there's a way to generate that faster ... e.g. if SampledDocs did something
like {{random.randomLong() & (binsize-1)}}? I haven't fully thought through how
that changes the distribution of the generated numbers - hopefully it doesn't.
Would you mind giving it a try? And of course {{binsize-1}} can be computed once
in the ctor.
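A minimal sketch of the masking idea, to make the distribution question concrete. The {{XorShift64}} class below is a hypothetical stand-in, not the XORShift64Random from the patch, and the method names are assumptions for illustration. Masking with {{binsize-1}} gives the same values as a non-negative modulo only when binsize is a power of two; for other sizes the mask skips values (with binsize 10 the mask is binary 1001, so only 0, 1, 8 and 9 can come out).

```java
final class MaskVsModulo {
    /** Hypothetical stand-in generator; NOT the XORShift64Random from the patch. */
    static final class XorShift64 {
        private long x;

        XorShift64(long seed) {
            // A xorshift state must never be zero, so substitute a fixed constant.
            x = (seed == 0) ? 0x9E3779B97F4A7C15L : seed;
        }

        long randomLong() {
            x ^= (x << 21);
            x ^= (x >>> 35);
            x ^= (x << 4);
            return x;
        }
    }

    /** True if n is a positive power of two, i.e. masking with n-1 is safe. */
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    /** Index in [0, binSize) via modulo; floorMod keeps the result non-negative. */
    static int nextIndexModulo(XorShift64 random, int binSize) {
        return (int) Math.floorMod(random.randomLong(), (long) binSize);
    }

    /** Index in [0, mask] via bit mask; mask must be (power of two) - 1. */
    static int nextIndexMasked(XorShift64 random, int mask) {
        return (int) (random.randomLong() & mask);
    }
}
```

With a power-of-two bin size, both methods return identical indexes for the same generator state, so the mask does not change the distribution in that case. A bin size derived from an arbitrary sample ratio would first have to be rounded to a power of two, which slightly changes the effective ratio.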
Also, are you planning to write some unit tests? You can either start from one
of the existing tests or look at the old tests. I think starting a new one may
be easier. The key point is that in order to test sampling, we need to index many
documents to make the samples _count_. So e.g. we want to make sure that if we
give a 10% sample ratio, then a category's count is ~10% of the expected count.
In the old tests we had issues w/ false positives - tests that failed on these
asserts just because sampling is inherently non-deterministic. It would be good
if we can craft the test such that on one hand it does test sampling, but on
the other hand doesn't cause unwanted noise.
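One possible way to keep such a test quiet is to fix the seed and assert against a tolerance band rather than an exact count. The sketch below assumes independent per-document sampling (the patch may sample differently, e.g. one document per bin), and all names in it are hypothetical. The band is k standard deviations of a Binomial(n, ratio) count, so it widens or narrows with the number of documents.

```java
import java.util.Random;

final class SamplingToleranceSketch {
    /** Counts how many of n items are kept when each is kept with probability ratio. */
    static int sampledCount(int n, double ratio, long seed) {
        Random random = new Random(seed); // fixed seed makes the run reproducible
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (random.nextDouble() < ratio) {
                kept++;
            }
        }
        return kept;
    }

    /** Allowed absolute deviation: k standard deviations of Binomial(n, ratio). */
    static double tolerance(int n, double ratio, double k) {
        return k * Math.sqrt(n * ratio * (1.0 - ratio));
    }

    /** True if the observed count is within the tolerance band around n * ratio. */
    static boolean withinTolerance(int observed, int n, double ratio, double k) {
        return Math.abs(observed - n * ratio) <= tolerance(n, ratio, k);
    }
}
```

For 100,000 documents and a 10% ratio, a band of 5 standard deviations is about 474 documents around the expected 10,000, i.e. under 5% relative error. This is why a large index matters: with only 1,000 documents the same band would be roughly 47 around an expected 100.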
I do think we can optimize SampledDocs to not use FixedBitSet even in the case
of out-of-order collection (no scores) by keeping an int[] or some other
compressed array, especially when the sample ratio is so small. We can do that
later though - we need tests first.
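To illustrate why an int[] can win at small sample ratios, here is a rough sketch of a growable doc ID buffer plus a back-of-the-envelope size comparison. The class and method names are hypothetical and not part of the patch, and the size estimates ignore object and array headers. A bit set costs about maxDoc/8 bytes regardless of how many docs are sampled, while an int[] costs 4 bytes per sampled doc, so the int[] is smaller whenever fewer than 1 in 32 docs are kept.

```java
import java.util.Arrays;

final class SampledDocIdsSketch {
    private int[] docs = new int[16];
    private int size;

    /** Appends a sampled doc ID, doubling the backing array when it is full. */
    void add(int docId) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2);
        }
        docs[size++] = docId;
    }

    int size() {
        return size;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return docs[index];
    }

    /** Approximate bytes for a bit set covering maxDoc docs (one long per 64 docs). */
    static long bitSetBytes(int maxDoc) {
        return ((long) maxDoc + 63) / 64 * 8;
    }

    /** Approximate bytes for an int[] holding exactly the sampled doc IDs. */
    static long intArrayBytes(int sampledDocs) {
        return 4L * sampledDocs;
    }
}
```

Assuming 10M matching docs and a 1% sample, the bit set takes about 1,250,000 bytes and the int[] about 400,000 bytes; at a 0.1% sample the int[] drops to about 40,000 bytes while the bit set stays the same.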
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java,
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?