[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919240#comment-13919240
 ] 

Shai Erera commented on LUCENE-5476:
------------------------------------

Looks good, Rob. I apologize for not mentioning this earlier, but now that 
XORShift64Random is a public class, it needs javadocs on all methods and 
ctors; otherwise documentation linting will fail. Can you please add them in 
your next patch?

About XORShift64Random.nextInt() -- modulo is a bit expensive, right? I wonder 
if there's a way to generate that faster ... e.g. if SampledDocs did something 
like {{random.randomLong() & (binsize-1)}}? I haven't fully thought through how 
that changes the distribution of the generated numbers - hopefully it doesn't. 
Would you mind giving it a try? And of course {{binsize-1}} can be computed 
once in the ctor.
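
To make the modulo-versus-mask question concrete, here is a minimal sketch. It assumes a plain 64-bit xorshift generator (Marsaglia's 13/7/17 shift triple); the class and method names are hypothetical stand-ins, not the code in the patch. One point worth keeping in mind: the mask gives the same result as a non-negative modulo only when binsize is a power of two. For any other binsize some values can never be produced (with binsize 10 the mask is 9, binary 1001, so only 0, 1, 8 and 9 come out).

```java
/** Hypothetical stand-in for the patch's XORShift64Random; not the actual Lucene class. */
final class XorShift64Sketch {
  private long x;

  XorShift64Sketch(long seed) {
    // The xorshift state must never be zero, or the generator stays at zero forever.
    this.x = (seed == 0) ? 0x9E3779B97F4A7C15L : seed;
  }

  /** One step of a 64-bit xorshift generator (shift triple 13, 7, 17). */
  long nextLong() {
    x ^= x << 13;
    x ^= x >>> 7;
    x ^= x << 17;
    return x;
  }

  /** Modulo-based bound: works for any n > 0 and returns a value in [0, n). */
  int nextIntModulo(int n) {
    return (int) Math.floorMod(nextLong(), (long) n);
  }

  /** Mask-based bound: returns a value in [0, mask]; only uniform if mask + 1 is a power of two. */
  int nextIntMasked(int mask) {
    return (int) (nextLong() & mask);
  }

  /** Computes binSize - 1 once (e.g. in a ctor), rejecting sizes the mask trick cannot handle. */
  static int maskFor(int binSize) {
    if (binSize <= 0 || Integer.bitCount(binSize) != 1) {
      throw new IllegalArgumentException("binSize must be a positive power of two: " + binSize);
    }
    return binSize - 1;
  }
}
```

If binsize has to stay arbitrary, rounding it to a nearby power of two would change the effective sample ratio, so that trade-off would need to be measured rather than assumed.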

Also, are you planning to write some unit tests? You can either start from one 
of the existing tests or look at the old tests. I think starting fresh will be 
easier. The key point is that in order to test sampling, we need to index many 
documents so that the samples actually _count_. So e.g. we want to make sure 
that if we give a 10% sample ratio, then a category's count is ~10% of the 
expected count.

In the old tests we had issues w/ false positives - tests that failed on these 
asserts simply because sampling is non-deterministic by nature. It would be 
good if we can craft the test such that on the one hand it does test sampling, 
but on the other hand doesn't cause unwanted noise.
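
One way to get both properties is to assert against a tolerance band derived from the binomial standard deviation, and to drive the sampling from a reproducible seed. The sketch below is only an illustration under stated assumptions: it simulates sampling with java.util.Random instead of indexing documents, and the class name, the "every 4th document" category, and the choice of 6 standard deviations are all hypothetical.

```java
import java.util.Random;

/** Hypothetical simulation of a sampling test; it does not use any Lucene classes. */
final class SamplingToleranceSketch {

  /** Samples each doc with probability ratio and counts sampled docs in a made-up category (every 4th doc). */
  static int sampledCategoryCount(int numDocs, double ratio, long seed) {
    Random random = new Random(seed); // same seed, same sample, same count
    int count = 0;
    for (int doc = 0; doc < numDocs; doc++) {
      boolean sampled = random.nextDouble() < ratio;
      if (sampled && doc % 4 == 0) {
        count++;
      }
    }
    return count;
  }

  /** Allowed deviation: k standard deviations of a Binomial(n, ratio) count. */
  static double tolerance(int n, double ratio, double k) {
    return k * Math.sqrt(n * ratio * (1.0 - ratio));
  }

  /** True if the sampled count is close enough to ratio * trueCount. */
  static boolean withinTolerance(int sampledCount, int trueCount, double ratio, double k) {
    double expected = ratio * trueCount;
    return Math.abs(sampledCount - expected) <= tolerance(trueCount, ratio, k);
  }
}
```

With 1,000,000 documents, a true category count of 250,000 and a 10% ratio, the expected sampled count is 25,000 with a standard deviation of 150, so a band of 6 standard deviations is 900, or about 3.6% of the expected value. With only a few thousand documents the same band would be a much larger fraction of the count, which is why the test needs many documents.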

I do think we can optimize SampledDocs to not use FixedBitSet even in the case 
of out-of-order collection (no scores) by keeping an int[] or some other 
compressed array, especially when the sample ratio is so small. We can do that 
later though - we need tests first.
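
As a rough illustration of the int[] idea, the sketch below collects doc ids in arrival order and sorts them only when they are read back, so out-of-order collection is not a problem. The class and its helper methods are hypothetical and ignore per-segment handling. The size estimate is simple arithmetic: a bit set needs about maxDoc / 8 bytes regardless of how many docs are sampled, while an int[] needs 4 bytes per sampled doc, so the array is smaller when fewer than roughly 1 in 32 docs are kept.

```java
import java.util.Arrays;

/** Hypothetical sparse holder for sampled doc ids; not part of the patch. */
final class SparseDocIdsSketch {
  private int[] docs = new int[16];
  private int size;

  /** Appends a doc id; ids may arrive in any order. */
  void add(int doc) {
    if (size == docs.length) {
      docs = Arrays.copyOf(docs, size * 2); // grow by doubling
    }
    docs[size++] = doc;
  }

  /** Returns a sorted copy of the collected ids, e.g. for in-order iteration. */
  int[] sortedDocs() {
    int[] out = Arrays.copyOf(docs, size);
    Arrays.sort(out);
    return out;
  }

  /** Approximate payload of a bit set covering maxDoc docs: one 64-bit word per 64 docs. */
  static long bitSetBytes(int maxDoc) {
    return ((maxDoc + 63L) / 64L) * 8L;
  }

  /** Approximate payload of an int[] holding numSampled doc ids. */
  static long intArrayBytes(int numSampled) {
    return 4L * numSampled;
  }
}
```

For 10M documents the bit set is about 1.25 MB; sampling 1% (100,000 docs) needs about 400 KB as an int[], while sampling 10% would need about 4 MB, so the array only pays off at small ratios.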

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, 
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
