[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922884#comment-13922884
]
Shai Erera commented on LUCENE-5476:
------------------------------------
bq. but any facet accumulation which would rely on document scores would be hit
by the second as the scores
That's a great point Gilad. We need a test which covers that with the random
sampling collector.
bq. Is there a reason to add more randomness to one test?
It depends. I have a problem with numDocs=10,000 and the percentage being exactly
10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random
number
of documents to add some spice to the test. Since we're testing a random
sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef)
... this collector is all about randomness, so we should stress the randomness
done there. Given our test framework, randomness is not a big deal at all,
since once we get a test failure, we can deterministically reproduce the
failure (when there is no multi-threading). So I say YES, in this test I think
we should have randomness.
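The reproducibility point can be illustrated outside the test framework too. This is a minimal sketch (using plain {{java.util.Random}}, not the Lucene test infrastructure; the class and method names are hypothetical) showing that once a failing seed is recorded, re-running with that seed replays the exact same "random" sampling decisions:

```java
import java.util.Random;

public class SeedReplay {
    // Simulate one "sampling run": which of the first 20 hits are kept
    // at a ~10% sampling ratio, for a given seed.
    static boolean[] sampledHits(long seed) {
        Random random = new Random(seed);
        boolean[] kept = new boolean[20];
        for (int i = 0; i < kept.length; i++) {
            kept[i] = random.nextDouble() < 0.1; // keep ~10% of hits
        }
        return kept;
    }

    public static void main(String[] args) {
        // The test framework would print this seed on failure.
        long seed = new Random().nextLong();
        boolean[] firstRun = sampledHits(seed);
        boolean[] replay = sampledHits(seed); // re-run with the same seed
        System.out.println(java.util.Arrays.equals(firstRun, replay)); // true
    }
}
```

This is why (absent multi-threading) randomized tests cost little: the seed is the whole state.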
But e.g. when you add a test which ensures the collector works well w/ sampled
docs and scores, I don't think you should add randomness -- it's ok to test it
once.
Also, in terms of test coverage, there are other cases which I think would be
good if they were tested:
* Docs + Scores (discussed above)
* Multi-segment indexes (ensuring we work well there)
* Different number of hits per segment (to make sure our sampling on tiny
segments works well too)
* ...
I wouldn't for example use RandomIndexWriter because we're only testing search.
If we want many segments, we should commit/nrt-open every few documents, disable
the merge policy, etc. These can be separate, real "unit" tests.
bq. Sorry, I don't get what you mean by this.
I meant that if you set {{numDocs = atLeast(8000)}}, then the 10% sampler
should not be hardcoded to 1,000, but {{numDocs * 0.1}}.
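In code terms, the expected sample size should be derived from the randomized doc count rather than hardcoded. A sketch ({{atLeast}} here is a hypothetical stand-in for the test framework's helper of the same name):

```java
import java.util.Random;

public class ProportionalSample {
    // Stand-in for the test framework's atLeast(n): at least n, sometimes more.
    static int atLeast(Random random, int n) {
        return n + random.nextInt(n);
    }

    public static void main(String[] args) {
        Random random = new Random();
        int numDocs = atLeast(random, 8000); // randomized, not fixed at 10,000
        double samplingRatio = 0.1;
        // Not a hardcoded 1,000: scale with whatever numDocs came out to be.
        int expectedSampleSize = (int) (numDocs * samplingRatio);
        System.out.println(expectedSampleSize >= 800); // true
    }
}
```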
bq. the original totalHits .. is used
I think that's OK. In fact, if we don't record that, it would be hard to fix
the counts, no?
{quote}
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i
% 10) are hits. There is a REAL small chance that one of the five values will
be entirely missed when sampling. But that chance is 0.8 (chance not to take a
value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so that is probably not going
to happen
{quote}
Ahh thanks, I missed that. I agree it's very improbable that one of the values
is missing, but if we can avoid that at all, it's better. First, it's not just
one of the values -- we could be missing even 2, right? It really depends on the
randomness. I also find this assert redundant -- if we always expect 5, we
shouldn't assert that we received 5. And if we say that very infrequently we
might get <5 and we're OK with it ... what's the point of asserting it at all?
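The back-of-the-envelope number in the quote checks out: with 5 values spread evenly over the hits, each sampled hit has a 0.8 chance of not being any one particular value, and a union bound over the 5 values gives roughly 5 * 0.8^2000. A sketch (hypothetical class/method names) computing this in log space, since 0.8^2000 underflows a double:

```java
public class MissProbability {
    // log10 of an upper bound on P(at least one of the numValues facet values
    // is entirely absent from a sample of sampleSize hits), assuming the
    // values are spread evenly over the hits.
    static double log10MissBound(int numValues, int sampleSize) {
        double pNotValue = 1.0 - 1.0 / numValues;  // 0.8 for 5 values
        return sampleSize * Math.log10(pNotValue)  // log10(0.8^2000)
                + Math.log10(numValues);           // union bound factor of 5
    }

    public static void main(String[] args) {
        System.out.println(log10MissBound(5, 2000)); // ~ -193, i.e. ~10^-193
    }
}
```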
bq. I renamed the sampleThreshold to sampleSize. It currently picks a
samplingRatio that will reduce the number of hits to the sampleSize, if the
number of hits is greater.
It looks like it hasn't changed? I mean besides the rename. So if I set
sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is
that your intention?
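If that is indeed the intention, the ratio derivation would look roughly like this (a sketch with hypothetical names, not the patch's actual code): the ratio shrinks as the hit count grows, so the expected sample stays capped at sampleSize no matter how large the index is:

```java
public class SampleRatio {
    // Pick a sampling ratio that caps the expected sample at sampleSize.
    static double samplingRatio(long totalHits, long sampleSize) {
        if (totalHits <= sampleSize) {
            return 1.0; // fewer hits than the cap: keep everything
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        // 101K hits or 100M hits: the expected sample is ~100K either way.
        System.out.println(Math.round(101_000 * samplingRatio(101_000, 100_000)));         // 100000
        System.out.println(Math.round(100_000_000 * samplingRatio(100_000_000, 100_000))); // 100000
    }
}
```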
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
--
This message was sent by Atlassian JIRA
(v6.2#6252)