[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922884#comment-13922884
]
Shai Erera commented on LUCENE-5476:
------------------------------------
bq. but any facet accumulation which would rely on document scores would be hit
by the second as the scores
That's a great point Gilad. We need a test which covers that with the random
sampling collector.
bq. Is there a reason to add more randomness to one test?
It depends. I have a problem with numDocs=10,000 and the percentage being exactly
10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random
number
of documents to add some spice to the test. Since we're testing a random
sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef)
... this collector is all about randomness, so we should stress the randomness
done there. Given our test framework, randomness is not a big deal at all,
since once we get a test failure, we can deterministically reproduce the
failure (when there is no multi-threading). So I say YES, in this test I think
we should have randomness.
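The reproducibility point can be illustrated outside the test framework too. This is a minimal sketch (using plain {{java.util.Random}}, not the Lucene test infrastructure; the class and method names are hypothetical) showing that once a failing seed is recorded, re-running with that seed replays the exact same "random" sampling decisions:

```java
import java.util.Random;

public class SeedReplay {
    // Simulate one "sampling run": which of the first 20 hits are kept
    // at a ~10% sampling ratio, for a given seed.
    static boolean[] sampledHits(long seed) {
        Random random = new Random(seed);
        boolean[] kept = new boolean[20];
        for (int i = 0; i < kept.length; i++) {
            kept[i] = random.nextDouble() < 0.1; // keep ~10% of hits
        }
        return kept;
    }

    public static void main(String[] args) {
        // The test framework would print this seed on failure.
        long seed = new Random().nextLong();
        boolean[] firstRun = sampledHits(seed);
        boolean[] replay = sampledHits(seed); // re-run with the same seed
        System.out.println(java.util.Arrays.equals(firstRun, replay)); // true
    }
}
```

This is why (absent multi-threading) randomized tests cost little: the seed is the whole state.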
But e.g. when you add a test which ensures the collector works well w/ sampled
docs and scores, I don't think you should add randomness -- it's ok to test it
once.
Also, in terms of test coverage, there are other cases which I think would be
good if they were tested:
* Docs + Scores (discussed above)
* Multi-segment indexes (ensuring we work well there)
* Different number of hits per segment (to make sure our sampling on tiny
segments works well too)
* ...
I wouldn't for example use RandomIndexWriter because we're only testing search.
If we want many segments, we should commit/nrt-open every few documents, disable
the merge policy, etc. These can be separate, real "unit" tests.
bq. Sorry, I don't get what you mean by this.
I meant that if you set {{numDocs = atLeast(8000)}}, then the 10% sampler
should not be hardcoded to 1,000, but {{numDocs * 0.1}}.
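In code terms, the expected sample size should be derived from the randomized doc count rather than hardcoded. A sketch ({{atLeast}} here is a hypothetical stand-in for the test framework's helper of the same name):

```java
import java.util.Random;

public class ProportionalSample {
    // Stand-in for the test framework's atLeast(n): at least n, sometimes more.
    static int atLeast(Random random, int n) {
        return n + random.nextInt(n);
    }

    public static void main(String[] args) {
        Random random = new Random();
        int numDocs = atLeast(random, 8000); // randomized, not fixed at 10,000
        double samplingRatio = 0.1;
        // Not a hardcoded 1,000: scale with whatever numDocs came out to be.
        int expectedSampleSize = (int) (numDocs * samplingRatio);
        System.out.println(expectedSampleSize >= 800); // true
    }
}
```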
bq. the original totalHits .. is used
I think that's OK. In fact, if we don't record that, it would be hard to fix
the counts, no?
{quote}
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i
% 10) are hits. There is a REAL small chance that one of the five values will
be entirely missed when sampling. But that chance is 0.8 (chance not to take a
value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so that is probably not going
to happen
{quote}
Ahh thanks, I missed that. I agree it's very improbable that one of the values
is missing, but if we can avoid that at all, it's better. First, it's not just
one of the values -- we could be missing even 2, right? It really depends on the
randomness. I also find this assert redundant -- if we always expect 5, we
shouldn't assert that we received 5. And if we say that very infrequently we
might get <5 and we're OK with it ... what's the point of asserting it at all?
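The back-of-the-envelope number in the quote checks out: with 5 values spread evenly over the hits, each sampled hit has a 0.8 chance of not being any one particular value, and a union bound over the 5 values gives roughly 5 * 0.8^2000. A sketch (hypothetical class/method names) computing this in log space, since 0.8^2000 underflows a double:

```java
public class MissProbability {
    // log10 of an upper bound on P(at least one of the numValues facet values
    // is entirely absent from a sample of sampleSize hits), assuming the
    // values are spread evenly over the hits.
    static double log10MissBound(int numValues, int sampleSize) {
        double pNotValue = 1.0 - 1.0 / numValues;  // 0.8 for 5 values
        return sampleSize * Math.log10(pNotValue)  // log10(0.8^2000)
                + Math.log10(numValues);           // union bound factor of 5
    }

    public static void main(String[] args) {
        System.out.println(log10MissBound(5, 2000)); // ~ -193, i.e. ~10^-193
    }
}
```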
bq. I renamed the sampleThreshold to sampleSize. It currently picks a
samplingRatio that will reduce the number of hits to the sampleSize, if the
number of hits is greater.
It looks like it hasn't changed? I mean besides the rename. So if I set
sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is
that your intention?
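If that is indeed the intention, the ratio derivation would look roughly like this (a sketch with hypothetical names, not the patch's actual code): the ratio shrinks as the hit count grows, so the expected sample stays capped at sampleSize no matter how large the index is:

```java
public class SampleRatio {
    // Pick a sampling ratio that caps the expected sample at sampleSize.
    static double samplingRatio(long totalHits, long sampleSize) {
        if (totalHits <= sampleSize) {
            return 1.0; // fewer hits than the cap: keep everything
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        // 101K hits or 100M hits: the expected sample is ~100K either way.
        System.out.println(Math.round(101_000 * samplingRatio(101_000, 100_000)));         // 100000
        System.out.println(Math.round(100_000_000 * samplingRatio(100_000_000, 100_000))); // 100000
    }
}
```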
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
--
This message was sent by Atlassian JIRA
(v6.2#6252)