[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930111#comment-13930111
]
Shai Erera commented on LUCENE-5476:
------------------------------------
* Javadocs:
** From the class javadocs: _Note: the original set of hits will be available
as documents..._
I think we should just write "the original set of hits can be retrieved from
getOriginal.." - I don't want anyone to be confused with the wording "will be
available as documents".
** Can you make NOT_CALCULATED static?
** Typo: samplingRato
** randomSeed: I think it should say "if 0..." not "if null".
* getMatchingDocs -- can you move the totalHits calculation to
{{getTotalHits()}}? And then call it only {{if (sampledDocs==null)}}?
* needsSampling -- I know it was suggested to make it protected for inheritance
purposes, but as it looks now, all members are private so I don't see how can
one extend the class only to override this method (e.g. he won't have access to
sampleSize even). Maybe we should keep it private and when someone asks to
extend, we know better what needs to be protected? For instance, I think it's
also important that we allow overriding createSampledDocList, but for now let's
keep it simple.
* I think that we need to document somewhere (maybe in class javadocs) that the
returned sampled docs may include empty MatchingDocs instances (i.e. when no
docs were "sampled" from a segment). Just so that we don't surprise anyone with
empty instances. If people work w/ MatchingDocs as they should, by obtaining an
iterator, it shouldn't be a problem, but better document it explicitly.
** Perhaps we should also say something about the returned
MatchingDocs.totalHits, which are the original totalHits and not the sampled
set size?
* About carryOver:
** Doesn't it handle the TODO at the beginning of createSample?
** Why does it need to always include the first document of the first segment?
Rather you could initialize it to -1 and if {{carryOver == -1}} set it to a
random index within that bin? Seems more "random" to me.
* amortizedFacetCounts:
** It's a matter of style so I don't mind if you keep this like you wrote, but
I would write it as {{if (!needsSampling() || res == null)}} and then the rest
of the method isn't indented. Your call.
** I think it's better to allocate {{childPath}} so that the first element is
already the dimension. See what FacetsConfig.pathToString currently does.
Currently it results in re-allocating the String[] for every label. Then you
can call the pathToString variant which takes the array and the length.
*** Separately, would be good if FacetsConfig had an appendElement(Appendable,
int idx) to append a specific element to the appendable. Then you could use a
StringBuilder once, and skip the encoding done for the entire path except the
last element.
** Perhaps we should cap this {{(int) (res.value.doubleValue() /
this.samplingRate)}} by e.g. the number of non-deleted documents?
* About test:
** This comment is wrong: _//there will be 5000 hits._
** Why do you initialize your own Random? It's impossible to debug the test
like that. You should use {{random()}} instead.
** This comment is wrong? _//because a randomindexer is used, the results will
vary each time._ -- it's not because the random indexer, but because of random
sampling no?
** Would you mind formatting the test code? I see empty lines, curly braces
that are sometimes open in the next line etc. I can do it too before I commit
it.
* I see that createSample still has the scores array bug -- it returns the
original scores array, irrespective of the sampled docs. Before you fix it, can
you please add a testcase which covers scores and fails?
I think we're very close!
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java,
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]