[jira] [Commented] (LUCENE-5476) Facet sampling

Shai Erera (JIRA) Tue, 11 Mar 2014 02:23:23 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930111#comment-13930111
 ]


Shai Erera commented on LUCENE-5476:
------------------------------------

* Javadocs:
** From the class javadocs: _Note: the original set of hits will be available 
as documents..._
I think we should just write "the original set of hits can be retrieved from 
getOriginal.." - I don't want anyone to be confused with the wording "will be 
available as documents".
** Can you make NOT_CALCULATED static?
** Typo: samplingRato
** randomSeed: I think it should say "if 0..." not "if null".

* getMatchingDocs -- can you move the totalHits calculation to 
{{getTotalHits()}}? And then call it only {{if (sampledDocs==null)}}?

* needsSampling -- I know it was suggested to make it protected for inheritance 
purposes, but as it looks now, all members are private so I don't see how can 
one extend the class only to override this method (e.g. he won't have access to 
sampleSize even). Maybe we should keep it private and when someone asks to 
extend, we know better what needs to be protected? For instance, I think it's 
also important that we allow overriding createSampledDocList, but for now let's 
keep it simple.

* I think that we need to document somewhere (maybe in class javadocs) that the 
returned sampled docs may include empty MatchingDocs instances (i.e. when no 
docs were "sampled" from a segment). Just so that we don't surprise anyone with 
empty instances. If people work w/ MatchingDocs as they should, by obtaining an 
iterator, it shouldn't be a problem, but better document it explicitly.
** Perhaps we should also say something about the returned 
MatchingDocs.totalHits, which are the original totalHits and not the sampled 
set size?

* About carryOver:
** Doesn't it handle the TODO at the beginning of createSample?
** Why does it need to always include the first document of the first segment? 
Rather you could initialize it to -1 and if {{carryOver == -1}} set it to a 
random index within that bin? Seems more "random" to me.

* amortizedFacetCounts:
** It's a matter of style so I don't mind if you keep this like you wrote, but 
I would write it as {{if (!needsSampling() || res == null)}} and then the rest 
of the method isn't indented. Your call.
** I think it's better to allocate {{childPath}} so that the first element is 
already the dimension. See what FacetsConfig.pathToString currently does. 
Currently it results in re-allocating the String[] for every label. Then you 
can call the pathToString variant which takes the array and the length.
*** Separately, would be good if FacetsConfig had an appendElement(Appendable, 
int idx) to append a specific element to the appendable. Then you could use a 
StringBuilder once, and skip the encoding done for the entire path except the 
last element.
** Perhaps we should cap this {{(int) (res.value.doubleValue() / 
this.samplingRate)}} by e.g. the number of non-deleted documents?

* About test:
** This comment is wrong: _//there will be 5000 hits._
** Why do you initialize your own Random? It's impossible to debug the test 
like that. You should use {{random()}} instead.
** This comment is wrong? _//because a randomindexer is used, the results will 
vary each time._ -- it's not because the random indexer, but because of random 
sampling no?
** Would you mind formatting the test code? I see empty lines, curly braces 
that are sometimes open in the next line etc. I can do it too before I commit 
it.

* I see that createSample still has the scores array bug -- it returns the 
original scores array, irrespective of the sampled docs. Before you fix it, can 
you please add a testcase which covers scores and fails?

I think we're very close!

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, 
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-5476) Facet sampling

Reply via email to