[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915650#comment-13915650
]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------
I'm currently experimenting with this. To increase the speed, it seems logical
to me that the {{FacetsCollector}} needs to return fewer hits. I have a slightly
modified version that I will attach.
It uses a sampling technique that divides the total hits into 'bins' of a
given size and takes one sample from each bin. I have implemented it by keeping
that one sample as a 'hit' of the search if it was a hit, and clearing all other
bits. See the attached file.
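To make the idea concrete, here is a minimal sketch of the binning step. It is
not the attached patch: it assumes the hits are held in a plain
{{java.util.BitSet}} indexed by document ID and that each bin spans
{{binSize}} consecutive document IDs. The class name {{BinSampler}} and its
parameters are hypothetical and are not part of the Lucene API.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Lucene.
final class BinSampler {

    /**
     * Returns a new BitSet in which every bin of binSize consecutive
     * document IDs keeps at most one bit: the randomly picked position,
     * and only if that position was set in the original hits.
     */
    static BitSet sample(BitSet hits, int maxDoc, int binSize, Random random) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        BitSet sampled = new BitSet(maxDoc);
        for (int binStart = 0; binStart < maxDoc; binStart += binSize) {
            // The last bin may be shorter than binSize.
            int binEnd = Math.min(binStart + binSize, maxDoc);
            // Pick one position in the bin uniformly at random.
            int pick = binStart + random.nextInt(binEnd - binStart);
            // Keep the pick only if it was a hit; all other bits stay cleared.
            if (hits.get(pick)) {
                sampled.set(pick);
            }
        }
        return sampled;
    }
}
```

With this scheme each hit survives with probability of roughly 1/binSize, so
the sampled set stays a subset of the original hits.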
By using this technique the distribution of the results should not be altered
too much, while the performance gains can be significant.
A quick test revealed that for 1M results and a bin size of 500, the sampled
version is twice as fast.
The problem is that the resulting {{FacetResult}}s are not correct, as the
number of hits is reduced. This can be fixed afterwards for counting facets by
multiplying by the bin size; but for other facets it will be more difficult or
will require other approaches.
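For counting facets, the correction could look roughly like the sketch below.
It assumes the sampled counts are available as a simple label-to-count map;
{{SampledCountScaler}} is a hypothetical name, not a Lucene class, and the
scaled values are estimates rather than exact counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
final class SampledCountScaler {

    /**
     * Estimates the full counts by multiplying every sampled count by the
     * bin size. Uses long to avoid integer overflow on large result sets.
     */
    static Map<String, Long> scale(Map<String, Integer> sampledCounts, int binSize) {
        Map<String, Long> estimated = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : sampledCounts.entrySet()) {
            estimated.put(entry.getKey(), (long) entry.getValue() * binSize);
        }
        return estimated;
    }
}
```

Labels that happen to receive no sampled hits stay at zero after scaling, so
rare facet values can be missed entirely.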
What do you think?
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?