[jira] [Commented] (LUCENE-8681) Prorated early termination

Robert Muir (JIRA) Fri, 15 Feb 2019 14:12:50 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769792#comment-16769792
 ]


Robert Muir commented on LUCENE-8681:
-------------------------------------

I agree with the first suggestion, too. Let's put it beside totalHitsThreshold 
in TopFieldCollector which is the same thing, only for counts. And add your 
sugar method to make this easier to use. I wonder if either of these has really 
has good reason to be a threshold in our public API, they seem pretty binary to 
me (exact or inexact counts, exact or inexact results) and could just be 
combined into a single enum with 3 values.

Just some notes on your other ideas about "config class":

{quote}
we don't have optional args and default args in Java
{quote}

Optional args are possible with additional signatures that plumb defaults like 
today. I dislike "config" classes where we can have reasonable defaults. 
Additional classes forces additional indirection because now you have to go 
look somewhere else for how to use the thing: its harder. So I don't think a 
user should have to deal with this to pass "basic" stuff like the number and 
order of hits. Maybe it makes sense for expert parameters, so only experts have 
to deal with it.

But how many expert parameters are we talking about? doMaxScore and 
trackTotalHits (did you mean totalHitsThreshold?) aren't parameters in master. 
So the only existing "expert" option we have left is really doDocScores. Its a 
strange one anyway, perhaps it could be solved better by a special SortField or 
something like that, or could simply be removed from here to simplify 
IndexSearcher (require user to make CollectorManager, or maybe plumb it to your 
new proposed static method)


> Prorated early termination
> --------------------------
>
>                 Key: LUCENE-8681
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8681
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Mike Sokolov
>            Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8681) Prorated early termination

Reply via email to