[ 
https://issues.apache.org/jira/browse/SOLR-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437462#comment-17437462
 ] 

Michael Gibney commented on SOLR-15760:
---------------------------------------

This change is admittedly minor and arcane; but I was working with this part of 
the code and trying to leverage the overrequest function/heuristic in different 
contexts, and found that I was confused by it and reluctant to use it in its 
current state.

It's also of course possible that I'm missing something here; so please let me 
know if you think I'm misunderstanding the current situation or finding a 
problem where there isn't one ...

The example function above -- {{f\(x)=x+(90/(7+x))}} -- was chosen somewhat 
arbitrarily. The goal was for its input and output values to be comparable to 
those of the function/heuristic it replaces, while being more predictable and 
more consistently applied. (For non-negative integer inputs, and with integer 
division, the above is monotonically non-decreasing, with {{f\(0..6)=12}}, and 
decays to the identity function for {{x>=84}}.)
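
To make the stated properties easy to check, here is a minimal standalone 
sketch of the example function. It is not Solr code: the class name 
{{ProposedOverrequest}} and the method name {{f}} are hypothetical, and it 
assumes that {{x}} is a non-negative count of requested values and that the 
division is Java's truncating integer division.

```java
// Hypothetical standalone sketch; not part of Solr.
final class ProposedOverrequest {

  // f(x) = x + 90/(7+x), evaluated with truncating long division.
  // Assumption: x is the non-negative number of requested values.
  static long f(long x) {
    if (x < 0) {
      throw new IllegalArgumentException("x must be non-negative: " + x);
    }
    return x + 90 / (7 + x);
  }

  public static void main(String[] args) {
    long prev = f(0);
    for (long x = 0; x <= 100; x++) {
      long y = f(x);
      // Fail loudly if a larger input ever yields a smaller output.
      if (y < prev) {
        throw new AssertionError("not monotonically non-decreasing at x=" + x);
      }
      prev = y;
      // Print a few representative points, including both ends of the range.
      if (x <= 7 || x == 20 || (x >= 83 && x <= 85)) {
        System.out.println("f(" + x + ") = " + y + " (boost " + (y - x) + ")");
      }
    }
  }
}
```

Under these assumptions the boost shrinks from 12 at {{x=0}} to 1 at {{x=83}} 
and to 0 from {{x=84}} onward, where {{90/(7+x)}} truncates to zero.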

> Improve default distributed facet overrequest function/heuristic
> ----------------------------------------------------------------
>
>                 Key: SOLR-15760
>                 URL: https://issues.apache.org/jira/browse/SOLR-15760
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: main (9.0)
>            Reporter: Michael Gibney
>            Priority: Minor
>
> In {{FacetFieldProcessor}} for distributed requests, additive 
> {{facet.overrequest}} can be specified as an integer. In the absence of an 
> explicitly specified value, a default overrequest is calculated according to 
> the following function/heuristic:
> {code:java}
>       if (fcontext.isShard()) {
>         if (freq.overrequest == -1) {
>           // add over-request if this is a shard request and if we have a
>           // small offset (large offsets will already be gathering many more
>           // buckets than needed)
>           if (freq.offset < 10) {
>             // default: add 10% plus 4 (to overrequest for very small limits)
>             effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
>           }
>         }...
>       } ...
> {code}
> The logic of overrequesting is: there is some balance of redundant vs. unique 
> values across different shards in a distributed context, with uneven 
> distribution across shards. This can result in a situation where values with 
> uniform distribution across shards may be erroneously excluded from results, 
> in favor of values that may occur less frequently overall, but that (due to 
> uneven distribution) occur very frequently on some nodes and not on others.
> "Refinement" is designed to ensure that all top-level values are informed by 
> data from all shards -- but in the situation that overrequesting is designed 
> to mitigate, refinement can't help for values that are excluded even from the 
> initial request phase.
> The main issue is casting a wide-enough initial net. As such, the problem is 
> more acute (calling for more of an "overrequest boost") when requesting 
> relatively small numbers of values.
> The problem with the existing code is that, if I understand correctly, there 
> is no reason, from the point of view of the overrequest function/heuristic, 
> to make a distinction between {{offset}} and {{limit}}. In terms of the need 
> for overrequest, 
> "offset=10, limit=10" should be equivalent to "limit=20" -- but the existing 
> code treats these requests differently: adding overrequest to the latter, but 
> not the former.
> I propose:
> At a minimum, for logical consistency, default distributed facet overrequest 
> should be converted to be a function/heuristic based on {{offset+limit}}.
> Having thought through this somewhat, I suggest that the overrequest function 
> should be a monotonically non-decreasing function (to avoid having an 
> arbitrary inflection point where requesting _more_ values disables 
> overrequest and results in _fewer_ values being requested in the initial 
> phase). This function could simply be a static boost (e.g., {{f\(x)=x+10}}), 
> or could be a function that decays to the identity function for higher 
> numbers of requested values where overrequest becomes unnecessary (e.g., 
> {{f\(x)=x+(90/(7+x))}}).
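
The inconsistency described in the issue can be illustrated with a small 
standalone sketch. It is not Solr code: the class and method names are 
hypothetical, and it assumes that a shard needs to gather {{offset}} plus the 
effective limit in total, which is a simplification of what 
{{FacetFieldProcessor}} actually does.

```java
// Hypothetical standalone sketch; not part of Solr.
final class OverrequestComparison {

  // Mirrors the quoted default heuristic: boost limit only when offset < 10.
  static long existingEffectiveLimit(long offset, long limit) {
    long effectiveLimit = limit;
    if (offset < 10) {
      effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
    }
    return effectiveLimit;
  }

  // Assumption: the shard gathers offset + effectiveLimit buckets in total.
  static long existingTotal(long offset, long limit) {
    return offset + existingEffectiveLimit(offset, limit);
  }

  // Proposed alternative: one function of offset + limit (integer division).
  static long proposedTotal(long offset, long limit) {
    long x = offset + limit;
    return x + 90 / (7 + x);
  }

  public static void main(String[] args) {
    long[][] requests = {{0, 20}, {9, 10}, {10, 10}};
    for (long[] r : requests) {
      System.out.println("offset=" + r[0] + ", limit=" + r[1]
          + ": existing=" + existingTotal(r[0], r[1])
          + ", proposed=" + proposedTotal(r[0], r[1]));
    }
  }
}
```

Under these assumptions, the existing heuristic gathers 26 buckets for 
"limit=20" but only 20 for "offset=10, limit=10", and moving from offset=9 to 
offset=10 (with limit=10) drops the total from 24 to 20. The function of 
{{offset+limit}} gives 23 for both of the first two requests and never 
decreases as more values are requested.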



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
