Michael Gibney created SOLR-15760:
-------------------------------------

             Summary: Improve default distributed facet overrequest function/heuristic
                 Key: SOLR-15760
                 URL: https://issues.apache.org/jira/browse/SOLR-15760
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Facet Module
    Affects Versions: main (9.0)
            Reporter: Michael Gibney


In {{FacetFieldProcessor}} for distributed requests, additive 
{{facet.overrequest}} can be specified as an integer. In the absence of an 
explicitly specified value, a default overrequest is calculated according to 
the following function/heuristic:

{code:java}
      if (fcontext.isShard()) {
        if (freq.overrequest == -1) {
          // add over-request if this is a shard request and if we have a small
          // offset (large offsets will already be gathering many more buckets
          // than needed)
          if (freq.offset < 10) {
            // default: add 10% plus 4 (to overrequest for very small limits)
            effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
          }
        }...
      } ...
{code}

The logic of overrequesting is: in a distributed context there is some balance 
of redundant vs. unique values across different shards, and values may be 
unevenly distributed across shards. This can result in a situation where values 
with uniform distribution across shards are erroneously excluded from results, 
in favor of values that occur less frequently overall, but that (due to uneven 
distribution) occur very frequently on some nodes and not on others.

"Refinement" is designed to ensure that all top-level values are informed by 
data from all shards -- but in the situation that overrequesting is designed to 
mitigate, refinement can't help for values that are excluded even from the 
initial request phase.
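
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Solr code: the class, the method names, and the counts are all made up for illustration. Two shards each return only their local top N buckets, and the merge step can only see what the shards returned. The value {{x}} is evenly distributed and has the highest overall count, but it is dropped unless each shard is asked for more buckets than the final limit:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in Solr.
class ShardTopNSketch {

  /** Returns the top n buckets by count (ties broken by value), in sorted order. */
  static Map<String, Long> topN(Map<String, Long> counts, int n) {
    Comparator<Map.Entry<String, Long>> byCountDesc =
        Map.Entry.<String, Long>comparingByValue().reversed();
    Comparator<Map.Entry<String, Long>> order =
        byCountDesc.thenComparing(Map.Entry.<String, Long>comparingByKey());
    return counts.entrySet().stream()
        .sorted(order)
        .limit(n)
        .collect(Collectors.toMap(
            Map.Entry::getKey, Map.Entry::getValue, Long::sum, LinkedHashMap::new));
  }

  /** Merges each shard's local top perShardLimit buckets, then takes the overall top limit. */
  static List<String> mergedTop(List<Map<String, Long>> shards, int limit, int perShardLimit) {
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> shard : shards) {
      // Only buckets that a shard actually returned can become candidates.
      topN(shard, perShardLimit).forEach((value, count) -> merged.merge(value, count, Long::sum));
    }
    return new ArrayList<>(topN(merged, limit).keySet());
  }

  /** Made-up data: "x" has the highest total count (120) but ranks third on each shard. */
  static List<Map<String, Long>> exampleShards() {
    Map<String, Long> shardA = Map.of("a1", 100L, "a2", 90L, "x", 60L);
    Map<String, Long> shardB = Map.of("b1", 100L, "b2", 90L, "x", 60L);
    return List.of(shardA, shardB);
  }

  public static void main(String[] args) {
    List<Map<String, Long>> shards = exampleShards();
    System.out.println("no overrequest:   " + mergedTop(shards, 2, 2));
    System.out.println("with overrequest: " + mergedTop(shards, 2, 3));
  }
}
```

With a per-shard limit of 2, {{x}} never becomes a candidate, so there is nothing for a later refinement phase to refine. Raising the per-shard limit to 3 is enough for {{x}} to surface with its merged count of 120.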

The main issue is casting a wide-enough initial net. As such, the problem is 
more acute (calling for more of an "overrequest boost") when requesting 
relatively small numbers of values.

The problem with the existing code is that, if I understand correctly, there is 
no reason, from the point of view of the overrequest function/heuristic, to 
make a distinction between {{offset}} and {{limit}}. In terms of the need for 
overrequest, "offset=10, limit=10" should be equivalent to "limit=20" -- but 
the existing code treats these requests differently: adding overrequest to the 
latter, but not the former.
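
The discrepancy can be reproduced with a standalone sketch of the default branch quoted above. The class and method names here are hypothetical, and the sketch assumes {{effectiveLimit}} starts out equal to the requested {{limit}}; only the {{offset < 10}} check and the {{* 1.1 + 4}} arithmetic are taken from the quoted snippet:

```java
// Hypothetical standalone model of the quoted default-overrequest branch.
class CurrentOverrequestSketch {

  /**
   * Mirrors the quoted logic for a shard request with no explicit overrequest.
   * Assumption: effectiveLimit is initialized from the requested limit.
   */
  static long defaultEffectiveLimit(long offset, long limit) {
    long effectiveLimit = limit;
    if (offset < 10) {
      // default: add 10% plus 4 (to overrequest for very small limits)
      effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
    }
    return effectiveLimit;
  }

  public static void main(String[] args) {
    // Both requests need the same top 20 values overall.
    System.out.println("offset=0,  limit=20 -> " + defaultEffectiveLimit(0, 20));
    System.out.println("offset=10, limit=10 -> " + defaultEffectiveLimit(10, 10));
  }
}
```

Under these assumptions, "limit=20" gets an effective limit of 26 (6 extra buckets), while "offset=10, limit=10" keeps an effective limit of 10 (no extra buckets).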

I propose:

At a minimum, for logical consistency, default distributed facet overrequest 
should be converted to be a function/heuristic based on {{offset+limit}}.

Having thought through this somewhat, I suggest that the overrequest function 
should be a monotonically non-decreasing function (to avoid having an arbitrary 
inflection point where requesting _more_ values disables overrequest and 
results in _fewer_ values being requested in the initial phase). This function 
could simply be a static boost (e.g., {{f(x)=x+10}}), or could be a function 
that decays to the identity function for higher numbers of requested values 
where overrequest becomes unnecessary (e.g., {{f(x)=x+(90/(7+x))}}).
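
As a rough sketch of the two example functions, taking {{x = offset + limit}}: the class and method names below are hypothetical, and truncating the decaying variant to a {{long}} (the same cast the existing code uses) is my assumption, not something the proposal specifies. The helper simply checks the non-decreasing property over a range of integer inputs:

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch of the two example functions; x stands for offset + limit.
class ProposedOverrequestSketch {

  /** Static boost: f(x) = x + 10. */
  static long staticBoost(long requested) {
    return requested + 10;
  }

  /** Decaying boost: f(x) = x + 90 / (7 + x), truncated to a long (assumed rounding). */
  static long decayingBoost(long requested) {
    return (long) (requested + 90.0 / (7 + requested));
  }

  /** Checks f(x + 1) >= f(x) for every integer x in [0, maxRequested). */
  static boolean isNonDecreasing(LongUnaryOperator f, long maxRequested) {
    for (long x = 0; x < maxRequested; x++) {
      if (f.applyAsLong(x + 1) < f.applyAsLong(x)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (long x : new long[] {1, 10, 20, 100, 1000}) {
      System.out.printf("x=%d static=%d decaying=%d%n", x, staticBoost(x), decayingBoost(x));
    }
    System.out.println("decaying non-decreasing: "
        + isNonDecreasing(ProposedOverrequestSketch::decayingBoost, 10_000));
  }
}
```

One caveat on the decaying example: as a real-valued function it dips slightly for x below roughly 2.5 (f(0) is about 12.86, f(2) is exactly 12). With truncation the sketch stays non-decreasing on non-negative integers, but a different rounding choice (e.g., rounding up) would not, so the exact constants and rounding would need care.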



