[ https://issues.apache.org/jira/browse/SOLR-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437462#comment-17437462 ]
Michael Gibney commented on SOLR-15760:
---------------------------------------

This change is admittedly minor and arcane, but I was working with this part
of the code, trying to leverage the overrequest function/heuristic in
different contexts, and found that I was confused by it and reluctant to use
it in its current state. It's of course also possible that I'm missing
something here, so please let me know if you think I'm misunderstanding the
current situation or finding a problem where there isn't one ...

The example function above -- {{f\(x)=x+(90/(7+x))}} -- was chosen somewhat
arbitrarily, with the goal of being comparable in its input and output values
to the function/heuristic it replaces, while being more predictable and more
consistently applied. (For non-negative integer inputs, the above is
monotonically non-decreasing, with {{f\(0..6)=12}}, and decays to the identity
function for {{x>=84}}.)

> Improve default distributed facet overrequest function/heuristic
> ----------------------------------------------------------------
>
>                 Key: SOLR-15760
>                 URL: https://issues.apache.org/jira/browse/SOLR-15760
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Facet Module
>    Affects Versions: main (9.0)
>            Reporter: Michael Gibney
>            Priority: Minor
>
> In {{FacetFieldProcessor}} for distributed requests, additive
> {{facet.overrequest}} can be specified as an integer. In the absence of an
> explicitly specified value, a default overrequest is calculated according to
> the following function/heuristic:
> {code:java}
> if (fcontext.isShard()) {
>   if (freq.overrequest == -1) {
>     // add over-request if this is a shard request and if we have a small offset
>     // (large offsets will already be gathering many more buckets than needed)
>     if (freq.offset < 10) {
>       effectiveLimit = (long) (effectiveLimit * 1.1 + 4); // default: add 10% plus 4 (to overrequest for very small limits)
>     }
>   }...
> } ...
> {code}
> The logic of overrequesting is: there is some balance of redundant vs. unique
> values across different shards in a distributed context, with uneven
> distribution across shards. This can result in a situation where values with
> uniform distribution across shards may be erroneously excluded from results,
> in favor of values that may occur less frequently overall, but that (due to
> uneven distribution) occur very frequently on some nodes and not on others.
> "Refinement" is designed to ensure that all top-level values are informed by
> data from all shards -- but in the situation that overrequesting is designed
> to mitigate, refinement can't help for values that are excluded even from the
> initial request phase.
>
> The main issue is casting a wide-enough initial net. As such, the problem is
> more acute (calling for more of an "overrequest boost") when requesting
> relatively small numbers of values.
>
> The problem with the existing code is that, iiuc, there is no reason, from the
> point of view of the overrequest function/heuristic, to make a distinction
> between {{offset}} and {{limit}}. In terms of the need for overrequest,
> "offset=10, limit=10" should be equivalent to "limit=20" -- but the existing
> code treats these requests differently: adding overrequest to the latter, but
> not the former.
>
> I propose:
> At a minimum, for logical consistency, default distributed facet overrequest
> should be converted to be a function/heuristic based on {{offset+limit}}.
>
> Having thought through this somewhat, I suggest that the overrequest function
> should be a monotonically non-decreasing function (to avoid having an
> arbitrary inflection point where requesting _more_ values disables
> overrequest and results in _fewer_ values being requested in the initial
> phase).
> This function could simply be a static boost (e.g., {{f\(x)=x+10}}),
> or could be a function that decays to the identity function for higher
> numbers of requested values where overrequest becomes unnecessary (e.g.,
> {{f\(x)=x+(90/(7+x))}}).
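
To make the comparison between the existing heuristic and the example function concrete, here is a minimal standalone sketch. It is not Solr code: the class {{OverrequestSketch}} and its method names are hypothetical. It also rests on two assumptions that the issue text does not state: that a shard collects {{offset + effectiveLimit}} buckets in the initial phase, and that {{90/(7+x)}} is evaluated with integer division (which is what makes {{f\(0..6)=12}} and the identity behavior for {{x>=84}} hold).

```java
/**
 * Hypothetical sketch comparing the quoted default overrequest heuristic with
 * the example function f(x) = x + 90/(7+x). Names are illustrative only and
 * are not part of Solr.
 */
public class OverrequestSketch {

    /**
     * Mirrors the quoted snippet: add 10% plus 4 to the limit, but only when
     * offset < 10. Assumption: the shard then collects offset + effectiveLimit
     * buckets in the initial phase.
     */
    public static long existingTotal(long offset, long limit) {
        long effectiveLimit = limit;
        if (offset < 10) {
            effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
        }
        return offset + effectiveLimit;
    }

    /**
     * Example function from the issue, applied to x = offset + limit.
     * Assumption: integer division, so the boost shrinks to 0 once 7 + x > 90.
     */
    public static long proposed(long x) {
        return x + 90 / (7 + x);
    }

    /** Checks proposed(x + 1) >= proposed(x) for every x in [0, maxX). */
    public static boolean isNonDecreasing(long maxX) {
        for (long x = 0; x < maxX; x++) {
            if (proposed(x + 1) < proposed(x)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same number of values needed (20), different treatment today.
        System.out.println("existing, offset=0,  limit=20: " + existingTotal(0, 20));
        System.out.println("existing, offset=10, limit=10: " + existingTotal(10, 10));
        // Requesting one more value (offset 9 -> 10) lowers the initial request.
        System.out.println("existing, offset=9,  limit=10: " + existingTotal(9, 10));
        // The example function depends only on offset + limit.
        System.out.println("proposed, offset+limit=20:     " + proposed(20));
        System.out.println("proposed non-decreasing on [0, 1000]: " + isNonDecreasing(1000));
    }
}
```

Under these assumptions, the existing heuristic requests 26 buckets for "limit=20" but only 20 for "offset=10, limit=10", and moving from "offset=9, limit=10" (24 buckets) to "offset=10, limit=10" (20 buckets) reduces the initial request. The example function gives 23 for either way of reaching {{offset+limit=20}}.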