[jira] [Commented] (SOLR-15836) Address counterintuitive behavior of JSON "terms" subfacet refinement

Michael Gibney (Jira) Fri, 07 Jan 2022 10:09:07 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470781#comment-17470781
 ]


Michael Gibney commented on SOLR-15836:
---------------------------------------

I just added a bunch of commits to the PR making it now more of a "proposed 
solution" than "demonstration of problem". I believe this PR also now addresses 
SOLR-12556 -- both by introducing a new {{iterative}} refinement method that's 
capable of generating the intuitive "correct" result there, and also by 
extending {{isBucketComplete}} to account for {{processEmpty}} (thus conforming 
to the contract of bucket-completeness for {{simple}} refinement).

The two main changes are:
# the introduction of an {{iterative}} refinement method for "terms" facets, 
which differs from the current default/only {{simple}} refinement method in 
that {{iterative}} ensures that nested facets all have the opportunity to 
contribute "their own" top buckets (with extra subsequent refinement passes, if 
necessary). The relevant section of the refguide is modified in the PR to 
reflect this addition:
{quote}{{refine}}: Relevant only for distributed requests; may be {{none}} 
(alias {{false}}), {{simple}} (alias {{true}}), or {{iterative}}. If {{none}} 
(the default), all counts and stats for this facet are returned in the initial 
pass, and there is no guarantee that returned counts or stats for a given term 
will reflect contributions from all shards. {{simple}} turns on single-pass 
distributed facet refining, invoking a second phase to retrieve any buckets 
needed for the final result from shards that did not include those buckets in 
their initial results, so that every shard contributes to every returned bucket 
in this facet and any sub-facets. Both {{simple}} and {{iterative}} refinement 
guarantee that any buckets returned will be "complete" (reflecting 
contributions from all shards); {{iterative}} is similar to {{simple}}, but 
offers a stronger guarantee that any relevant buckets will reflect 
contributions from all shards, and thus be eligible to be returned. 
({{iterative}} is most useful when used on nested facets, where refinement on 
parent facets may "uncover" new child facet values; for some use patterns, 
{{iterative}} may greatly increase accuracy, but that accuracy comes at the 
expense of potentially making more intra-cloud distributed requests).
{quote}
# the introduction of a boolean {{topLevel}} facet property, only relevant for 
nested facets in distributed mode. It is described in the refguide update:
{quote}For distributed requests, specifying the {{topLevel:true}} property on a 
subfacet blocks evaluation of the associated subfacet until its parent facet 
has been completely evaluated (including any refinement). When the parent facet 
has determined the exact buckets that it will return, only then will the 
subfacet be evaluated, under only those particular parent buckets (as if it 
were a "top level" facet). This allows decoupling child-level fanout from 
parental overrequest, at the expense of more intra-cluster requests.
{quote}
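
For illustration, here is a rough sketch of where the two proposed properties would sit in a JSON Facet request. The field names ({{category}}, {{brand}}) and limits are hypothetical, and {{refine: "iterative"}} and {{topLevel: true}} exist only as proposed in this PR, so treat this as an assumption about the eventual syntax rather than a released API.

```python
import json

# Hypothetical field names and limits, for illustration only.
# "refine": "iterative" and "topLevel": true are the properties proposed
# in the PR; they are assumptions here, not part of a released Solr API.
json_facet = {
    "categories": {
        "type": "terms",
        "field": "category",
        "limit": 5,
        "refine": "iterative",  # proposed: allow extra refinement passes
        "facet": {
            "brands": {
                "type": "terms",
                "field": "brand",
                "limit": 3,
                "refine": "simple",
                # proposed: evaluate only after the parent's final
                # buckets are known
                "topLevel": True,
            }
        },
    }
}

# Body for a JSON Request API call.
body = json.dumps({"query": "*:*", "facet": json_facet}, indent=2)
print(body)
```

The sketch only shows placement: {{refine}} is set per facet, and {{topLevel}} is set on the subfacet whose evaluation should be deferred.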

This should be ready for review, at least at a high level. Test coverage could 
stand to be increased (particularly for {{topLevel}}, whose behavior isn't as 
well-covered by trivial modification of existing tests). I left the original 
commit history intact, but if potential reviewers would prefer, I'd be willing 
to reorganize the commits to be a little less unwieldy, and force-push or open 
a new PR. 

> Address counterintuitive behavior of JSON "terms" subfacet refinement
> ---------------------------------------------------------------------
>
>                 Key: SOLR-15836
>                 URL: https://issues.apache.org/jira/browse/SOLR-15836
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 9.0, 8.11
>            Reporter: Michael Gibney
>            Assignee: Michael Gibney
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In distributed faceting, uneven distribution of terms across different shards 
> can artificially include or exclude terms (this discussion will focus on JSON 
> Facet "terms" faceting).
> This is inevitable, and can be mitigated via {{overrequest}} and 
> {{overrefine}} parameters -- respectively casting a "wider net" for "phase#1" 
> (determining the set of "terms of interest") and "phase#2" (cross-checking 
> "terms of interest" against shards that did not initially report them).
> It is possible to devise artificial situations that push the limit of what 
> {{overrefine}} is capable of mitigating, resulting in counterintuitive 
> behavior. But despite such edge cases, in general it is relatively 
> straightforward to reason about how the {{simple}} JSON Facet refinement 
> method works for "flat" (i.e., non-hierarchical) terms facets.
> This issue discusses some ways in which subfacets (hierarchical or nested 
> facets) can more readily behave counterintuitively in practical usage, and 
> possible ways to address/mitigate such behavior.
> ---------------------
> AFAICT, the {{simple}} (default, currently the only) refinement method has 
> two defining requirements:
> # there is at most _one_ refinement request issued to each shard, and
> # any buckets returned are guaranteed to have accurate counts (or perhaps 
> more generally, stats?) reflecting contributions from all shards. (This makes 
> [no 
> guarantees|https://issues.apache.org/jira/browse/SOLR-11159?focusedCommentId=16103386#comment-16103386]
>  about buckets _not_ returned that would in principle be eligible to be 
> returned.)
>  
> The simplest counterintuitive case is when refinement of higher-level facets 
> uncovers more subfacets on shards that have no opportunity to influence 
> results/refinement of the child facet. I'm pretty sure it's this situation 
> that's described in [this 
> comment|https://github.com/apache/solr/blob/0287458f836e3b7ea4b2401538b29f3d2e9b6cf4/solr/core/src/test/org/apache/solr/search/facet/TestJsonFacetRefinement.java#L992-L994]
>  (by [~hossman]?):
> {code:java}
>     //   - or at the very least, if the purpose of "_l" is to give other buckets a chance to "bubble up"
>     //     in phase#2, then shouldn't a "_l" refinement requests still include the buckets choosen in
>     //     phase#1, and request that the shard fill them in in addition to returning its own top buckets?
> {code}
> The proposal in the above linked comment would work iff the "own top buckets" 
> returned in phase#2 did not introduce any new/unseen values (and note, the 
> only case in which returning "own top buckets" would be significant _would_ 
> be the case in which it would introduce new/unseen values). If new values 
> _were_ returned in phase#2, the only way to ensure that requirement2 is 
> respected would be to violate requirement1 (i.e., by issuing _another_ 
> refinement request to determine whether any other shards have anything to 
> contribute to the previously unseen value).
> This counterintuitive behavior can't exactly be called a "bug", because IIUC 
> the intuitive behavior is fundamentally incompatible with the current 
> default/only {{simple}} refinement method.
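
The tension between the two requirements above can be reproduced with a toy model. This is only a sketch, not Solr's implementation: the shard data, the function names, and the pass-planning logic are all hypothetical, and a refined shard is assumed to behave as in the quoted code comment (fill in the buckets chosen in phase#1 and also return its own top buckets). The coordinator is assumed to have already selected parent bucket A, and the parent limit is fixed at 1.

```python
from collections import defaultdict

# Hypothetical per-shard data: parent term -> child term -> count.
SHARDS = {
    "shard1": {"A": {"x": 10, "y": 9}},
    "shard2": {"B": {"z": 13}, "A": {"y": 12}},
}


def top(counts, limit):
    """Highest-count (term, count) pairs; ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def child_buckets(parent, limit, max_passes):
    """Toy coordinator: return the complete child buckets under `parent`.

    seen[child][shard] holds the count each shard has reported so far.
    A bucket is "complete" once every shard has reported on it.
    """
    seen = defaultdict(dict)
    visited = set()  # shards that have already evaluated `parent`

    # Phase 1: a shard reports `parent` only if it is that shard's own top
    # parent (parent limit 1), together with its top children.
    for shard, data in SHARDS.items():
        totals = {p: sum(c.values()) for p, c in data.items()}
        if parent in dict(top(totals, 1)):
            visited.add(shard)
            for child, n in top(data[parent], limit):
                seen[child][shard] = n

    for _ in range(max_passes):
        # Plan every request for this pass from what is known right now.
        plan = {s: [c for c in seen if s not in seen[c]] for s in SHARDS}
        new_shards = set(SHARDS) - visited
        if not new_shards and not any(plan.values()):
            break
        for shard, wanted in plan.items():
            local = SHARDS[shard].get(parent, {})
            for child in wanted:  # fill in buckets chosen elsewhere
                seen[child][shard] = local.get(child, 0)
            if shard in new_shards:  # also report this shard's own top
                for child, n in top(local, limit):
                    seen[child][shard] = n
                visited.add(shard)

    complete = {c: sum(by.values()) for c, by in seen.items()
                if len(by) == len(SHARDS)}
    return top(complete, limit)


simple = child_buckets("A", limit=1, max_passes=1)     # one refinement pass
iterative = child_buckets("A", limit=1, max_passes=3)  # extra passes allowed
print(simple, iterative)
```

With a single pass, shard2 uncovers {{y}} under A, but shard1 is never asked about it, so {{y}} stays incomplete and only {{x}} (count 10) can be returned. Allowing a further pass completes {{y}} (9 + 12 = 21), which then displaces {{x}} as the top child bucket.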



