[ https://issues.apache.org/jira/browse/SOLR-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518434#comment-17518434 ]
Michael Gibney commented on SOLR-16144:
---------------------------------------

Even if the current implementation is left as-is, we should at least throw an error if a client tries to explicitly set a {{min_popularity}} value less than {{0.00001}} (which would currently effectively exclude _all_ buckets).

However, I think it would be preferable to _not_ round these values internally. For a relatively high-cardinality field, perfect correlation for a {{background_popularity}} of 9/2,000,000 feels meaningful, and in any case well above any threshold that I might intuitively consider to indicate "noise". My sense is that different use cases would have different "noise" thresholds, and that the purpose of the {{min_popularity}} param is to allow clients to specify their own "noise" threshold. AFAICT it's cleaner and there's no real downside to deferring pop-value-rounding until the response is externalized to be sent back to the client.

[PR #790|https://github.com/apache/solr/pull/790] makes concrete the above "preferred" proposal.

> Don't internally round [foreground|background]_popularity values in
> RelatednessAgg
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-16144
>                 URL: https://issues.apache.org/jira/browse/SOLR-16144
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Facet Module
>    Affects Versions: main (10.0)
>            Reporter: Michael Gibney
>            Priority: Trivial
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The "relatedness" facet function supports the concept of
> {{foreground_popularity}} and {{background_popularity}} -- i.e., the
> cardinality of the intersection of bucket domain with the foreground and
> background sets (respectively), each normalized with respect to background
> set cardinality.
> The logic appears to be:
> # To provide clients with context of computed relatedness values
> # To preemptively (optionally) screen out "noise" from low-frequency terms
> via the {{min_popularity}} function parameter.
>
> For both purposes, popularity values are currently rounded to 5 digits.
> This issue proposes that although rounding to 5 digits makes sense for the
> _first_ case (providing context to clients), this arbitrary truncation does
> not make sense as currently implemented for internally evaluating threshold
> pop values for bucket inclusion.
>
> Consider the case of a high-cardinality field with a relatively large
> background set and a selective foreground set. For {{|background_set| =
> 2,000,000}} and a foreground set of cardinality 9, even a bucket with a
> domain that exactly matches the foreground set would be screened out, for
> _any_ explicit setting of {{min_popularity}}.
>
> This behavior is due to where the rounding takes place (internally, upon
> initial {{computeDerivedValues()}}). It is further problematic that
> {{RelatednessAgg}} will currently accept {{min_popularity < 0.00001}}, which
> would be guaranteed to exclude _all_ buckets.
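
To make the arithmetic in the 9/2,000,000 example concrete, here is a minimal standalone sketch. It is not the code in {{RelatednessAgg}}: the class and every method name below are hypothetical, and the half-up rounding mode is an assumption. It only contrasts a threshold check on a value already rounded to 5 digits with the same check on the unrounded value.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative only: none of these names come from Solr.
public class PopularityRounding {

    // Hypothetical helper: round to 5 decimal places (HALF_UP is an assumption).
    static double roundTo5Digits(double value) {
        return BigDecimal.valueOf(value)
                .setScale(5, RoundingMode.HALF_UP)
                .doubleValue();
    }

    // Popularity as described in the issue: intersection cardinality
    // normalized by the cardinality of the background set.
    static double popularity(long intersectionCount, long backgroundSize) {
        return (double) intersectionCount / backgroundSize;
    }

    // Threshold check against a value that was rounded first
    // (the behavior the issue describes as problematic).
    static boolean passesWhenRoundedEarly(long count, long backgroundSize, double minPopularity) {
        return roundTo5Digits(popularity(count, backgroundSize)) >= minPopularity;
    }

    // Threshold check against the unrounded value (the proposal);
    // rounding would then happen only when the response is built.
    static boolean passesUnrounded(long count, long backgroundSize, double minPopularity) {
        return popularity(count, backgroundSize) >= minPopularity;
    }

    public static void main(String[] args) {
        long count = 9;                // bucket domain exactly matches the foreground set
        long backgroundSize = 2_000_000;
        double minPopularity = 0.000001;

        System.out.println("raw popularity:     " + popularity(count, backgroundSize));
        System.out.println("rounded popularity: " + roundTo5Digits(popularity(count, backgroundSize)));
        System.out.println("passes (rounded early): " + passesWhenRoundedEarly(count, backgroundSize, minPopularity));
        System.out.println("passes (unrounded):     " + passesUnrounded(count, backgroundSize, minPopularity));
    }
}
```

Under these assumptions the raw popularity is 0.0000045, which rounds to 0 at 5 digits, so the early-rounded check fails for every positive {{min_popularity}} while the unrounded check passes for any threshold at or below 0.0000045.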