[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

Hoss Man (JIRA) Sun, 08 Jul 2018 19:28:29 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536494#comment-16536494
 ]


Hoss Man commented on SOLR-12343:
---------------------------------

Found one – it seems to be specific to the situation where {{overrequest==0}}, 
and the facet is nested under another facet?

Playing with the values of {{top_over}} and {{top_refine}}, it doesn't seem to 
matter whether the parent facet is refined; the key is whether the top facet also 
uses {{overrequest:0}} (fails) or {{overrequest:999}} (passes).

 
{noformat}
   [junit4]   2> 9990 INFO  (qtp1276305453-48) [    x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={df=text&distrib=false&_facet_={}&fl=id&fl=score&shards.purpose=1048580&start=0&fsv=true&shard.url=127.0.0.1:47372/solr/collection1&rows=0&version=2&q=*:*&json.facet={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++++++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'sum_p+asc'++++++++++++++++,+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}&NOW=1531102182236&isShard=true&wt=javabin}
 hits=9 status=0 QTime=17
   [junit4]   2> 9994 INFO  (qtp1276305453-49) [    x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={df=text&distrib=false&_facet_={"refine":{"all":{"_p":[["z_all",{"cat_count":{"_l":["A","B","C"]},"cat_price":{"_l":["A","B","C"]}}]]}}}&shards.purpose=2097152&shard.url=127.0.0.1:47372/solr/collection1&rows=0&version=2&q=*:*&json.facet={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++++++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'sum_p+asc'++++++++++++++++,+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}&NOW=1531102182236&isShard=true&facet=false&wt=javabin}
 hits=9 status=0 QTime=1
   [junit4]   2> 9996 INFO  (qtp1503674478-65) [    x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={shards=127.0.0.1:54950/solr/collection1,127.0.0.1:47372/solr/collection1,127.0.0.1:52833/solr/collection1&shards=debugQuery&shards=true&q=*:*&json.facet={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++++++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++++++++++++++,+refine:true,+sort:'sum_p+asc'++++++++++++++++,+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}&indent=true&rows=0&wt=json&version=2.2}
 hits=19 status=0 QTime=25
   [junit4]   2> 9997 ERROR 
(TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50])
 [    ] o.a.s.SolrTestCaseHS query failed JSON validation. error=mismatch: 
'X'!='C' @ facets/all/buckets/[0]/cat_count/buckets/[2]/val
   [junit4]   2>  expected =facets=={ count: 19,all:{ buckets:[   { val:z_all, 
count: 19,    cat_count:{ buckets:[                  {val:A,count:1},           
      {val:B,count:1},                 {val:X,count:4},    ] },    cat_price:{ 
buckets:[                  {val:A,count:1,sum_p:1.0},                 
{val:B,count:1,sum_p:1.0},                 {val:X,count:4,sum_p:4.0},    ] }} ] 
} }
   [junit4]   2>  response = {
   [junit4]   2>   "responseHeader":{
   [junit4]   2>     "status":0,
   [junit4]   2>     "QTime":25},
   [junit4]   2>   "response":{"numFound":19,"start":0,"maxScore":1.0,"docs":[]
   [junit4]   2>   },
   [junit4]   2>   "facets":{
   [junit4]   2>     "count":19,
   [junit4]   2>     "all":{
   [junit4]   2>       "buckets":[{
   [junit4]   2>           "val":"z_all",
   [junit4]   2>           "count":19,
   [junit4]   2>           "cat_price":{
   [junit4]   2>             "buckets":[{
   [junit4]   2>                 "val":"A",
   [junit4]   2>                 "count":1,
   [junit4]   2>                 "sum_p":1.0},
   [junit4]   2>               {
   [junit4]   2>                 "val":"B",
   [junit4]   2>                 "count":1,
   [junit4]   2>                 "sum_p":1.0},
   [junit4]   2>               {
   [junit4]   2>                 "val":"C",
   [junit4]   2>                 "count":6,
   [junit4]   2>                 "sum_p":6.0}]},
   [junit4]   2>           "cat_count":{
   [junit4]   2>             "buckets":[{
   [junit4]   2>                 "val":"A",
   [junit4]   2>                 "count":1},
   [junit4]   2>               {
   [junit4]   2>                 "val":"B",
   [junit4]   2>                 "count":1},
   [junit4]   2>               {
   [junit4]   2>                 "val":"C",
   [junit4]   2>                 "count":6}]}}]}}}
   [junit4]   2> 
   [junit4]   2> 10000 INFO  
(TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50])
 [    ] o.a.s.SolrTestCaseJ4 ###Ending 
testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestJsonFacetRefinement 
-Dtests.method=testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN 
-Dtests.seed=775BF43EF8268D50 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=pl-PL -Dtests.timezone=Africa/Bamako -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] ERROR   4.32s | 
TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN
 <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: mismatch: 'X'!='C' @ 
facets/all/buckets/[0]/cat_count/buckets/[2]/val
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([775BF43EF8268D50:DB8655EB2671818E]:0)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseHS.matchJSON(SolrTestCaseHS.java:161)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseHS.assertJQ(SolrTestCaseHS.java:143)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseHS$Client$Tester.assertJQ(SolrTestCaseHS.java:255)
   [junit4]    >        at 
org.apache.solr.SolrTestCaseHS$Client.testJQ(SolrTestCaseHS.java:297)
   [junit4]    >        at 
org.apache.solr.search.facet.TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN(TestJsonFacetRefinement.java:568)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> 10016 INFO  
(SUITE-TestJsonFacetRefinement-seed#[775BF43EF8268D50]-worker) [    ] 
o.e.j.s.Abs
{noformat}


...I haven't worked through it yet to figure out the problem, but my initial 
impression is that I made this test too aggressive? I'm not sure it's safe to 
assert correct results with {{top_over=1}} ... but I'm not sure why it matters 
what the sub-facet overrequest is in that case?
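
For readability, here is a sketch of the {{json.facet}} request from the log above, decoded from its URL-encoded form and rebuilt as a Python dict. The function name {{build_json_facet}} and the parameter {{sub_over}} are hypothetical, introduced only for illustration; {{top_over}} and {{top_refine}} mirror the names mentioned in the comment. The field names, limits, and sorts are copied from the logged request. This is not the test's actual template code.

```python
import json


def build_json_facet(top_over=0, sub_over=0, top_refine=True):
    """Rebuild the nested terms facet seen in the logged request (illustrative only)."""
    # Settings shared by both sub-facets in the logged request.
    sub_common = {
        "type": "terms",
        "field": "cat_s",
        "limit": 3,
        "overrequest": sub_over,
        "refine": True,
    }
    return {
        "all": {
            "type": "terms",
            "field": "all_ss",
            "limit": 1,
            "refine": top_refine,
            "overrequest": top_over,
            "facet": {
                # Sub-facet sorted by ascending count.
                "cat_count": {**sub_common, "sort": "count asc"},
                # Sub-facet sorted by an ascending stat.
                "cat_price": {
                    **sub_common,
                    "sort": "sum_p asc",
                    "facet": {"sum_p": "sum(price_i)"},
                },
            },
        }
    }


# The combination reported as failing: top facet and sub-facets both use overrequest:0.
failing = json.dumps(build_json_facet(top_over=0, sub_over=0))
# The combination reported as passing: only the top facet's overrequest is raised.
passing = json.dumps(build_json_facet(top_over=999, sub_over=0))
```

Either string could be sent as the {{json.facet}} request parameter; only the top-level {{overrequest}} value differs between the two.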

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very 
> high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the cumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...
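
The termX/termY scenario from the description above can be reproduced with a toy model. The sketch below is a deliberately simplified two-phase merge written for illustration; it is not Solr's actual refinement code. All names ({{top_n}}, {{simple_refine}}) and all counts are made up. It assumes each shard returns its best {{limit + overrequest}} buckets for a {{count asc}} sort, the coordinator refines only the top {{limit}} merged buckets, and then re-sorts every known bucket.

```python
from collections import Counter


def top_n(counts, n):
    """Return the n best (term, count) pairs for 'count asc', ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


def simple_refine(shards, limit, overrequest=0):
    """Toy model of simple refinement; returns (final buckets, refined terms)."""
    # Phase 1: every shard reports only its own best buckets.
    responses = [dict(top_n(shard, limit + overrequest)) for shard in shards]
    merged = Counter()
    for resp in responses:
        merged.update(resp)

    # The coordinator picks the current top `limit` buckets for refinement.
    candidates = [term for term, _ in top_n(merged, limit)]

    # Phase 2: ask each shard that did not report a candidate for its count.
    refined = dict(merged)
    for term in candidates:
        for shard, resp in zip(shards, responses):
            if term not in resp:
                refined[term] += shard.get(term, 0)

    # Re-sorting ALL known buckets is what lets unrefined buckets back in.
    return top_n(refined, limit), set(candidates)


# termX and termY are rare on shard1 but common on shard2.
shard1 = {"termX": 1, "termY": 3, "termA": 50, "termB": 60}
shard2 = {"termA": 5, "termB": 6, "termX": 80, "termY": 95}

buckets, refined_terms = simple_refine([shard1, shard2], limit=1, overrequest=1)
true_totals = Counter(shard1) + Counter(shard2)

print(buckets)                # [('termY', 3)]
print(refined_terms)          # {'termX'}
print(true_totals["termY"])   # 98
print(top_n(true_totals, 1))  # [('termA', 55)]
```

In this model termX is refined (1 + 80 = 81) and drops out of the top 1, so termY is returned with only its shard1 count of 3, although its collection-wide count is 98 and the truly lowest-count term is termA.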



