[
https://issues.apache.org/jira/browse/SOLR-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177564#comment-14177564
]
Hoss Man commented on SOLR-6351:
--------------------------------
bq. The reason for previous random test failures were facet.limit,
facet.offset, facet.overrequest.count, facet.overrequest.ratio parameters
generated randomly, this was leading to inconsistent stats with pivot stats.
...that comment didn't really make sense to me based on how the code (should)
work, so i asked Vitaliy about it on IRC this morning. Talking it through with
him, neither one of us could explain conceptually why it should matter if those
params were used -- once a given pivot constraint is selected to be returned to
the client, the "sausage" of why/how that constraint was selected shouldn't
affect the "stats" associated with that constraint -- from the point of view of
the stats code, there's just a DocSet (on each shard) representing a subset of
the full set of matches constrained by a term filter (the pivot constraint),
and those (per-pivot-constraint) stats can then be merged on the coordinator
node just like the top level stats.
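To make that merging argument concrete, here is a minimal sketch in Python. It is not Solr's actual (Java) code: the ShardStats class and the merge, mean and stddev helpers are hypothetical names, and the sample values are made up. Each shard summarizes its own DocSet, and the coordinator combines the summaries without knowing how the constraint was selected.

```python
import math
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard summary of one numeric field over one DocSet."""
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    total: float = 0.0
    sum_of_squares: float = 0.0

    @classmethod
    def from_values(cls, values):
        # What a single shard could compute locally for its matching docs.
        stats = cls()
        for v in values:
            stats.count += 1
            stats.minimum = min(stats.minimum, v)
            stats.maximum = max(stats.maximum, v)
            stats.total += v
            stats.sum_of_squares += v * v
        return stats


def merge(a, b):
    # Coordinator-side merge: only the summaries are needed, not the docs.
    return ShardStats(
        count=a.count + b.count,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
        total=a.total + b.total,
        sum_of_squares=a.sum_of_squares + b.sum_of_squares,
    )


def mean(s):
    return s.total / s.count if s.count else math.nan


def stddev(s):
    # Sample standard deviation derived purely from the merged summary.
    if s.count < 2:
        return 0.0
    return math.sqrt((s.sum_of_squares - s.total ** 2 / s.count) / (s.count - 1))


# Two shards each hold part of the docs matching one pivot constraint.
shard_a = ShardStats.from_values([2.0, 4.0])
shard_b = ShardStats.from_values([6.0])
merged = merge(shard_a, shard_b)
print(merged.count, merged.minimum, merged.maximum, merged.total,
      mean(merged), stddev(merged))
```

The point of the sketch is that the merge only depends on every shard contributing its summary for the constraint; if one shard's contribution is never merged in, the result silently describes a subset of the docs.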
So we ended the conversation with the assumption that there must either be a
bug in the test code that validates the randomly generated pivots+stats, or
there must be a bug in the actual pivot+stats logic.
After reading over the changes to TestCloudPivotFacets i couldn't spot any
obvious test flaws, so i started doing some manual testing and i think i've
uncovered the problem: It looks like Vitaliy's new code doesn't account for
stats returned by a shard in response to refinement requests.
This is pretty easy to reproduce if you spin up a 2 node system (ports 7777 &
8888) using the example configs, then...
* add a few docs to each node{noformat}
curl -sS 'http://localhost:7777/solr/collection1/update?commit=true' -H
'Content-Type: application/json' --data-binary '
[{"id": 71, "foo_s": "aaa", "bar_i": 1},
{"id": 72, "foo_s": "aaa", "bar_i": 20},
{"id": 73, "foo_s": "bbb", "bar_i": 300}]
'
curl -sS 'http://localhost:8888/solr/collection1/update?commit=true' -H
'Content-Type: application/json' --data-binary '
[{"id": 81, "foo_s": "bbb", "bar_i": 4000},
{"id": 82, "foo_s": "bbb", "bar_i": 50000},
{"id": 83, "foo_s": "aaa", "bar_i": 600000}]
'
{noformat}...note that "aaa" is the dominant term in the 7777 node, but "bbb"
is the dominant term in the 8888 node.
* do a simple pivot query + stats -- the default overrequest is more than
enough to resolve pivots fully in a single pass...{noformat}
curl -sS
'http://localhost:8888/solr/collection1/select?q=*:*&shards=localhost:8888/solr,localhost:7777/solr&facet.pivot=\{!stats=sss\}foo_s&stats.field=\{!tag=sss\}bar_i&facet=true&stats=true&wt=json&indent=true&rows=0'
{
"responseHeader":{
"status":0,
"QTime":182,
"params":{
"facet":"true",
"shards":"localhost:8888/solr,localhost:7777/solr",
"indent":"true",
"stats":"true",
"stats.field":"{!tag=sss}bar_i",
"q":"*:*",
"wt":"json",
"facet.pivot":"{!stats=sss}foo_s",
"rows":"0"}},
"response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_pivot":{
"foo_s":[{
"field":"foo_s",
"value":"aaa",
"count":3,
"stats":{
"stats_fields":{
"bar_i":{
"min":1.0,
"max":600000.0,
"count":3,
"missing":0,
"sum":600021.0,
"sumOfSquares":3.60000000401E11,
"mean":200007.0,
"stddev":346404.0994662159,
"facets":{}}}}},
{
"field":"foo_s",
"value":"bbb",
"count":3,
"stats":{
"stats_fields":{
"bar_i":{
"min":300.0,
"max":50000.0,
"count":3,
"missing":0,
"sum":54300.0,
"sumOfSquares":2.51609E9,
"mean":18100.0,
"stddev":27688.08407961808,
"facets":{}}}}}]}},
....
{noformat}...note that the stats for "bar_i" under pivot constraints "aaa" and
"bbb" look correct (at least min/max/count/sum do ... i'm assuming the rest are
as well)
* do the same query again, but set facet.limit=1 -- note the stats for 'aaa'
are exactly the same as before due to the default overrequest values still
resulting in no refinement queries needed...{noformat}
curl -sS
'http://localhost:8888/solr/collection1/select?q=*:*&shards=localhost:8888/solr,localhost:7777/solr&facet.pivot=\{!stats=sss\}foo_s&stats.field=\{!tag=sss\}bar_i&facet=true&stats=true&wt=json&indent=true&rows=0&facet.limit=1'
{
"responseHeader":{
"status":0,
"QTime":14,
"params":{
"facet":"true",
"shards":"localhost:8888/solr,localhost:7777/solr",
"indent":"true",
"stats":"true",
"stats.field":"{!tag=sss}bar_i",
"q":"*:*",
"facet.limit":"1",
"wt":"json",
"facet.pivot":"{!stats=sss}foo_s",
"rows":"0"}},
"response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_pivot":{
"foo_s":[{
"field":"foo_s",
"value":"aaa",
"count":3,
"stats":{
"stats_fields":{
"bar_i":{
"min":1.0,
"max":600000.0,
"count":3,
"missing":0,
"sum":600021.0,
"sumOfSquares":3.60000000401E11,
"mean":200007.0,
"stddev":346404.0994662159,
"facets":{}}}}}]}},
....
{noformat}
* now disable overrequesting, so that a refinement request *must* happen -- now
the numbers are wrong. it's obvious from the min/max/sum that the stats are
only coming from the node '7777' that returned 'aaa' as a pivot constraint on
the initial request, and no new stats were merged in after the refinement
request to localhost:8888...{noformat}
curl -sS
'http://localhost:8888/solr/collection1/select?q=*:*&shards=localhost:8888/solr,localhost:7777/solr&facet.pivot=\{!stats=sss\}foo_s&stats.field=\{!tag=sss\}bar_i&facet=true&stats=true&wt=json&indent=true&rows=0&facet.limit=1&facet.overrequest.count=0&facet.overrequest.ratio=0'
{
"responseHeader":{
"status":0,
"QTime":29,
"params":{
"facet.overrequest.count":"0",
"facet":"true",
"shards":"localhost:8888/solr,localhost:7777/solr",
"indent":"true",
"stats":"true",
"stats.field":"{!tag=sss}bar_i",
"q":"*:*",
"facet.limit":"1",
"facet.overrequest.ratio":"0",
"wt":"json",
"facet.pivot":"{!stats=sss}foo_s",
"rows":"0"}},
"response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_pivot":{
"foo_s":[{
"field":"foo_s",
"value":"aaa",
"count":3,
"stats":{
"stats_fields":{
"bar_i":{
"min":1.0,
"max":20.0,
"count":2,
"missing":0,
"sum":21.0,
"sumOfSquares":401.0,
"mean":10.5,
"stddev":13.435028842544403,
"facets":{}}}}}]}},
....
{noformat}
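As a sanity check on that reading of the numbers, here is a small Python sketch (the summarize helper is a hypothetical name, and the sample standard deviation formula is an assumption that happens to match the responses above). It recomputes the "aaa" stats from the bar_i values indexed earlier, once for node 7777 alone and once for both nodes.

```python
import math


def summarize(values):
    # Recompute the stats fields shown in the responses from raw values.
    n = len(values)
    total = float(sum(values))
    sum_sq = float(sum(v * v for v in values))
    # Assumption: sample standard deviation (divide by n - 1).
    stddev = math.sqrt((sum_sq - total * total / n) / (n - 1)) if n > 1 else 0.0
    return {
        "min": float(min(values)),
        "max": float(max(values)),
        "count": n,
        "sum": total,
        "sumOfSquares": sum_sq,
        "mean": total / n,
        "stddev": stddev,
    }


# bar_i values of the "aaa" docs, per node, from the update requests above.
aaa_on_7777 = [1, 20]
aaa_on_8888 = [600000]

only_7777 = summarize(aaa_on_7777)
both_shards = summarize(aaa_on_7777 + aaa_on_8888)

print("7777 only:  ", only_7777)
print("both shards:", both_shards)
```

The 7777-only numbers line up with the response to the no-overrequest query, and the both-shards numbers line up with the two earlier responses, which is consistent with the refinement response from 8888 never being merged.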
----
I know some of the existing pivot tests (like DistributedFacetPivotLongTailTest
& DistributedFacetPivotLargeTest) have sections that ensure refinement is
working -- if we update those sections to also include some stats & assertions
on those stats we should be able to reliably reproduce this bug and then work
on fixing it.
> Let Stats Hang off of Pivots (via 'tag')
> ----------------------------------------
>
> Key: SOLR-6351
> URL: https://issues.apache.org/jira/browse/SOLR-6351
> Project: Solr
> Issue Type: Sub-task
> Reporter: Hoss Man
> Attachments: SOLR-6351.patch, SOLR-6351.patch, SOLR-6351.patch,
> SOLR-6351.patch, SOLR-6351.patch, SOLR-6351.patch, SOLR-6351.patch,
> SOLR-6351.patch
>
>
> The goal here is basically to flip the notion of "stats.facet" on its head, so
> that instead of asking the stats component to also do some faceting
> (something that's never worked well with the variety of field types and has
> never worked in distributed mode) we instead ask the PivotFacet code to
> compute some stats X for each leaf in a pivot. We'll do this with the
> existing {{stats.field}} params, but we'll leverage the {{tag}} local param
> of the {{stats.field}} instances to be able to associate which stats we want
> hanging off of which {{facet.pivot}}
> Example...
> {noformat}
> facet.pivot={!stats=s1}category,manufacturer
> stats.field={!key=avg_price tag=s1 mean=true}price
> stats.field={!tag=s1 min=true max=true}user_rating
> {noformat}
> ...with the request above, in addition to computing the min/max user_rating
> and mean price (labeled "avg_price") over the entire result set, the
> PivotFacet component will also include those stats for every node of the tree
> it builds up when generating a pivot of the fields "category,manufacturer"