[ 
https://issues.apache.org/jira/browse/SOLR-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-13056:
----------------------------------
    Description: 
Using {{SortableTextField}} for distributed faceting can lead to wrong results. 
This can be demonstrated by installing the cloud version of the 
{{gettingstarted}} sample with

{{./solr -e cloud}}

accepting the defaults throughout, except for {{shards}}, which should be 
{{3}}. After that, a corpus can be indexed with

{{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo "{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}
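
For readability, the same payload can also be produced by a small function, 
which makes it possible to inspect the JSON before posting it. This is only a 
sketch: {{make_corpus}} is a hypothetical name, not part of Solr, and the 
{{curl}} call is left commented out so the block runs without a Solr instance.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the test corpus as a JSON array on stdout.
# 100 documents with facet_t_sort "a b <n>" plus 2 documents with "a b".
make_corpus() {
  local j
  echo '['
  for j in $(seq 0 99); do
    printf '{"id":"%s","facet_t_sort":"a b %s"},\n' "$j" "$j"
  done
  echo '{"id":"duplicate_1","facet_t_sort":"a b"},'
  echo '{"id":"duplicate_2","facet_t_sort":"a b"}'
  echo ']'
}

# Count the generated documents before indexing them.
make_corpus | grep -c '"id"'

# Index the corpus (same endpoint as the one-liner above):
# make_corpus | curl -s -d @- -X POST -H 'Content-Type: application/json' \
#   'http://localhost:8983/solr/gettingstarted/update?commit=true'
```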

The indexing command creates 100 documents with a single-valued field 
{{facet_t_sort:"a b X"}}, where X is the document number, plus 2 documents with 
{{facet_t_sort:"a b"}}. The call

{{curl 
'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'}}

should return "a b" as the top facet term with count 2, but returns

{{ {}}
{{ "responseHeader":{}}
{{ "zkConnected":true,}}
{{ "status":0,}}
{{ "QTime":13,}}
{{ "params":{}}
{{ "facet.limit":"5",}}
{{ "q":"*:*",}}
{{ "facet.field":"facet_t_sort",}}
{{ "rows":"0",}}
{{ "facet":"on"} },}}
{{ "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]}}
{{ },}}
{{ "facet_counts":{}}
{{ "facet_queries":{},}}
{{ "facet_fields":{}}
{{ "facet_t_sort":[}}
{{ "a b",36,}}
{{ "a b 0",1,}}
{{ "a b 1",1,}}
{{ "a b 10",1,}}
{{ "a b 11",1]},}}
{{ "facet_ranges":{},}}
{{ "facet_intervals":{},}}
{{ "facet_heatmaps":{} } } }}

The problem lies in the second phase of simple faceting, where the 
fine-counting happens. In the first phase, "a b" is returned from 1 or 2 of the 
3 shards. It wins the popularity contest, as there are 2 "a b" terms and only 1 
of each of the other terms. The 1 or 2 shards that did not deliver "a b" in the 
first phase are then queried for their count for "a b", which happens in the 
form of a {{facet_t_sort:"a b"}} lookup. It seems that this lookup uses the 
analyzer chain and thus matches _all_ the documents in that shard 
(approximately 102/3 = 34) instead of only the documents whose value is exactly 
"a b".
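
The suspected mechanism can be illustrated with a toy model in plain shell. 
This is a sketch under stated assumptions, not Solr code: the function names 
are hypothetical, documents are spread round-robin over 3 shards (Solr actually 
routes by hash), the two "a b" documents are assumed to land on shards 0 and 1, 
and the analyzed lookup is approximated by a whole-word match with 
{{grep -w}}.

```shell
#!/usr/bin/env bash
# Toy model of distributed facet refinement; not Solr code.

# Hypothetical layout: documents 0..99 are spread round-robin over 3 shards,
# and the two "a b" duplicates are assumed to sit on shards 0 and 1.
shard_values() {
  local shard=$1 j
  for j in $(seq 0 99); do
    if (( j % 3 == shard )); then
      echo "a b $j"
    fi
  done
  if (( shard < 2 )); then
    echo "a b"
  fi
}

# Term count as faceting should see it: the whole stored value must match.
exact_count() {
  shard_values "$1" | grep -c -x -F -- "$2" || true
}

# Approximation of an analyzed lookup: the words of the term only need to
# occur in the value, so "a b" also matches "a b 17".
analyzed_count() {
  shard_values "$1" | grep -c -w -F -- "$2" || true
}

# Merge the per-shard counts. A shard that did not report the term in the
# first phase is asked again, using the lookup selected by $2.
merged_count() {
  local term=$1 mode=$2 total=0 shard n
  for shard in 0 1 2; do
    n=$(exact_count "$shard" "$term")
    if (( n == 0 )) && [[ $mode == analyzed ]]; then
      n=$(analyzed_count "$shard" "$term")
    fi
    total=$(( total + n ))
  done
  echo "$total"
}

merged_count "a b" exact      # prints 2
merged_count "a b" analyzed   # prints 35
```

In this layout shard 2 holds 33 documents and none with the exact value "a b", 
so the analyzed refinement adds 33 to the correct total of 2. The reported 
count of 36 would be consistent with the same pattern on a shard holding 34 
documents, although the actual shard sizes are not given here.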


> SortableTextField is trappy for faceting
> ----------------------------------------
>
>                 Key: SOLR-13056
>                 URL: https://issues.apache.org/jira/browse/SOLR-13056
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>    Affects Versions: 7.6
>            Reporter: Toke Eskildsen
>            Priority: Major


