[ https://issues.apache.org/jira/browse/SOLR-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Toke Eskildsen updated SOLR-13056:
----------------------------------
Description:
Using {{SortableTextField}} for distributed faceting can lead to wrong results.
This can be demonstrated by installing the cloud-version of the
{{gettingstarted}} sample with
{{./solr -e cloud}}
using defaults all the way, except for {{shards}} which should be {{3}}. After
that a corpus can be indexed with
{{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo "{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}
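For readability, the one-liner above can also be written as a small shell function. This is only a sketch with the same intent; the function name {{make_corpus}} is made up for illustration and is not part of Solr.

```shell
#!/usr/bin/env bash
# Print the 102-document test corpus as a JSON array, one document per line.
# make_corpus is a hypothetical helper name, used here for illustration only.
make_corpus() {
  echo '['
  for J in $(seq 0 99); do
    # 100 documents with the unique values "a b 0" .. "a b 99"
    printf '{"id":"%s","facet_t_sort":"a b %s"},\n' "$J" "$J"
  done
  # 2 documents sharing the value "a b"
  echo '{"id":"duplicate_1","facet_t_sort":"a b"},'
  echo '{"id":"duplicate_2","facet_t_sort":"a b"}'
  echo ']'
}

# Usage (assumes the gettingstarted collection described above is running):
# make_corpus | curl -s -d @- -X POST -H 'Content-Type: application/json' \
#   'http://localhost:8983/solr/gettingstarted/update?commit=true'
```

Keeping one document per line makes it easy to sanity-check the payload (for example with {{grep -c}}) before posting it.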
This will index 100 documents with a single-valued field {{facet_t_sort:"a b
X"}}, where X is the document number, plus 2 documents with
{{facet_t_sort:"a b"}}.
The call
{{curl 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'}}
should return "a b" as the top facet term with count 2, but returns
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":13,
    "params":{
      "facet.limit":"5",
      "q":"*:*",
      "facet.field":"facet_t_sort",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "facet_t_sort":[
        "a b",36,
        "a b 0",1,
        "a b 1",1,
        "a b 10",1,
        "a b 11",1]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
The problem is the second phase of simple faceting, where the fine-counting
(refinement) happens. In the first phase, "a b" is returned from 1 or 2 of the
3 shards. It wins the popularity contest, as there are 2 "a b"-terms and only 1
of each of the other terms. The 1 or 2 shards that did not deliver "a b" in the
first phase are then queried for their count of "a b", which happens in the
form of a {{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the
analyzer chain and thus matches _all_ the documents in that shard
(approximately 102/3, i.e. about 34). That would be consistent with the
reported count of 36: the 2 genuine "a b" documents plus every document of one
refined shard.
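One way to check what an analyzed {{facet_t_sort:"a b"}}-lookup matches is to ask each core directly, without the distributed fan-out. The sketch below uses the standard {{distrib=false}} request parameter; the helper names {{num_found}} and {{probe_core}} are made up for illustration, and the core URL must be taken from your own installation. It only shows how many documents an analyzed lookup matches per core; whether refinement issues exactly this lookup is the hypothesis of this issue, not something the sketch proves.

```shell
#!/usr/bin/env bash
# Extract the first numFound value from a Solr JSON response read on stdin.
# num_found is a hypothetical helper; it assumes the usual "numFound":N form.
num_found() {
  grep -o '"numFound":[0-9]*' | head -n 1 | cut -d: -f2
}

# Ask a single core (no distributed fan-out) how many documents the
# analyzed phrase lookup facet_t_sort:"a b" matches.
# $1 is the base URL of one core, which depends on the local setup.
probe_core() {
  core_url=$1
  curl -s -G "$core_url/select" \
    --data-urlencode 'q=facet_t_sort:"a b"' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'distrib=false' | num_found
}

# Usage (replace <core-name> with a real core name from the Solr admin UI):
# probe_core 'http://localhost:8983/solr/<core-name>'
```

If the number printed for a core is close to that core's total document count rather than 0, 1 or 2, the lookup is matching on analyzed tokens instead of the exact value "a b".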
> SortableTextField is trappy for faceting
> ----------------------------------------
>
> Key: SOLR-13056
> URL: https://issues.apache.org/jira/browse/SOLR-13056
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search
> Affects Versions: 7.6
> Reporter: Toke Eskildsen
> Priority: Major