[jira] [Comment Edited] (SOLR-8496) Facet search count numbers are falsified by older document versions

Vasiliy Bout (JIRA) Fri, 15 Jan 2016 08:29:58 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102003#comment-15102003
 ]


Vasiliy Bout edited comment on SOLR-8496 at 1/15/16 4:28 PM:
-------------------------------------------------------------

I put together a small example that reproduces this problem on a completely 
new core with a very simple schema and 20 documents.

First of all, I created a new core with the following schema.xml:
{noformat}
<?xml version="1.0" ?>
<schema name="basic" version="1.1">
    <types>
        <fieldType name="string" class="solr.StrField" omitNorms="true" indexed="true" stored="true"/>
        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" indexed="true" stored="true"/>
    </types>
    <fields>
        <field name="id" type="string" required="true"/>
        <field name="foo_s" type="string"/>
        <field name="bar_s" type="string" docValues="true"/>
        <field name="foo_i" type="int"/>
        <field name="bar_i" type="int" docValues="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <solrQueryParser defaultOperator="OR"/>
</schema>
{noformat}

After that, I generated a set of documents to fill the core with. I launched 
the {{python}} interpreter in a terminal and typed the following one-liner:
{noformat}
[ {"id":i,"foo_i":i,"bar_i":i,"foo_s":i,"bar_s":i} for i in range(1, 21) ]
{noformat}

It gave me a set of 20 documents. This is the same set but slightly formatted 
to be human readable:
{noformat}
[
    {'bar_s': 1, 'foo_i': 1, 'bar_i': 1, 'foo_s': 1, 'id': 1},
    {'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2},
    {'bar_s': 3, 'foo_i': 3, 'bar_i': 3, 'foo_s': 3, 'id': 3},
    {'bar_s': 4, 'foo_i': 4, 'bar_i': 4, 'foo_s': 4, 'id': 4},
    {'bar_s': 5, 'foo_i': 5, 'bar_i': 5, 'foo_s': 5, 'id': 5},
    {'bar_s': 6, 'foo_i': 6, 'bar_i': 6, 'foo_s': 6, 'id': 6},
    {'bar_s': 7, 'foo_i': 7, 'bar_i': 7, 'foo_s': 7, 'id': 7},
    {'bar_s': 8, 'foo_i': 8, 'bar_i': 8, 'foo_s': 8, 'id': 8},
    {'bar_s': 9, 'foo_i': 9, 'bar_i': 9, 'foo_s': 9, 'id': 9},
    {'bar_s': 10, 'foo_i': 10, 'bar_i': 10, 'foo_s': 10, 'id': 10},
    {'bar_s': 11, 'foo_i': 11, 'bar_i': 11, 'foo_s': 11, 'id': 11},
    {'bar_s': 12, 'foo_i': 12, 'bar_i': 12, 'foo_s': 12, 'id': 12},
    {'bar_s': 13, 'foo_i': 13, 'bar_i': 13, 'foo_s': 13, 'id': 13},
    {'bar_s': 14, 'foo_i': 14, 'bar_i': 14, 'foo_s': 14, 'id': 14},
    {'bar_s': 15, 'foo_i': 15, 'bar_i': 15, 'foo_s': 15, 'id': 15},
    {'bar_s': 16, 'foo_i': 16, 'bar_i': 16, 'foo_s': 16, 'id': 16},
    {'bar_s': 17, 'foo_i': 17, 'bar_i': 17, 'foo_s': 17, 'id': 17},
    {'bar_s': 18, 'foo_i': 18, 'bar_i': 18, 'foo_s': 18, 'id': 18},
    {'bar_s': 19, 'foo_i': 19, 'bar_i': 19, 'foo_s': 19, 'id': 19},
    {'bar_s': 20, 'foo_i': 20, 'bar_i': 20, 'foo_s': 20, 'id': 20}
]
{noformat}
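
The interpreter output above uses Python's single-quoted repr, which is not strict JSON. A minimal sketch of producing the same 20 documents as strict JSON follows; the helper name {{make_docs}} is hypothetical and only for illustration:

```python
import json


def make_docs(first=1, last=20):
    """Build one document per integer, mirroring the one-liner above."""
    return [
        {"id": i, "foo_i": i, "bar_i": i, "foo_s": i, "bar_s": i}
        for i in range(first, last + 1)
    ]


docs = make_docs()

# json.dumps emits double-quoted keys, so the result is strict JSON that
# can be pasted into the "Documents" tab or sent to an update handler.
payload = json.dumps(docs, indent=4)
print(payload)
```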

After that I opened the Solr Admin page in my browser, went to the "Documents" 
tab of my core and filled the core with the set of documents above. I selected 
the following parameters:
* Request-Handler (qt): {{/update/json}};
* Document Type: {{Solr Command (raw XML or JSON)}};
* Documents: the JSON generated above in the Python interpreter.

After the Solr core was filled with documents, I added a single document once 
again, so that it overwrites the previous version:
{noformat}
{'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2}
{noformat}
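
The same overwrite can be scripted instead of using the Admin page. This is only a sketch under assumptions: the host {{localhost:8983}}, the core name {{test_core}} and the helper {{build_update_request}} are hypothetical, and the request is prepared but not sent:

```python
import json
import urllib.request

# Assumptions: Solr listens on localhost:8983 and the core is called
# "test_core". Adjust both to the actual installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/test_core/update/json?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) a POST that adds `docs` to the core."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Re-adding a document whose id already exists replaces the old version.
doc_2 = {"id": 2, "foo_i": 2, "bar_i": 2, "foo_s": 2, "bar_s": 2}
request = build_update_request([doc_2])

# urllib.request.urlopen(request) would send it to a running Solr instance.
```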

Now when I look at the "Overview" tab I see the following statistics:
{noformat}
Last Modified: less than a minute ago
Num Docs: 20
Max Doc: 21
Heap Memory Usage: -1
Deleted Docs: 1
Version: 7
Segment Count: 2
{noformat}
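
The statistics are consistent with an overwrite that has not been merged away yet: the replaced version of the document is only marked as deleted and still occupies a slot in the index. A tiny sketch of that arithmetic (the function name is hypothetical):

```python
def deleted_docs(num_docs, max_doc):
    """Documents that are marked deleted but still present in the segments."""
    return max_doc - num_docs


# Values from the "Overview" tab above.
overview_deleted = deleted_docs(num_docs=20, max_doc=21)

# Values from the original issue description after its single update.
issue_deleted = deleted_docs(num_docs=48545, max_doc=48546)

print(overview_deleted, issue_deleted)
```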

At this stage all multi-select facet queries give incorrect results. Since 
every document in the core has unique values in every field, every facet 
count should be {{1}}. Simple facet queries return correct results:

query is 
{{q=\*:\*&rows=0&facet=true&facet.limit=1&facet.field=foo_s&facet.field=foo_i&facet.field=bar_s&facet.field=bar_i}}
response is
{noformat}
{
  "responseHeader":{"status":0,"QTime":1},
  "response":{"numFound":20,"start":0,"docs":[]},
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "foo_s":["1",1],
      "foo_i":["1",1],
      "bar_s":["1",1],
      "bar_i":["1",1]
    },
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
{noformat}
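
For scripting the check, the simple facet query above can be assembled with the standard library. This is a sketch, not part of the original report; {{plain_facet_params}} is a hypothetical helper:

```python
from urllib.parse import parse_qs, urlencode

FACET_FIELDS = ["foo_s", "foo_i", "bar_s", "bar_i"]


def plain_facet_params(fields=FACET_FIELDS):
    """Parameters of the simple facet query, one facet.field per field."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.limit", "1"),
    ]
    # A list of pairs (rather than a dict) keeps the repeated facet.field keys.
    params += [("facet.field", field) for field in fields]
    return params


# urlencode percent-encodes "*" and ":", which Solr decodes again.
query_string = urlencode(plain_facet_params())
print(query_string)
```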

And this is what we get for a multi-select facet query:

query is 
{{q=\*:\*&fq=\{!tag=a\}id:\*&rows=0&facet=true&facet.limit=1&facet.field=\{!ex=a\}foo_s&facet.field=\{!ex=a\}foo_i&facet.field=\{!ex=a\}bar_s&facet.field=\{!ex=a\}bar_i}}
response is
{noformat}
{
  "responseHeader":{"status":0,"QTime":2},
  "response":{"numFound":20,"start":0,"docs":[]},
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "foo_s":["2",2],
      "foo_i":["2",2],
      "bar_s":["2",2],
      "bar_i":["2",2]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
{noformat}

So we get count {{2}} for value {{"2"}}, i.e. the replaced (old) version of the 
document with {{id=2}} is still counted when multi-select facets are used.
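
The comparison can be automated roughly as follows. This is a sketch under assumptions: {{multi_select_params}} and {{overcounted}} are hypothetical helpers, and the two dictionaries are the {{facet_fields}} sections copied from the responses above, where each field is a flat list of alternating values and counts:

```python
from urllib.parse import parse_qs, urlencode


def multi_select_params(fields, tag="a"):
    """Parameters of the multi-select query: tagged fq, excluded in facets."""
    params = [
        ("q", "*:*"),
        ("fq", "{!tag=%s}id:*" % tag),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.limit", "1"),
    ]
    params += [("facet.field", "{!ex=%s}%s" % (tag, f)) for f in fields]
    return params


def overcounted(facet_fields, expected=1):
    """Return {field: {value: count}} for every count above `expected`."""
    result = {}
    for field, flat in facet_fields.items():
        # Pair up the flat [value, count, value, count, ...] list.
        pairs = zip(flat[0::2], flat[1::2])
        bad = {value: count for value, count in pairs if count > expected}
        if bad:
            result[field] = bad
    return result


# facet_fields copied from the two responses above.
plain = {"foo_s": ["1", 1], "foo_i": ["1", 1],
         "bar_s": ["1", 1], "bar_i": ["1", 1]}
multi = {"foo_s": ["2", 2], "foo_i": ["2", 2],
         "bar_s": ["2", 2], "bar_i": ["2", 2]}

print(urlencode(multi_select_params(["foo_s", "foo_i", "bar_s", "bar_i"])))
print(overcounted(plain))
print(overcounted(multi))
```

Because every field in this test core is unique per document, any entry reported by {{overcounted}} points at a stale document version being counted.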


> Facet search count numbers are falsified by older document versions
> -------------------------------------------------------------------
>
>                 Key: SOLR-8496
>                 URL: https://issues.apache.org/jira/browse/SOLR-8496
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.4
>         Environment: Linux 3.16.0-4-amd64 x86_64 Debian 8.2
> openjdk-7-jre-headless:amd64   version 7u91-2.6.3-1~deb8u1
> solr-5.4.0, extracted from official tar
> Default solr settings from install script:
> SOLR_HEAP="512m"
> GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
> -XX:+PrintGCApplicationStoppedTime"
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
>            Reporter: Andreas Müller
>
> Our setup is based on multiple cores. In one core we have a multi-valued 
> field with integer values and some other unimportant fields. We're using 
> multi-select faceting on this field.
> We're querying a test scenario with:
> {code}
> http://localhost:8983/solr/core-name/select?q=dummyask: (true) AND 
> manufacturer: false AND id: (15039 16882 10850 
> 20781)&fq={!tag=professions}professions: 
> (59)&fl=id&wt=json&indent=true&facet=true&facet.field={!ex=professions}professions
> {code}
> - Query: (numDocs:48545, maxDoc:48545)
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">4</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> - Then we update one document and change some fields (numDocs:48545, 
> maxDoc:48546) *The number of maxDocs is increased*
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">5</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> *The Problem:*
> In the first query, we're getting a facet count of 4, which is correct. After 
> updating one document, we're getting 5, which is not correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)