[jira] [Comment Edited] (SOLR-15407) eDismax sow=false doesn't work with string field types

Alessandro Benedetti (Jira) Wed, 19 May 2021 01:30:04 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347410#comment-17347410
 ]


Alessandro Benedetti edited comment on SOLR-15407 at 5/19/21, 8:29 AM:
-----------------------------------------------------------------------

Hi David, first of all, thanks for taking the time to think about this; it is 
much appreciated.
Regarding: 

{quote}sow=false implies the minimum should match is "per field"{quote}
I originally thought the same as you (i.e. that sow does not affect mm, and 
that mm is always "per document").
Then I spent some time investigating in order to write a dedicated, in-depth 
blog post (coming out in the next few days) and I verified that in 8.8.2 this 
is currently not the case.
I don't know whether this is intentional or not, but if you have a multi-field 
search with different analysis per field, this is what you get (I am posting 
here an excerpt of the upcoming blog post):

In the following examples, one field has synonyms and the other is just 
whitespace tokenized.
When the parsed query moves from being term-centric (sow=true) to 
field-centric (sow=false and different text analysis per field), mm means two 
different things.
With sow=true, mm is the minimum number of query terms matched, regardless of 
which field they match in (PER DOCUMENT):


{code:java}
sow = true
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax
"parsedquery_toString":
"+(((author:united | subjects_as_same_term:united) (author:kingdom | 
subjects_as_same_term:kingdom))~2)"
{code}
{code:java}
"response":{"numFound":2,"start":0,"maxScore":7.757958,"numFoundExact":true,"docs":[
      {
        "id":"888888",
        "author":"united",
        "subjects":["kingdom"],
        "score":7.757958},
      {
        "id":"77777",
        "author":"united kingdom",
        "score":5.874222}]
  },
{code}

With sow=false, mm is the minimum number of query terms matched within the 
same field (i.e. all the required query terms must be found in one of the 
fields), "PER FIELD":


{code:java}
sow = false
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax
"parsedquery_toString":
"+(((author:united author:kingdom)~2) | 
(((subjects_as_same_term:uk subjects_as_same_term:"united kingdom" 
subjects_as_same_term:england subjects_as_same_term:london 
subjects_as_same_term:british subjects_as_same_term:britain))~1))"
{code}

Here (author:united author:kingdom)~2 means that both clauses must match for 
the document to be a candidate. It is in disjunction with
(subjects_as_same_term:uk subjects_as_same_term:"united kingdom" 
subjects_as_same_term:england subjects_as_same_term:london 
subjects_as_same_term:british subjects_as_same_term:britain)~1, which means 
that at least one clause must match (because synonym expansion turned the two 
original terms into a single one).


{code:java}
"response":{"numFound":1,"start":0,"maxScore":5.874222,"numFoundExact":true,"docs":[
      {
        "id":"77777",
        "author":"united kingdom",
        "score":5.874222}]
  }
{code}
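
To make the difference between the two query shapes concrete, here is a minimal sketch in plain Java. It is not Solr or Lucene code: the class MmSketch and its helpers are hypothetical names, term and phrase matching is approximated with a substring check on whitespace-delimited text, and I am assuming that subjects_as_same_term for doc 888888 contains just "kingdom". It only mimics how the ~N counts are applied in the two parsed queries above.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: mimics the two parsed-query shapes, not Solr internals.
final class MmSketch {

    // The two documents from the examples (field contents partly assumed).
    static final Map<String, String> DOC_888888 =
        Map.of("id", "888888", "author", "united", "subjects_as_same_term", "kingdom");
    static final Map<String, String> DOC_77777 =
        Map.of("id", "77777", "author", "united kingdom");

    static final List<String> FIELDS = List.of("author", "subjects_as_same_term");
    static final List<String> TERMS = List.of("united", "kingdom");

    // Clauses and ~N per field, copied from the sow=false parsed query.
    static final Map<String, List<String>> CLAUSES_BY_FIELD = Map.of(
        "author", List.of("united", "kingdom"),
        "subjects_as_same_term",
        List.of("uk", "united kingdom", "england", "london", "british", "britain"));
    static final Map<String, Integer> MM_BY_FIELD =
        Map.of("author", 2, "subjects_as_same_term", 1);

    // True when the whitespace-delimited field text contains the term or phrase.
    static boolean clauseMatches(String fieldValue, String clause) {
        if (fieldValue == null) {
            return false;
        }
        return (" " + fieldValue + " ").contains(" " + clause + " ");
    }

    // sow=true shape: one (f1:t | f2:t) group per term, mm counted across groups.
    static boolean matchesPerDocument(Map<String, String> doc, List<String> fields,
                                      List<String> terms, int mm) {
        int matched = 0;
        for (String term : terms) {
            for (String field : fields) {
                if (clauseMatches(doc.get(field), term)) {
                    matched++;
                    break; // a term counts once, whichever field it matched in
                }
            }
        }
        return matched >= mm;
    }

    // sow=false shape: one (clauses)~N group per field, groups joined by '|'.
    static boolean matchesPerField(Map<String, String> doc,
                                   Map<String, List<String>> clausesByField,
                                   Map<String, Integer> mmByField) {
        for (Map.Entry<String, List<String>> entry : clausesByField.entrySet()) {
            String value = doc.get(entry.getKey());
            long matched = entry.getValue().stream()
                .filter(clause -> clauseMatches(value, clause))
                .count();
            if (matched >= mmByField.get(entry.getKey())) {
                return true; // one field satisfying its own ~N is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (Map<String, String> doc : List.of(DOC_888888, DOC_77777)) {
            System.out.println(doc.get("id")
                + " perDocument=" + matchesPerDocument(doc, FIELDS, TERMS, 2)
                + " perField=" + matchesPerField(doc, CLAUSES_BY_FIELD, MM_BY_FIELD));
        }
    }
}
```

Under these assumptions, doc 888888 passes the per-document check (one term in author, the other in subjects_as_same_term) but fails the per-field one, while doc 77777 passes both. That is consistent with numFound going from 2 to 1 in the responses above.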




> eDismax sow=false doesn't work with string field types
> ------------------------------------------------------
>
>                 Key: SOLR-15407
>                 URL: https://issues.apache.org/jira/browse/SOLR-15407
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>    Affects Versions: 8.8.2
>            Reporter: Alessandro Benedetti
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, sow=false should not tokenize the user's query text, and should 
> delegate query-time text analysis to each field.
> But what happens if one of the fields involved is not analyzed, 
> for example because it is a string field type?
> The terms are split and the generated query is broken:
> {code:java}
>     assertU(adoc("id", "75", "trait_ss", "multi term"));
>
>     public void testSplitOnWhitespace_stringField_shouldBuildSingleClause() throws Exception {
>         assertJQ(req("qf", "trait_ss", "defType", "edismax", "q", "multi term", "sow", "false"),
>             "/response/numFound==1", "/response/docs/[0]/id=='75'");
>         String parsedquery = getParsedQuery(
>             req("qf", "trait_ss", "q", "multi term", "defType", "edismax", "sow", "false", "debugQuery", "true"));
>         assertThat(parsedquery, anyOf(containsString("((trait_ss:multi term))")));
>     }
> {code}
> This test currently fails.
> The parsed query currently generated is, incorrectly:
> (trait_ss:multi trait_ss:term)



