[jira] [Comment Edited] (SOLR-11968) Multi-words query time synonyms

Jim Ferenczi (JIRA) Thu, 22 Feb 2018 14:19:04 -0800

    [ https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373554#comment-16373554 ]


Jim Ferenczi edited comment on SOLR-11968 at 2/22/18 10:17 PM:
---------------------------------------------------------------

bq. I think you're wrong, [~jim.ferenczi].

Well, it depends on how you see the problem. I agree that the gap could be 
inferred when we build the graph, and I have a patch that does that, but there 
are some cases where we just can't. For instance, take the following synonym rule:

`twd, the walking dead` creates a broken token stream if a stop word filter 
that removes "the" runs after the synonym filter:

|| ||twd||walking||dead||
|posinc|1|1|1|
|poslen|3|1|1|

The gap produced by "the" is not propagated to the posInc of "walking" because 
the stop word appears on a token with a posInc equal to 0, so "twd" still spans 
three positions while only two tokens remain on the other path. There are other 
cases where it is not possible to "fix" the graph produced by the token stream, 
which is why I said that a stop filter that removes gaps is IMO the best 
solution.
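
To make the table above concrete, here is a minimal Python sketch of the situation. It is a simplified model, not Lucene code: the `Token` class, `stop_filter`, and `arcs` are hypothetical helpers, and the synonym filter output for the input "the walking dead" is assumed to be the layout implied by the comment ("twd" first, with "the" stacked on it at posInc 0).

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def stop_filter(tokens, stop_words):
    """Drop stop words and add their increments to the next kept token."""
    kept, skipped = [], 0
    for tok in tokens:
        if tok.term in stop_words:
            skipped += tok.pos_inc  # "the" contributes 0 in this example
            continue
        kept.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
        skipped = 0
    return kept


def arcs(tokens):
    """Return the (term, start, end) arcs implied by posInc and posLen."""
    pos, out = -1, []
    for tok in tokens:
        pos += tok.pos_inc
        out.append((tok.term, pos, pos + tok.pos_len))
    return out


# Assumed synonym filter output for the input "the walking dead".
synonym_output = [
    Token("twd", 1, 3),
    Token("the", 0, 1),
    Token("walking", 1, 1),
    Token("dead", 1, 1),
]

filtered = stop_filter(synonym_output, {"the"})
for term, start, end in arcs(filtered):
    print(f"{term}: {start} -> {end}")
```

In this model "twd" runs from position 0 to 3, "walking" from 1 to 2, and "dead" from 2 to 3. Nothing connects position 0 to position 1 any more, so the "walking dead" path cannot be reached from the start of the graph.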

bq. AFAICT Robert is suggesting a StopFilter *mode* that would *optionally* 
remove gaps. IOW its current behavior would remain (and be the default).

Yes, I know that it would be an optional mode, but at least it would make it 
possible to remove stop words inside a multi-word synonym.
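
As a rough illustration of what an optional gap-removing mode means, here is a small Python sketch on a plain (non-graph) token stream. The `preserve_gaps` flag and the `stop_filter` function are hypothetical names chosen for this example, not an existing StopFilter option, and the sketch does not try to adjust posLen on synonym side paths.

```python
def stop_filter(tokens, stop_words, preserve_gaps=True):
    """Filter (term, pos_inc) pairs.

    preserve_gaps=True models the default described in the thread:
    a removed token leaves a hole in the positions.
    preserve_gaps=False models the hypothetical optional mode in
    which removed tokens leave no hole.
    """
    kept, skipped = [], 0
    for term, pos_inc in tokens:
        if term in stop_words:
            if preserve_gaps:
                skipped += pos_inc
            continue
        kept.append((term, pos_inc + skipped))
        skipped = 0
    return kept


tokens = [("olympique", 1), ("de", 1), ("marseille", 1)]

with_gaps = stop_filter(tokens, {"de"})
without_gaps = stop_filter(tokens, {"de"}, preserve_gaps=False)

print(with_gaps)
print(without_gaps)
```

With gaps preserved, "marseille" gets a position increment of 2. With gaps removed, it directly follows "olympique" with an increment of 1.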


> Multi-words query time synonyms
> -------------------------------
>
>                 Key: SOLR-11968
>                 URL: https://issues.apache.org/jira/browse/SOLR-11968
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers, Schema and Analysis
>    Affects Versions: master (8.0), 6.6.2
>         Environment: Centos 7.x
>            Reporter: Dominique Béjean
>            Assignee: Steve Rowe
>            Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the 
> SynonymGraphFilterFactory filter, as explained in this article:
>  
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>   
>  My field type is:
> {code:xml}
> <fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ElisionFilterFactory" ignoreCase="true" 
>              articles="lang/contractions_fr.txt"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.ASCIIFoldingFilterFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" 
>              ignoreCase="true"/>
>        <filter class="solr.FrenchMinimalStemFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ElisionFilterFactory" ignoreCase="true" 
>              articles="lang/contractions_fr.txt"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
>              ignoreCase="true" expand="true"/>
>        <filter class="solr.ASCIIFoldingFilterFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" 
>              ignoreCase="true"/>
>        <filter class="solr.FrenchMinimalStemFilterFactory"/>
>      </analyzer>
>    </fieldType>{code}
>  
>  synonyms.txt contains the line:
> {code:java}
> om, olympique de marseille{code}
>  
>  stopwords.txt contains the word:
> {code:java}
> de{code}
>  
>  The order of the words in my query has an impact on the query generated by 
> edismax:
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
>  &sow=false
>  &qq=...{code}
> With "qq=om maillot" or "qq=olympique de marseille maillot", I can see the 
> synonym expansion. It works as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil 
> +name_text_gp:maillot) name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu 
> +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> With "qq=maillot om" or "qq=maillot olympique de marseille", I get the 
> same generated query for both:
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks as if the 
> synonym expansion is ignored, but the second one shows that it is not ignored 
> and that only the synonym term is used.
>   
>  When I test the analysis for the field type, the synonyms are correctly 
> expanded for all of these expressions:
> {code:java}
> om maillot  
>  maillot om
>  olympique de marseille maillot
>  maillot olympique de marseille{code}
> The resulting outputs always include the following terms (obviously not always 
> in the same order):
> {code:java}
> olympiqu om marseil maillot {code}
>  
>  So, I suspect an issue with the edismax query parser.


