[
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373554#comment-16373554
]
Jim Ferenczi edited comment on SOLR-11968 at 2/22/18 10:17 PM:
---------------------------------------------------------------
bq. I think you're wrong, [~jim.ferenczi].
Well, it depends on how you see the problem. I agree that the gap could be
inferred when we build the graph, and I have a patch that does that, but there
are some cases where we just can't. For instance, the synonym rule
`twd, the walking dead` creates a broken token stream if you add a stop word
filter that removes "the" after the synonym filter:
|| ||twd||walking||dead||
|posinc|1|1|1|
|poslen|3|1|1|
The gap produced by "the" is not propagated to the posInc of "walking" because
the stop word appears on a token with a posInc equal to 0. There are other
cases where it is not possible to "fix" the graph produced by the token stream,
which is why I said that a stop filter that removes gaps is IMO the best
solution.
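
To make the broken stream concrete, here is a toy Python model (not Lucene code). It assumes the synonym filter emits "twd" first with posLen 3 and "the" stacked on it with posInc 0, as described above, and that the stop filter only carries a removed token's own posInc over to the next kept token. The names `Token`, `stop_filter`, `to_arcs` and `dangling_arcs` are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def stop_filter(tokens, stop_words):
    """Drop stop words; add each dropped token's pos_inc to the next kept token."""
    skipped = 0
    kept = []
    for tok in tokens:
        if tok.term in stop_words:
            skipped += tok.pos_inc
            continue
        kept.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
        skipped = 0
    return kept


def to_arcs(tokens):
    """Turn the stream into (from_node, to_node, term) arcs of a token graph."""
    pos = -1
    arcs = []
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs


def dangling_arcs(tokens):
    """Return arcs whose start node cannot be reached from node 0."""
    arcs = to_arcs(tokens)
    reachable = {0}
    changed = True
    while changed:
        changed = False
        for start, end, _ in arcs:
            if start in reachable and end not in reachable:
                reachable.add(end)
                changed = True
    return [arc for arc in arcs if arc[0] not in reachable]


# Assumed synonym filter output for the input "twd".
synonym_output = [
    Token("twd", 1, 3),
    Token("the", 0, 1),
    Token("walking", 1, 1),
    Token("dead", 1, 1),
]

filtered = stop_filter(synonym_output, {"the"})
print(filtered)
print(dangling_arcs(synonym_output))
print(dangling_arcs(filtered))
```

In this model the filtered stream matches the table above (posInc 1/1/1, posLen 3/1/1), and "walking" and "dead" end up on arcs that no longer connect to the start of the graph, because the removed "the" contributed a posInc of 0.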
bq. AFAICT Robert is suggesting a StopFilter *mode* that would *optionally*
remove gaps. IOW its current behavior would remain (and be the default).
Yes, I know that it would be an optional mode, but at least it would make it
possible to remove stop words inside multi-word synonyms.
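
As a rough illustration only: the exact semantics of such a mode are not specified in this thread, so the following Python toy shows one possible reading. It assumes the filter sees the whole graph, removes every single-position stop word, and closes the position it occupied by shifting later nodes left. `Token`, `to_arcs` and `gap_removing_stop_filter` are hypothetical names, and nothing here is Lucene API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def to_arcs(tokens):
    """Turn the stream into (from_node, to_node, term) arcs of a token graph."""
    pos = -1
    arcs = []
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs


def gap_removing_stop_filter(tokens, stop_words):
    """Remove stop words and close the positions they occupied (toy model)."""
    arcs = to_arcs(tokens)
    # Start nodes of removed stop words that span exactly one position.
    # Longer stop tokens are dropped without closing anything.
    closed = {s for s, e, term in arcs if term in stop_words and e - s == 1}

    def shift(node):
        # Each closed position before this node moves it one step left.
        return node - sum(1 for p in closed if p < node)

    out = []
    prev_start = -1
    for start, end, term in arcs:
        if term in stop_words:
            continue
        new_start, new_end = shift(start), shift(end)
        if new_end == new_start:
            # A kept token sat in parallel with a stop word; this simple
            # contraction cannot handle that case.
            raise ValueError(f"token {term!r} would collapse to zero length")
        out.append(Token(term, new_start - prev_start, new_end - new_start))
        prev_start = new_start
    return out


# Assumed synonym filter output for the input "twd".
synonym_output = [
    Token("twd", 1, 3),
    Token("the", 0, 1),
    Token("walking", 1, 1),
    Token("dead", 1, 1),
]

print(gap_removing_stop_filter(synonym_output, {"the"}))
```

Under these assumptions "twd" shrinks to posLen 2 and "walking" is emitted with posInc 0, so both paths again start and end on the same nodes. The guard shows one limitation of this reading: a kept token that lies in parallel with a removed stop word cannot be handled by simple contraction.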
> Multi-words query time synonyms
> -------------------------------
>
> Key: SOLR-11968
> URL: https://issues.apache.org/jira/browse/SOLR-11968
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: query parsers, Schema and Analysis
> Affects Versions: master (8.0), 6.6.2
> Environment: Centos 7.x
> Reporter: Dominique Béjean
> Assignee: Steve Rowe
> Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the
> SynonymGraphFilterFactory filter, as explained in this article
>
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>
> My field type is :
> {code:java}
> <fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.ElisionFilterFactory" ignoreCase="true"
> articles="lang/contractions_fr.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.ASCIIFoldingFilterFactory"/>
> <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true"/>
> <filter class="solr.FrenchMinimalStemFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.ElisionFilterFactory" ignoreCase="true"
> articles="lang/contractions_fr.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.ASCIIFoldingFilterFactory"/>
> <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true"/>
> <filter class="solr.FrenchMinimalStemFilterFactory"/>
> </analyzer>
> </fieldType>{code}
>
> synonyms.txt contains the line :
> {code:java}
> om, olympique de marseille{code}
>
> stopwords.txt contains the word
> {code:java}
> de{code}
>
> The order of words in my query has an impact on the generated query in
> edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
> &sow=false
> &qq=...{code}
> with "qq=om maillot" or "qq=olympique de marseille maillot", I can see the
> synonyms expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil
> +name_text_gp:maillot) name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu
> +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> with "qq=maillot om" or "qq=maillot olympique de marseille", I can see the
> same generated query
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the
> synonym expansion is ignored, but the second one shows it is not ignored and
> only the synonym term is used.
>
> When I test the analysis for the field type the synonyms are correctly
> expanded for both expressions
> {code:java}
> om maillot
> maillot om
> olympique de marseille maillot
> maillot olympique de marseille{code}
> resulting outputs always include the following terms (obviously not always
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>
> So, I suspect an issue with the edismax query parser.