[ https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370883#comment-16370883 ]
Steve Rowe commented on SOLR-11968:
-----------------------------------
bq. I think the root cause is LUCENE-4065. I'll try to make a simple test
demonstrating this.
Not so - LUCENE-4065 should probably be closed as won't-fix (I'll comment there
in a sec).
Instead, this looks like the problem described in LUCENE-7848. I tracked the
problem down to a bug in Lucene's QueryBuilder, which is dropping tokens in
side paths with position gaps that are caused by StopFilter.
Below is a test that shows the problem - MockSynonymFilter has synonym "cavy"
for "guinea pig", and the anonymous analyzer below has "pig" on its
stopfilter's stoplist. QueryBuilder produces a query for only "cavy", even
though the token stream also contains "guinea".
{code:java|title=TestQueryBuilder.java}
public void testGraphStop() {
  // Expected: the original term "guinea" and its synonym "cavy" as SHOULD clauses.
  Query syn1 = new TermQuery(new Term("field", "guinea"));
  Query syn2 = new TermQuery(new Term("field", "cavy"));
  BooleanQuery synQuery = new BooleanQuery.Builder()
      .add(syn1, BooleanClause.Occur.SHOULD)
      .add(syn2, BooleanClause.Occur.SHOULD)
      .build();
  BooleanQuery expectedGraphQuery = new BooleanQuery.Builder()
      .add(synQuery, BooleanClause.Occur.SHOULD)
      .build();
  // MockSynonymFilter adds "cavy" for "guinea pig"; StopFilter then removes "pig".
  QueryBuilder queryBuilder = new QueryBuilder(new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      MockTokenizer tokenizer = new MockTokenizer();
      TokenStream stream = new MockSynonymFilter(tokenizer);
      stream = new StopFilter(stream,
          CharArraySet.copy(Collections.singleton("pig")));
      return new TokenStreamComponents(tokenizer, stream);
    }
  });
  queryBuilder.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
  // Fails at present: the generated query contains only "cavy".
  assertEquals(expectedGraphQuery, queryBuilder.createBooleanQuery("field",
      "guinea pig", BooleanClause.Occur.SHOULD));
}
{code}
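To illustrate how a position gap can make a side path disappear, here is a minimal, self-contained sketch. It is a simplified model and not Lucene code: the class TokenGraphSketch, its Token type, and the token order and attributes assumed for "guinea pig" are all hypothetical. It treats each token as an arc from its start position to start + position length, removes the stop word, and then lists only the term sequences that reach the final position. Under those assumptions, removing "pig" leaves "guinea" on a path that never reaches the end, so only "cavy" is found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Simplified, hypothetical model of a token graph; this is not Lucene code. */
class TokenGraphSketch {

  /** One token: its text, position increment, and position length. */
  static final class Token {
    final String term;
    final int posInc;
    final int posLen;

    Token(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** Assumed graph for "guinea pig", with the synonym "cavy" spanning both words. */
  static List<Token> guineaPigTokens() {
    return Arrays.asList(
        new Token("guinea", 1, 1),
        new Token("cavy", 0, 2),
        new Token("pig", 1, 1));
  }

  /** Drops stop words; a removed token's increment is carried to the next kept token. */
  static List<Token> stopFilter(List<Token> in, Set<String> stopWords) {
    List<Token> out = new ArrayList<>();
    int skipped = 0;
    for (Token t : in) {
      if (stopWords.contains(t.term)) {
        skipped += t.posInc;
      } else {
        out.add(new Token(t.term, t.posInc + skipped, t.posLen));
        skipped = 0;
      }
    }
    return out;
  }

  /** Lists the term sequences that lead from position 0 to the last end position. */
  static List<List<String>> completePaths(List<Token> tokens) {
    int n = tokens.size();
    int[] from = new int[n];
    int[] to = new int[n];
    int pos = -1;
    int end = 0;
    for (int i = 0; i < n; i++) {
      pos += tokens.get(i).posInc;
      from[i] = pos;
      to[i] = pos + tokens.get(i).posLen;
      end = Math.max(end, to[i]);
    }
    List<List<String>> paths = new ArrayList<>();
    walk(tokens, from, to, 0, end, new ArrayList<>(), paths);
    return paths;
  }

  /** Depth-first walk over the arcs; a path counts only if it reaches the end node. */
  private static void walk(List<Token> tokens, int[] from, int[] to, int node, int end,
      List<String> prefix, List<List<String>> paths) {
    if (node == end) {
      paths.add(new ArrayList<>(prefix));
      return;
    }
    for (int i = 0; i < tokens.size(); i++) {
      if (from[i] == node) {
        prefix.add(tokens.get(i).term);
        walk(tokens, from, to, to[i], end, prefix, paths);
        prefix.remove(prefix.size() - 1);
      }
    }
  }

  public static void main(String[] args) {
    List<Token> unfiltered = guineaPigTokens();
    List<Token> filtered = stopFilter(unfiltered, Collections.singleton("pig"));
    System.out.println(completePaths(unfiltered)); // [[guinea, pig], [cavy]]
    System.out.println(completePaths(filtered));   // [[cavy]]
  }
}
```

In this model the "guinea" arc ends at position 1, and once "pig" is gone nothing leaves position 1, so a builder that only follows complete paths would never see "guinea". Whether QueryBuilder loses the token in exactly this way is an assumption of the sketch.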
> Multi-words query time synonyms
> -------------------------------
>
> Key: SOLR-11968
> URL: https://issues.apache.org/jira/browse/SOLR-11968
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: query parsers, Schema and Analysis
> Affects Versions: master (8.0), 6.6.2
> Environment: Centos 7.x
> Reporter: Dominique Béjean
> Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the
> SynonymGraphFilterFactory filter, as explained in this article:
>
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>
> My field type is :
> {code:java}
> <fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>             articles="lang/contractions_fr.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             ignoreCase="true"/>
>     <filter class="solr.FrenchMinimalStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>             articles="lang/contractions_fr.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             ignoreCase="true"/>
>     <filter class="solr.FrenchMinimalStemFilterFactory"/>
>   </analyzer>
> </fieldType>{code}
>
> synonyms.txt contains the line :
> {code:java}
> om, olympique de marseille{code}
>
> stopwords.txt contains the word
> {code:java}
> de{code}
>
> The order of words in my query has an impact on the generated query in
> edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
> &sow=false
> &qq=...{code}
> with "qq=om maillot" or "qq=olympique de marseille maillot", I can see the
> synonyms expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil
> +name_text_gp:maillot) name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu
> +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> with "qq=maillot om" or "qq=maillot olympique de marseille", I can see the
> same generated query
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the
> synonym expansion is ignored, but the second one shows it is not ignored and
> only the synonym term is used.
>
> When I test the analysis for the field type, the synonyms are correctly
> expanded for all of the following expressions
> {code:java}
> om maillot
> maillot om
> olympique de marseille maillot
> maillot olympique de marseille{code}
> The resulting outputs always include the following terms (obviously not always
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>
> So, I suspect an issue with the edismax query parser.