RE: Solr limit in words search - take 2

Scott Wed, 17 Nov 2021 09:57:38 -0800

Could this be related?

https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter

"If you use this filter during indexing, you must follow it with a Flatten 
Graph Filter to squash tokens on top of one another like the Word Delimiter 
Filter, because the indexer can’t directly consume a graph. To get fully 
correct positional queries when tokens are split, you should instead use this 
filter at query time."
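
As a minimal sketch of what that passage describes, the index-time analyzer from the original message could have a Flatten Graph Filter placed directly after the Word Delimiter Graph Filter. Everything except the added `solr.FlattenGraphFilterFactory` line is copied from the config quoted below; whether this alone fixes the three-word phrase search is untested here, and the field would need to be reindexed after the change.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1"
          splitOnNumerics="0"/>
  <!-- Added: flattens the token graph, since the indexer cannot consume a graph -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="45"/>
</analyzer>
```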

-----Original Message-----
From: Michael Gibney <mich...@michaelgibney.net> 
Sent: Wednesday, November 17, 2021 12:07 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

This is not the most thorough answer, but hopefully gets you headed in the 
right direction:

Very strange things can happen when your index-time analysis chain generates 
"graph" token-streams (as yours does). A couple of things you could try:
1. experiment with setting `enableGraphQueries=false` on the fieldtype
2. upgrading to solr >=8.1 may address your issue partially, via
   LUCENE-8730 -- here I go out on a limb in guessing that you're not
   _already_ on 8.1+ :-)
3. increase the phrase slop param, to be more lenient in matching
   "phrases". (as I say this I'm not sure it would actually help your case,
   because you're dealing with explicit phrases, and iirc phrase slop may
   only configure _implicit_ ("pf") phrase searches?)
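
For the first suggestion, a rough sketch of where the attribute would go, assuming the fieldType from the original message; the analyzers themselves stay as they are and are only summarized in a comment here:

```xml
<fieldType name="partial_text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true"
           enableGraphQueries="false">
  <!-- index and query analyzers unchanged from the original message -->
</fieldType>
```

For the third suggestion, slop on an explicit phrase can be written inline in the standard query syntax, for example `subject:"cobrancas e-mail marketing"~2`. The value 2 is only an illustration, and it is not verified here that slop resolves this particular case.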

The _best_ approach would be to configure your index-time analysis chain(s) so 
that they don't have multi-term "expand" synonyms, and WDGF either only splits 
("generate*Parts", etc.) or only catenates ("catenate*", "preserveOriginal"). 
One approach that can work is to index into two fields, each with a dedicated 
index-time analysis type (split or catenate).
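
A sketch of the two-field idea, under several assumptions: the names `text_wdgf_split`, `text_wdgf_cat`, `subject_split` and `subject_cat` are hypothetical, the same analyzer is used at index and query time in each type, and the stemming, protected-words and edge n-gram filters from the original chain are left out to keep the example short. It has not been tested against the document in question.

```xml
<!-- Split-only: WDGF emits word/number parts, nothing catenated -->
<fieldType name="text_wdgf_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnCaseChange="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Catenate-only: WDGF emits joined forms and the original, no parts -->
<fieldType name="text_wdgf_cat" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields fed from the existing subject field -->
<field name="subject_split" type="text_wdgf_split" indexed="true" stored="false" multiValued="true"/>
<field name="subject_cat" type="text_wdgf_cat" indexed="true" stored="false" multiValued="true"/>
<copyField source="subject" dest="subject_split"/>
<copyField source="subject" dest="subject_cat"/>
```

A phrase search would then target both fields, e.g. `subject_split:"cobrancas e-mail marketing" OR subject_cat:"cobrancas e-mail marketing"`, after a full reindex.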

Some relevant issues:
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

Michael

On Wed, Nov 17, 2021 at 11:18 AM Scott <qm...@top-consulting.net> wrote:

> My apologies for the previous e-mail; I should never have sent that as
> HTML.
>
> I am facing a weird issue, possibly caused by my config.
>
> I have indexed a document which has a field called subject, subject is 
> defined as:
>
> <field name="subject" type="partial_text_general"/>
>
>   <fieldType name="partial_text_general" class="solr.TextField"
> positionIncrementGap="100" multiValued="true">
>         <analyzer type="index">
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.WordDelimiterGraphFilterFactory"
> generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> splitOnNumerics="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPossessiveFilterFactory"/>
>                 <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
>                 <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> maxGramSize="45" />
>         </analyzer>
>         <analyzer type="query">
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.WordDelimiterGraphFilterFactory"
> generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPossessiveFilterFactory"/>
>                 <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
>         </analyzer>
>   </fieldType>
>
> I have a document with subject field: <str>cobrancas E-mail marketing 
> em dezembro, 2020 - referente ao uso de novembro</str>
>
> If I search for <str name="q">subject:"cobrancas e-mail"</str> then it 
> finds the document, but if I search for <str 
> name="q">subject:"cobrancas e-mail marketing"</str> I have no match.
>
> Why would this happen?
>
> Thank you!
>
>
>
