RE: Solr limit in words search - take 2

Scott Wed, 17 Nov 2021 10:52:16 -0800

Ok, I'll add <filter class="solr.FlattenGraphFilterFactory"/> in the indexer 
and see what happens.
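
A minimal sketch of that change, assuming the filter goes directly after 
WordDelimiterGraphFilterFactory in the index analyzer (the Reference Guide 
says to "follow it with" a Flatten Graph Filter). The placement is an 
assumption for this chain, and the remaining filters stay as they were:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1"
          splitOnNumerics="0"/>
  <!-- Added: flattens the graph token stream before it reaches the indexer -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <!-- ... existing LowerCase, EnglishPossessive, KeywordMarker,
       EnglishMinimalStem and EdgeNGram filters unchanged ... -->
</analyzer>
```

Documents indexed before the change would need to be reindexed for it to 
take effect.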


It's so weird that it works, even in this state, when the docs say: "This 
filter _must_ be included".

I would have expected the indexer to throw errors if this filter is really 
required...

Thanks!

-----Original Message-----
From: Michael Gibney <mich...@michaelgibney.net> 
Sent: Wednesday, November 17, 2021 1:15 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

Right, sorry I forgot to mention the absence of FlattenGraphFilter. Tbh I'm not 
100% clear on what cases it helps out with; but at the end of the day it has no 
effect on underlying issues having to do with the fact that if your index-time 
analysis chain produces "graph" tokenstreams, the Lucene 
`[Default]IndexingChain` completely disregards the PositionLengthAttribute, 
which is necessary to properly reconstruct the indexed graph at query time.

It's possible FlattenGraphFilter might help your case -- in fact if you do 
nothing else I'd certainly suggest that you use it. But I'm certain that there 
are some classes of problems that are fundamentally related to LUCENE-4312, and 
FlattenGraphFilter can't fix them. I'll be curious to know whether the addition 
of FlattenGraphFilter helps in your case, though!

Michael

On Wed, Nov 17, 2021 at 12:57 PM Scott <qm...@top-consulting.net> wrote:

> Could this be related?
>
>
> https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter
>
> "If you use this filter during indexing, you must follow it with a 
> Flatten Graph Filter to squash tokens on top of one another like the 
> Word Delimiter Filter, because the indexer can’t directly consume a 
> graph. To get fully correct positional queries when tokens are split, 
> you should instead use this filter at query time."
>
>
>
> -----Original Message-----
> From: Michael Gibney <mich...@michaelgibney.net>
> Sent: Wednesday, November 17, 2021 12:07 PM
> To: users@solr.apache.org
> Subject: Re: Solr limit in words search - take 2
>
> This is not the most thorough answer, but hopefully gets you headed in 
> the right direction:
>
> Very strange things can happen when your index-time analysis chain 
> generates "graph" token-streams (as yours does). A couple of things 
> you could try:
> 1. experiment with setting `enableGraphQueries=false` on the fieldtype
> 2. upgrading to solr >=8.1 may address your issue partially, via
>    LUCENE-8730 -- here I go out on a limb in guessing that you're not
>    _already_ on 8.1+ :-)
> 3. increase the phrase slop param, to be more lenient in matching
>    "phrases". (as I say this I'm not sure it would actually help your
>    case, because you're dealing with explicit phrases, and iirc phrase
>    slop may only configure _implicit_ ("pf") phrase searches?)
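>
> For option 1, a sketch of where the attribute would sit, assuming the
> fieldType from the original message and leaving both analyzers as they
> are:
>
> ```xml
> <fieldType name="partial_text_general" class="solr.TextField"
>            positionIncrementGap="100" multiValued="true"
>            enableGraphQueries="false">
>   <!-- index and query analyzers unchanged -->
> </fieldType>
> ```
>
> This only changes how the query side is built, so it can be tried
> without reindexing.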
>
> The _best_ approach would be to configure your index-time analysis
> chain(s) so that they don't have multi-term "expand" synonyms, and 
> WDGF either only splits ("generate*Parts", etc.) or only catenates 
> ("catenate*", "preserveOriginal"). One approach that can work is to 
> index into two fields, each with a dedicated index-time analysis type (split 
> or catenate).
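>
> A minimal sketch of the two-field idea. The type names (text_split,
> text_catenate) and field names (subject_split, subject_cat) are
> hypothetical, not from this thread, and the other filters from the
> original chain plus the query-side analyzers are left out for brevity:
>
> ```xml
> <!-- Hypothetical type: WDGF only splits at index time -->
> <fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>             preserveOriginal="0"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <!-- Hypothetical type: WDGF only catenates at index time -->
> <fieldType name="text_catenate" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             catenateWords="1" catenateNumbers="1"
>             preserveOriginal="1"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <!-- Hypothetical fields, both populated from the existing subject field -->
> <field name="subject_split" type="text_split" indexed="true" stored="false"/>
> <field name="subject_cat" type="text_catenate" indexed="true" stored="false"/>
> <copyField source="subject" dest="subject_split"/>
> <copyField source="subject" dest="subject_cat"/>
> ```
>
> Queries would then need to target both fields, for example through
> edismax `qf`.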
>
> Some relevant issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> Michael
>
> On Wed, Nov 17, 2021 at 11:18 AM Scott <qm...@top-consulting.net> wrote:
>
> > My apologies for the previous e-mail…should have never sent that as 
> > html
> >
> > I am facing a weird issue, possibly caused by my config.
> >
> > I have indexed a document with a field called subject, which is 
> > defined as:
> >
> > <field name="subject" type="partial_text_general"/>
> >
> >   <fieldType name="partial_text_general" class="solr.TextField"
> > positionIncrementGap="100" multiValued="true">
> >         <analyzer type="index">
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> > splitOnNumerics="0"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.EnglishPossessiveFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt"/>
> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >                 <filter class="solr.EdgeNGramFilterFactory"
> > minGramSize="2" maxGramSize="45" />
> >         </analyzer>
> >         <analyzer type="query">
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.SynonymFilterFactory"
> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.EnglishPossessiveFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt"/>
> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >         </analyzer>
> >   </fieldType>
> >
> > I have a document with subject field: <str>cobrancas E-mail 
> > marketing em dezembro, 2020 - referente ao uso de novembro</str>
> >
> > If I search for <str name="q">subject:"cobrancas e-mail"</str> then 
> > it finds the document, but if I search for <str 
> > name="q">subject:"cobrancas e-mail marketing"</str> I have no match.
> >
> > Why would this happen?
> >
> > Thank you!
> >
> >
> >
>
>
