Ok, I'll add <filter class="solr.FlattenGraphFilterFactory"/> in the indexer and see what happens.
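For reference, a rough sketch of what that change could look like, using the index analyzer from my original config quoted further down. Placing the filter directly after WordDelimiterGraphFilterFactory is an assumption based on the Ref Guide wording ("you must follow it with a Flatten Graph Filter"); everything else is unchanged.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1"
          splitOnNumerics="0"/>
  <!-- Assumed placement: immediately after the graph-producing filter -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="45"/>
</analyzer>
```

Existing documents would need to be reindexed before the change has any visible effect on search results.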
It's so weird that it works, even in this state, when the docs say: This filter _must_ be included. I would have expected the indexer to throw errors if this filter is really required...

Thanks!

-----Original Message-----
From: Michael Gibney <mich...@michaelgibney.net>
Sent: Wednesday, November 17, 2021 1:15 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

Right, sorry I forgot to mention the absence of FlattenGraphFilter. Tbh I'm not 100% clear on what cases it helps out with; but at the end of the day it has no effect on underlying issues having to do with the fact that if your index-time analysis chain produces "graph" tokenstreams, the Lucene `[Default]IndexingChain` completely disregards the PositionLengthAttribute, which is necessary to properly reconstruct the indexed graph at query time.

It's possible FlattenGraphFilter might help your case -- in fact if you do nothing else I'd certainly suggest that you use it. But I'm certain that there are some classes of problems that are fundamentally related to LUCENE-4312, and FlattenGraphFilter can't fix them.

I'll be curious to know whether the addition of FlattenGraphFilter helps in your case, though!

Michael

On Wed, Nov 17, 2021 at 12:57 PM Scott <qm...@top-consulting.net> wrote:

> Could this be related?
>
> https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter
>
> "If you use this filter during indexing, you must follow it with a
> Flatten Graph Filter to squash tokens on top of one another like the
> Word Delimiter Filter, because the indexer can’t directly consume a
> graph. To get fully correct positional queries when tokens are split,
> you should instead use this filter at query time."
>
> -----Original Message-----
> From: Michael Gibney <mich...@michaelgibney.net>
> Sent: Wednesday, November 17, 2021 12:07 PM
> To: users@solr.apache.org
> Subject: Re: Solr limit in words search - take 2
>
> This is not the most thorough answer, but hopefully gets you headed in
> the right direction:
>
> Very strange things can happen when your index-time analysis chain
> generates "graph" token-streams (as yours does). A couple of things
> you could try:
> 1. experiment with setting `enableGraphQueries=false` on the fieldtype
> 2. upgrading to solr >=8.1 may address your issue partially, via
>    LUCENE-8730 -- here I go out on a limb in guessing that you're not
>    _already_ on 8.1+ :-)
> 3. increase the phrase slop param, to be more lenient in matching
>    "phrases". (as I say this I'm not sure it would actually help your
>    case, because you're dealing with explicit phrases, and iirc phrase
>    slop may only configure _implicit_ ("pf") phrase searches?)
>
> The _best_ approach would be to configure your index-time analysis
> chain(s) so that they don't have multi-term "expand" synonyms, and
> WDGF either only splits ("generate*Parts", etc.) or only catenates
> ("catenate*", "preserveOriginal"). One approach that can work is to
> index into two fields, each with a dedicated index-time analysis type
> (split or catenate).
>
> Some relevant issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> Michael
>
> On Wed, Nov 17, 2021 at 11:18 AM Scott <qm...@top-consulting.net> wrote:
>
> > My apologies for the previous e-mail…should have never sent that as
> > html
> >
> > I am facing a weird issue, possibly caused by my config.
> >
> > I have indexed a document which has a field called subject, subject
> > is defined as:
> >
> > <field name="subject" type="partial_text_general"/>
> >
> > <fieldType name="partial_text_general" class="solr.TextField"
> >     positionIncrementGap="100" multiValued="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >         generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> >         catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> >         splitOnNumerics="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >         protected="protwords.txt"/>
> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> >         maxGramSize="45" />
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory"
> >         synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >         generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> >         catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >         protected="protwords.txt"/>
> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > I have a document with subject field: <str>cobrancas E-mail
> > marketing em dezembro, 2020 - referente ao uso de novembro</str>
> >
> > If I search for <str name="q">subject:"cobrancas e-mail"</str> then
> > it finds the document, but if I search for <str
> > name="q">subject:"cobrancas e-mail marketing"</str> I have no match.
> >
> > Why would this happen?
> >
> > Thank you!
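The two-field approach suggested in the thread (one analysis chain where WDGF only splits, another where it only catenates) might be sketched roughly as below. This is only an illustration under assumptions: the type names `text_wdgf_split` / `text_wdgf_cat` and the field names `subject_split` / `subject_cat` are hypothetical, the stemming, protected-words, synonym and n-gram filters from the original config are left out for brevity, and a single untyped analyzer is used for both index and query time.

```xml
<!-- Hypothetical type: WDGF only splits, so no catenated or original
     token is stacked alongside the parts -->
<fieldType name="text_wdgf_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnCaseChange="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: WDGF only catenates / preserves the original,
     so every input word stays at a single position -->
<fieldType name="text_wdgf_cat" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields, populated from the existing subject field -->
<field name="subject_split" type="text_wdgf_split" indexed="true" stored="false" multiValued="true"/>
<field name="subject_cat" type="text_wdgf_cat" indexed="true" stored="false" multiValued="true"/>
<copyField source="subject" dest="subject_split"/>
<copyField source="subject" dest="subject_cat"/>
```

Phrase queries would then be run against both fields (for example by listing both in `qf` with edismax), so that a phrase can match through either the split or the catenated form.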