Re: text_en_splitting with quotes not matching when there are 2 adjacent stopwords

Alessandro Benedetti Tue, 18 May 2021 02:59:22 -0700

Hi Drini,
from the analysis admin page you shared, the output looks incorrect to me.
I would investigate a bit further, reproduce it via tests, and check the
FlattenGraph Token Filter code!
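
As a rough illustration of why the differing positions matter, here is a
minimal Python sketch using the token positions from the quoted message below.
It is a simplified model of exact phrase matching (no slop), not Lucene's
actual implementation, and `phrase_matches` is a hypothetical helper name used
only for this example:

```python
def phrase_matches(index_tokens, query_tokens):
    """Simplified exact-phrase check: every query term must occur in the
    index at the same position offset (relative to the first query term)."""
    # Map each indexed term to the set of positions where it occurs.
    index_positions = {}
    for term, pos in index_tokens:
        index_positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    # Try every indexed occurrence of the first query term as an anchor.
    for start in index_positions.get(first_term, ()):
        if all(
            start + (pos - first_pos) in index_positions.get(term, ())
            for term, pos in query_tokens[1:]
        ):
            return True
    return False


# (term, position) pairs as reported by the analysis tool in this thread.
index_tokens = [("mark", 2), ("crown", 4)]
query_tokens = [("mark", 2), ("crown", 5)]

print(phrase_matches(index_tokens, query_tokens))  # False: gap of 3 vs. indexed gap of 2
print(phrase_matches(index_tokens, index_tokens))  # True: same positions on both sides
```

Under this model the phrase only matches when both analyzers agree on the
distance between "mark" and "crown", which would explain why the quoted query
fails while the unquoted one succeeds.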


Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 17 May 2021 at 21:39, Drini Cami <[email protected]> wrote:

> Hi Alessandro,
>
> Which code do you recommend I look into? Solr's
> FlattenGraphFilterFactory, or a setting in my Solr schema?
>
> This is the final result of the default index analyzer and query analyzer
> for "the mark of the crown" with position data:
>
> Index: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*, ... },
>   { text: "crown", start: 16, end: 21, positionLength: 1, *position: 4*, ... }]`
> Query: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*, ... },
>   { text: "crown", start: 16, end: 21, positionLength: 1, *position: 5*, ... }]`
>
> Here's a screenshot of the full verbose analyzer tool output:
>
> https://user-images.githubusercontent.com/6251786/118551202-a4b90e00-b72b-11eb-92f1-4d4b13828d83.png
>
> The only difference is that crown is `position: 5` in the Query Analyzer,
> whereas in the Index Analyzer it was set to `position: 4` after passing
> through the FlattenGraphFilter. Do you think this might in fact be a
> potential bug in the FlattenGraphFilter? Or does this look like expected
> behaviour?
>
> Thank you,
> Drini
>
> On 2021/05/11 11:38:57, Alessandro Benedetti <[email protected]> wrote:
> > Hi Drini,
> > I would recommend investigating the code a bit. That token filter is meant
> > to flatten multiple terms at the same position, to keep things simple, so it
> > seems suspicious that it would merge two adjacent tokens and generate
> > incorrect positions.
> > Have you checked the positionLength and position attributes of the tokens
> > generated?
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> >
> > On Thu, 6 May 2021 at 19:54, Drini Cami <[email protected]> wrote:
> >
> > > Hello! I have a question about the text_en_splitting fieldType (solr 8.8.2,
> > > very vanilla schema). I noticed that it was failing for queries like:
> > > `title:"The Mark of the Crown"`, but succeeding for queries like
> > > `title:The Mark of the Crown`. Using the solr analysis tool, I noticed that
> > > the index analyzer converts "The Mark of the Crown" to `[_, mark, _, crown]`,
> > > but the query analyzer converts it to `[_, mark, _, _, crown]`. I then
> > > noticed the index analyzer has as a final filter FlattenGraphFilterFactory,
> > > which seems to combine adjacent `_`. I tried also adding
> > > FlattenGraphFilterFactory to the query analyzer and that fixed the issue. Is
> > > this a reasonable solution? If so, should that be the default? Or am I using
> > > the wrong fieldType altogether?
> > >
> > > Thank you,
> > >
> > > Drini
> > >
> >
>
