Re: text_en_splitting with quotes not matching when there are 2 adjacent stopwords

Drini Cami Mon, 17 May 2021 13:39:29 -0700

Hi Alessandro,

Which code do you recommend I look into: Solr's
FlattenGraphFilterFactory, or a setting in my Solr schema?


This is the final result of the default index analyzer and query analyzer
for "the mark of the crown" with position data:

Index: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*,
... }, { text: "crown", start: 16, end: 21, positionLength: 1, *position: 4*,
... }]`
Query: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*,
... }, { text: "crown", start: 16, end: 21, positionLength: 1, *position: 5*,
... }]`

Here's a screenshot of the full verbose analyzer tool output:
https://user-images.githubusercontent.com/6251786/118551202-a4b90e00-b72b-11eb-92f1-4d4b13828d83.png

The only difference is that "crown" has `position: 5` in the query
analyzer, whereas in the index analyzer it was set to `position: 4` after
passing through the FlattenGraphFilter. Do you think this might in fact be
a bug in the FlattenGraphFilter, or does this look like expected
behaviour?
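
To illustrate why I think the position mismatch would explain the failing
quoted query, here is a minimal Python sketch. It is a simplified model of
an exact (slop 0) phrase match that compares relative token positions; it is
not Lucene's actual implementation, and the function names are hypothetical.
The token data is copied from the analyzer output above.

```python
def relative_offsets(tokens):
    """Return (text, offset from the first token's position) pairs."""
    base = tokens[0]["position"]
    return [(t["text"], t["position"] - base) for t in tokens]


def exact_phrase_matches(index_tokens, query_tokens):
    """Simplified slop-0 phrase check (an assumption, not Lucene code):
    every query term must occur in the index at the same relative offset."""
    indexed = {(t["text"], t["position"]) for t in index_tokens}
    offsets = relative_offsets(query_tokens)
    first_text = offsets[0][0]
    for t in index_tokens:
        if t["text"] != first_text:
            continue
        # Anchor the phrase at this occurrence of the first query term.
        if all((text, t["position"] + off) in indexed for text, off in offsets):
            return True
    return False


# Positions taken from the analysis output for "the mark of the crown".
index_tokens = [{"text": "mark", "position": 2}, {"text": "crown", "position": 4}]
query_tokens = [{"text": "mark", "position": 2}, {"text": "crown", "position": 5}]

print(exact_phrase_matches(index_tokens, query_tokens))  # False: gap of 3 vs. 2
print(exact_phrase_matches(index_tokens, index_tokens))  # True: gaps agree
```

Under this model the query expects "crown" three positions after "mark",
while the index has it two positions after, so the phrase cannot match
unless both analyzers produce the same gaps.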

Thank you,
Drini

On 2021/05/11 11:38:57, Alessandro Benedetti <a...@sease.io> wrote:
> Hi Drini,
> I would recommend investigating the code a bit, that token filter is meant
> to flat multiple terms at the same position to make it super simple so It
> seems suspicious that merging two adjacent tokens putting generated
> incorrect positions is what happens.
> Have you checked the positionLength, position attributes of the tokens
> generated?
>
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Thu, 6 May 2021 at 19:54, Drini Cami <cd...@gmail.com> wrote:
>
> > Hello! I have a question about the text_en_splitting fieldType (solr 8.8.2,
> > very vanilla schema). I noticed that it was failing for queries like:
> > `title:"The Mark of the Crown"`, but succeeding for queries like
> > `title:The Mark of the Crown`. Using the solr analysis tool, I noticed that
> > the index analyzer converts "The Mark of the Crown" to `[_, mark, _, crown]`,
> > but the query analyzer converts it to `[_, mark, _, _, crown]`. I then
> > noticed the index analyzer has as a final filter FlattenGraphFilterFactory,
> > which seems to combine adjacent `_`. I tried also adding
> > FlattenGraphFilterFactory to the query analyzer and that fixed the issue. Is
> > this a reasonable solution? If so, should that be the default? Or am I using
> > the wrong fieldType altogether?
> >
> > Thank you,
> >
> > Drini
