Hi Drini, from the analysis admin page you shared, the result looks incorrect to me. I would investigate a bit further, reproduce it via tests, and check the FlattenGraph token filter code!
Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io

On Mon, 17 May 2021 at 21:39, Drini Cami <[email protected]> wrote:

> Hi Alessandro,
>
> Which code do you recommend I look into? Solr's
> FlattenGraphFilterFactory, or a setting in my Solr schema?
>
> This is the final result of the default index analyzer and query analyzer
> for "the mark of the crown" with position data:
>
> Index: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*, ... },
> { text: "crown", start: 16, end: 21, positionLength: 1, *position: 4*, ... }]`
> Query: `[{ text: "mark", start: 4, end: 8, positionLength: 1, *position: 2*, ... },
> { text: "crown", start: 16, end: 21, positionLength: 1, *position: 5*, ... }]`
>
> Here's a screenshot of the full verbose analyzer tool output:
>
> https://user-images.githubusercontent.com/6251786/118551202-a4b90e00-b72b-11eb-92f1-4d4b13828d83.png
>
> The only difference is that crown is `position: 5` in the Query Analyzer.
> And in the Index Analyzer, it was set to `position: 4` after passing
> through the FlattenGraphFilter. Do you think this might then in fact be a
> potential bug with the FlattenGraphFilter? Or does this look like expected
> behaviour?
>
> Thank you,
> Drini
>
> On 2021/05/11 11:38:57, Alessandro Benedetti <[email protected]> wrote:
> > Hi Drini,
> > I would recommend investigating the code a bit. That token filter is
> > meant to flatten multiple terms at the same position, to put it very
> > simply, so it seems suspicious that it would merge two adjacent tokens
> > and generate incorrect positions.
> > Have you checked the positionLength and position attributes of the
> > generated tokens?
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> > On Thu, 6 May 2021 at 19:54, Drini Cami <[email protected]> wrote:
> >
> > > Hello! I have a question about the text_en_splitting fieldType (Solr
> > > 8.8.2, very vanilla schema). I noticed that it was failing for queries
> > > like `title:"The Mark of the Crown"`, but succeeding for queries like
> > > `title:The Mark of the Crown`. Using the Solr analysis tool, I noticed
> > > that the index analyzer converts "The Mark of the Crown" to
> > > `[_, mark, _, crown]`, but the query analyzer converts it to
> > > `[_, mark, _, _, crown]`. I then noticed the index analyzer has as a
> > > final filter FlattenGraphFilterFactory, which seems to combine adjacent
> > > `_`. I tried also adding FlattenGraphFilterFactory to the query
> > > analyzer and that fixed the issue. Is this a reasonable solution? If
> > > so, should that be the default? Or am I using the wrong fieldType
> > > altogether?
> > >
> > > Thank you,
> > >
> > > Drini
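
For readers following the thread, here is a rough sketch of why the position mismatch breaks the quoted phrase query. It is a toy model in Python, not Lucene's phrase-matching code: the function name `phrase_matches` and the `(term, position)` tuple layout are made up for illustration. It assumes that an exact phrase (slop 0) needs each query term to sit at the same relative position in the index as in the query, and it uses the positions reported by the analysis tool above.

```python
def phrase_matches(indexed, query):
    """Toy exact-phrase check (slop 0) over (term, position) pairs.

    Hypothetical helper for illustration only; not how Lucene implements it.
    """
    # Map each indexed term to the set of positions where it occurs.
    index_positions = {}
    for term, pos in indexed:
        index_positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query[0]
    # Try every occurrence of the first query term as the phrase start.
    for start in index_positions.get(first_term, ()):
        # Each later term must sit at the same offset from the start
        # in the index as it does in the query.
        if all(
            start + (pos - first_pos) in index_positions.get(term, ())
            for term, pos in query[1:]
        ):
            return True
    return False


# Positions as reported in the thread for "the mark of the crown".
index_tokens = [("mark", 2), ("crown", 4)]      # index analyzer output
query_tokens = [("mark", 2), ("crown", 5)]      # query analyzer output
flattened_query = [("mark", 2), ("crown", 4)]   # query side after adding the filter

print(phrase_matches(index_tokens, query_tokens))     # False
print(phrase_matches(index_tokens, flattened_query))  # True
```

In this model the query expects crown three positions after mark, while the index stores it two positions after, so the phrase cannot line up. Once both sides report position 4, it matches, which is consistent with the workaround described in the first message.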
