Re: solr relatedness weirdness on json facet function

Dan Rosher Thu, 07 Apr 2022 03:42:08 -0700

Thanks Michael, that may well be the issue! I need to reorder the chain.
Thanks also for the note on WordDelimiterGraphFilter, which I'll look into
as well.


On Wed, 6 Apr 2022 at 17:14, Michael Gibney <mich...@michaelgibney.net>
wrote:

> I think the behavior you're seeing is a consequence of the fact that you're
> applying index-time stopword filtering *before* the tokens are further
> manipulated by WordDelimiterGraphFilter. E.g.:
>
> "the token-is-retained" => "the" "token-is" "retained" => "the" "token"
> "is" "retained"
>
> In the case above, "token-is" doesn't match the stopword list, but it is
> subsequently decomposed into "token" and "is". So in fact you're likely
> getting other unexpected query behavior; it's just easier to see/more
> explicit with faceting.
>
> It may also be worth noting that faceting on tokenized fields (TextField)
> currently only works via "uninversion" of indexed field values (i.e.,
> docValues are not supported). This can be quite resource-intensive, and is
> probably best avoided unless you have a specific need for it (which you
> may well have!).
>
> Also, index-time WordDelimiterGraphFilter configured to both "split" and
> "catenate" tokens can yield subtly strange results in phrase queries, if
> that matters to you.
>
> Michael
>
> On Wed, Apr 6, 2022 at 4:52 AM Dan Rosher <rosh...@gmail.com> wrote:
>
> > Hi Michael,
> >
> > Here are the field and fieldType with a result snippet.
> >
> > I've checked the stopword list, and words like "a" or "be" are in it. I've
> > also used the UI analysis to check that they indeed should be removed when
> > indexed and queried.
> >
> > Many thanks,
> > Dan
> >
> > *example results:*
> > ....
> >   "facets": {
> >     "count": 58215,
> >     "description": {
> >       "buckets": [
> >         {
> >           "val": "a",
> >           "count": 4,
> >           "relatedness": {
> >             "relatedness": 0.98239,
> >             "foreground_popularity": 0.01279,
> >             "background_popularity": 0.01279
> >           }
> >         },
> >         {
> >           "val": "be",
> >           "count": 6,
> >           "relatedness": {
> >             "relatedness": 0.98239,
> >             "foreground_popularity": 0.01279,
> >             "background_popularity": 0.01279
> >           }
> >         },
> > ....
> >
> > *field*:        <field name="description"   type="textgen-stemmed"
> > indexed="true"  stored="true"  multiValued="false"/>
> > *fieldtype*:
> >        <fieldType name="textgen-stemmed" class="solr.TextField"
> > positionIncrementGap="100">
> >             <similarity class="solr.ClassicSimilarityFactory"/>
> >             <analyzer type="index">
> >                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> > pattern="\.$" replacement=""/>
> >                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> > pattern="\.\s+" replacement=" "/>
> >                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> > pattern="[*,;|/]" replacement=" "/>
> >                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> > pattern="(\S+)(\.(?i:net))\b" replacement="$1 $2"/>
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.SynonymGraphFilterFactory"
> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt"/>                        <!-- STOPWORDS HERE -->
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
> > splitOnNumerics="0"/>
> >                 <filter class="solr.FlattenGraphFilterFactory"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt" />
> >                 <filter class="solr.KStemFilterFactory"/>
> >             </analyzer>
> >             <analyzer type="query">
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.SynonymGraphFilterFactory"
> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt"/>                      <!-- STOPWORDS HERE -->
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> > splitOnNumerics="0"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt" />
> >                 <filter class="solr.KStemFilterFactory"/>
> >             </analyzer>
> >         </fieldType>
> >
> >
> > On Tue, 5 Apr 2022 at 14:58, Michael Gibney <mich...@michaelgibney.net>
> > wrote:
> >
> > > Both `qf` and `relatedness` should be orthogonal to your question, iiuc.
> > > Understanding that your question is mainly about which terms are included
> > > (i.e. included at all -- nevermind ranking), then the only thing that
> > > should determine that is the field and fieldType config for the terms
> > > facet "field" property -- i.e., "description". Can you share that
> > > information, including index-time analysis chain config?
> > >
> > > On Tue, Apr 5, 2022 at 8:52 AM Dan Rosher <rosh...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > If I run a facet on relatedness on a qf field (examples below) which
> > > > has stopword removal, I get stopwords in the json facet?
> > > >
> > > > Anyone know why, and if this can be avoided?
> > > >
> > > > Many thanks,
> > > > Dan
> > > >
> > > > =================
> > > >
> > > > Details
> > > > Solr 7.7.2
> > > >
> > > > http://localhost:8983/solr/collection/select?
> > > > q=my query&
> > > > defType=edismax&
> > > > qf=description&
> > > > fore={!type=$defType qf=$qf v=$q}&
> > > > back=*:*&
> > > > rows=0&
> > > > json.facet={
> > > >   "description":{
> > > >     "type": "terms",
> > > >     "field": "description",
> > > >     "sort": { "relatedness": "desc"},
> > > >     "mincount": 2,
> > > >     "limit": 8,
> > > >     "facet": {
> > > >         "relatedness": {
> > > >             "type": "func",
> > > >             "func": "relatedness($fore,$back)"
> > > >         }
> > > >     }
> > > >   }
> > > > }
> > > >
> > >
> >
>
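
The ordering effect Michael describes above (stopword filtering applied before WordDelimiterGraphFilter splits tokens) can be illustrated with a minimal sketch in plain Python. Nothing here is a Solr API: the STOPWORDS set is a hypothetical stand-in for stopwords.txt, and split_on_delimiters only roughly approximates generateWordParts=1, ignoring catenation, synonyms, lowercasing and stemming. The input string is a slight variation on the example in the thread.

```python
import re

# Hypothetical stand-in for stopwords.txt (assumption, not the real list).
STOPWORDS = {"the", "is", "a", "be"}


def whitespace_tokenize(text):
    """Split on whitespace only, like a whitespace tokenizer."""
    return text.split()


def stop_filter(tokens):
    """Drop tokens that match the stopword set, ignoring case."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def split_on_delimiters(tokens):
    """Rough approximation of word-part generation: split each token
    on runs of non-alphanumeric characters and drop empty parts."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    return parts


def stop_then_split(text):
    """Order used in the posted fieldType: stop filter first, split second."""
    return split_on_delimiters(stop_filter(whitespace_tokenize(text)))


def split_then_stop(text):
    """Reordered chain: split first, stop filter second."""
    return stop_filter(split_on_delimiters(whitespace_tokenize(text)))


if __name__ == "__main__":
    sample = "the token-is retained"
    print(stop_then_split(sample))  # "is" survives: it was hidden inside "token-is"
    print(split_then_stop(sample))  # "is" is removed after the split
```

In the first ordering the stop filter only ever sees "token-is", so "is" reaches the index and can show up as a facet bucket. In the second ordering the split happens first and "is" is filtered out.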
