Thanks Michael, that may well be the issue! I need to reorder the chain and thanks for the suggestion on the WordDelimiterGraphFilter which I'll look into as well.
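For anyone following along, the ordering effect Michael describes can be reproduced with a toy simulation. This is only a sketch under stated assumptions: the three functions are hypothetical stand-ins rather than Lucene's actual filters, the delimiter step only mimics the "split into word parts" behaviour, and the stopword set is an assumed subset of stopwords.txt.

```python
import re

# Assumed subset of stopwords.txt, for illustration only.
STOPWORDS = {"the", "is", "a", "be"}


def whitespace_tokenize(text):
    """Stand-in for WhitespaceTokenizerFactory: split on whitespace only."""
    return text.split()


def stop_filter(tokens):
    """Stand-in for StopFilterFactory with ignoreCase="true"."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def word_delimiter(tokens):
    """Crude stand-in for WordDelimiterGraphFilterFactory word splitting:
    break each token on runs of non-alphanumeric characters."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    return parts


text = "the token-is-retained"

# Current order: stopwords are removed before the hyphenated token is split,
# so the "is" produced by the split is never checked against the list.
stop_first = word_delimiter(stop_filter(whitespace_tokenize(text)))

# Reordered: split first, then remove stopwords.
split_first = stop_filter(word_delimiter(whitespace_tokenize(text)))

print(stop_first)   # ['token', 'is', 'retained']
print(split_first)  # ['token', 'retained']
```

If the real chain behaves the same way, moving StopFilterFactory after WordDelimiterGraphFilterFactory in the index-time analyzer (and then reindexing) should keep parts like "is" out of the facet buckets.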
On Wed, 6 Apr 2022 at 17:14, Michael Gibney <mich...@michaelgibney.net> wrote:

> I think the behavior you're seeing is a consequence of the fact that you're
> applying index-time stopword filtering *before* the tokens are further
> manipulated by WordDelimiterGraphFilter. E.g.:
>
> "the token-is-retained" => "the" "token-is" "retained" => "the" "token" "is" "retained"
>
> In the case above, "token-is" doesn't match the stopword list, but it is
> subsequently decomposed into "token" and "is". So in fact you're likely
> getting other unexpected query behavior; it's just easier to see/more
> explicit with faceting.
>
> It may also be worth noting that faceting on tokenized fields (TextField)
> currently only works via "uninversion" of indexed field values (i.e.,
> docValues are not supported). This can be quite resource-intensive, and is
> probably best to avoid unless you have a specific need to do this (which
> you may well indeed have!).
>
> Also, index-time WordDelimiterGraphFilter configured to both "split" and
> "catenate" tokens can yield subtly strange results in phrase queries, if
> that matters to you.
>
> Michael
>
> On Wed, Apr 6, 2022 at 4:52 AM Dan Rosher <rosh...@gmail.com> wrote:
>
> > Hi Michael,
> >
> > Here are the field and fieldType with a result snippet.
> >
> > I've checked the stopword list, and words like "a" or "be" are in it. I've
> > also used the UI analysis to check that they indeed should be removed when
> > indexed and queried.
> >
> > Many thanks,
> > Dan
> >
> > *example results:*
> > ....
> > "facets": {
> >   "count": 58215,
> >   "description": {
> >     "buckets": [
> >       {
> >         "val": "a",
> >         "count": 4,
> >         "relatedness": {
> >           "relatedness": 0.98239,
> >           "foreground_popularity": 0.01279,
> >           "background_popularity": 0.01279
> >         }
> >       },
> >       {
> >         "val": "be",
> >         "count": 6,
> >         "relatedness": {
> >           "relatedness": 0.98239,
> >           "foreground_popularity": 0.01279,
> >           "background_popularity": 0.01279
> >         }
> >       },
> > ....
> >
> > *field*:
> > <field name="description" type="textgen-stemmed"
> >        indexed="true" stored="true" multiValued="false"/>
> >
> > *fieldtype*:
> > <fieldType name="textgen-stemmed" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <similarity class="solr.ClassicSimilarityFactory"/>
> >   <analyzer type="index">
> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >                 pattern="\.$" replacement=""/>
> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >                 pattern="\.\s+" replacement=" "/>
> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >                 pattern="[*,;|/]" replacement=" "/>
> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >                 pattern="(\S+)(\.(?i:net))\b" replacement="$1 $2"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymGraphFilterFactory"
> >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"/> <!-- STOPWORDS HERE -->
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
> >             splitOnNumerics="0"/>
> >     <filter class="solr.FlattenGraphFilterFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt" />
> >     <filter class="solr.KStemFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymGraphFilterFactory"
> >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"/> <!-- STOPWORDS HERE -->
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> >             splitOnNumerics="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt" />
> >     <filter class="solr.KStemFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > On Tue, 5 Apr 2022 at 14:58, Michael Gibney <mich...@michaelgibney.net> wrote:
> >
> > > Both `qf` and `relatedness` should be orthogonal to your question, iiuc.
> > > Understanding that your question is mainly about which terms are included
> > > (i.e. included at all -- nevermind ranking), then the only thing that
> > > should determine that is the field and fieldType config for the terms facet
> > > "field" property -- i.e., "description". Can you share that information,
> > > including index-time analysis chain config?
> > >
> > > On Tue, Apr 5, 2022 at 8:52 AM Dan Rosher <rosh...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > If I run a facet on relatedness on a qf field (examples below) which has
> > > > stopword removal, I get stopwords in the json facet?
> > > >
> > > > Anyone know why, and if this can be avoided?
> > > >
> > > > Many thanks,
> > > > Dan
> > > >
> > > > =================
> > > >
> > > > Details
> > > > Solr 7.7.2
> > > >
> > > > http://localhost:8983/solr/collection/select?
> > > >   q=my query&
> > > >   defType=edismax&
> > > >   qf=description&
> > > >   fore={!type=$defType qf=$qf v=$q}&
> > > >   back=*:*&
> > > >   rows=0&
> > > >   json.facet={
> > > >     "description":{
> > > >       "type": "terms",
> > > >       "field": "description",
> > > >       "sort": { "relatedness": "desc"},
> > > >       "mincount": 2,
> > > >       "limit": 8,
> > > >       "facet": {
> > > >         "relatedness": {
> > > >           "type": "func",
> > > >           "func": "relatedness($fore,$back)"
> > > >         }
> > > >       }
> > > >     }
> > > >   }