[ https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rudi Seitz updated SOLR-16594:
------------------------------
    Summary: improve eDismax strategy for generating a term-centric query  (was: eDismax should use startOffset when converting per-field to per-term queries)

> improve eDismax strategy for generating a term-centric query
> ------------------------------------------------------------
>
>            Key: SOLR-16594
>            URL: https://issues.apache.org/jira/browse/SOLR-16594
>        Project: Solr
>     Issue Type: Improvement
>     Components: query parsers
>       Reporter: Rudi Seitz
>       Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes switches from a "term-centric" to a "field-centric" approach. This creates inconsistent semantics for the {{mm}} ("min should match") parameter and may affect scoring. The goal of this ticket is to improve the approach that edismax uses for generating term-centric queries so that it less frequently "gives up" and resorts to the field-centric approach. Specifically, we propose that edismax create a dismax query for each distinct startOffset found among the tokens emitted by the field analyzers. Since the relevant code in edismax works with Query objects that contain Terms, and since Terms do not hold the startOffset of the Token from which the Term was derived, some plumbing work would need to be done to make the startOffsets available to edismax.
>
> BACKGROUND:
>
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric interpretation of the query would contain a clause for each field:
> {{(f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{(f1:foo f2:foo) (f1:bar f2:bar)}}
>
> The challenge in generating a term-centric query is that we need to take the tokens that emerge from each field's analysis chain and group them according to the terms in the user's original query. However, the tokens that emerge from an analysis chain do not store a reference to their corresponding input terms. For example, if we pass "foo bar" through an ngram analyzer we would get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious to a human that "f", "fo", and "foo" all come from the "foo" input term, and that "b", "ba", and "bar" come from the "bar" input term, there is not always an easy way for edismax to see this connection. When {{sow=true}}, edismax passes each whitespace-separated term through each analysis chain separately, so edismax "knows" that the output tokens from any given analysis chain are all derived from the single input term that was passed into that chain. However, when {{sow=false}}, edismax passes the entire multi-term query through each analysis chain as a whole, resulting in multiple output tokens that are not "connected" to their source terms.
>
> Edismax still tries to generate a term-centric query when {{sow=false}} by first generating a boolean query for each field and then checking whether all of these per-field queries have the same structure. The structure will generally be uniform if each analysis chain emits the same number of tokens for the given input. If one chain has a synonym filter and another doesn't, this uniformity may depend on whether a synonym rule happened to match a term in the user's input.
>
> Assuming the per-field boolean queries _do_ have the same structure, edismax reorganizes them into a new boolean query. The new query contains a dismax for each clause position in the original queries. If the original queries are {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}}, we can see they have two clauses each, so we would get a dismax containing all the first-position clauses, {{(f1:foo f2:foo)}}, and another dismax containing all the second-position clauses, {{(f1:bar f2:bar)}}.
>
> In other words, edismax is using clause position as a heuristic to reorganize the per-field boolean queries into per-term ones, even though it doesn't know for sure which clauses inside those per-field boolean queries are related to which input terms. We propose that a better way of reorganizing the per-field boolean queries is to create a dismax for each distinct startOffset seen among the tokens in the token streams emitted by each field analyzer. The startOffset of a token (or rather, of a PackedTokenAttributeImpl) is "the position of the first character corresponding to this token in the source text".
>
> We propose that startOffset is a reasonable way of matching output tokens up with the input terms that gave rise to them. For example, if we pass "foo bar" through an ngram analysis chain we see that the foo-related tokens all have startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, tokens that are generated via synonym expansion have a startOffset that points to the beginning of the matching input term. For example, if the query "GB" generates "GB gib gigabyte gigabytes" via synonym expansion, all four of those tokens would have startOffset=0.
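> As a quick, self-contained illustration of where these startOffsets come from, here is a sketch (not edismax code; the field name "f2" and the whitespace-plus-edge-ngram analyzer are made up for the example) that runs the whole query string "foo bar" through an ngram-style chain and prints each output token with its startOffset via Lucene's OffsetAttribute:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>
> public class StartOffsetDemo {
>   public static void main(String[] args) throws Exception {
>     // Whitespace tokenizer followed by 1-3 character edge ngrams,
>     // roughly what an ngram-style field type might do.
>     Analyzer ngramAnalyzer = new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer source = new WhitespaceTokenizer();
>         TokenStream result = new EdgeNGramTokenFilter(source, 1, 3, false);
>         return new TokenStreamComponents(source, result);
>       }
>     };
>
>     // Analyze the whole query string at once, as edismax does when sow=false.
>     try (TokenStream ts = ngramAnalyzer.tokenStream("f2", "foo bar")) {
>       CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>       OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
>       ts.reset();
>       while (ts.incrementToken()) {
>         // Prints lines like "foo startOffset=0"; every gram derived from "foo"
>         // reports startOffset=0 and every gram derived from "bar" reports startOffset=4.
>         System.out.println(term + " startOffset=" + offset.startOffset());
>       }
>       ts.end();
>     }
>   }
> }
> {code}
> Even though the query was analyzed as a whole (as with {{sow=false}}), the offsets still point back to the source terms. Exact constructor signatures (e.g. for EdgeNGramTokenFilter) vary a bit across Lucene versions.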
> Here's an example of how the proposed edismax logic would work. Let's say a user searches for "foo bar" across two fields, f1 and f2, where f1 uses a standard text analysis chain while f2 generates ngrams. We would get the field-centric queries {{(f1:foo f1:bar)}} and {{(f2:f f2:fo f2:foo f2:b f2:ba f2:bar)}}. Edismax's "all same query structure" check would fail here, but if we look for the unique startOffsets seen among all the tokens we would find offsets 0 and 4. We could then generate one clause for all the startOffset=0 tokens, {{(f1:foo f2:f f2:fo f2:foo)}}, and another for all the startOffset=4 tokens, {{(f1:bar f2:b f2:ba f2:bar)}}. This would effectively give us a "term-centric" query with consistent mm and scoring semantics, even though the analysis chains are not "compatible."
>
> As mentioned, significant plumbing would be needed to make startOffsets available to edismax in the code where the per-field queries are converted into per-term queries. Modifications would possibly be needed in both the Solr and Lucene repos. This ticket is logged in hopes of gathering feedback about whether this is a worthwhile/viable approach to pursue further.
>
> Related tickets:
> https://issues.apache.org/jira/browse/SOLR-12779
> https://issues.apache.org/jira/browse/SOLR-15407
>
> Related blog entries:
> [https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities]
> [https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html]
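>
> Finally, to make the proposed reorganization step concrete, here is a rough sketch of grouping every field's output tokens by startOffset and building one dismax per distinct offset. This is illustrative only, not the actual edismax code path, and it sidesteps the plumbing issue described above by re-running the field analyzers directly; the class name, the {{build}} method, the {{fieldAnalyzers}} map, and the 0.0f tie-break value are all placeholders:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> import java.util.TreeMap;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.DisjunctionMaxQuery;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
>
> public class OffsetCentricQuerySketch {
>
>   /**
>    * Builds a "term-centric" query: one dismax per distinct startOffset,
>    * each dismax containing every field's tokens that share that offset.
>    */
>   static Query build(String queryText, Map<String, Analyzer> fieldAnalyzers) throws Exception {
>     // startOffset -> all per-field TermQuerys whose token starts at that offset
>     Map<Integer, List<Query>> byOffset = new TreeMap<>();
>
>     for (Map.Entry<String, Analyzer> entry : fieldAnalyzers.entrySet()) {
>       String field = entry.getKey();
>       try (TokenStream ts = entry.getValue().tokenStream(field, queryText)) {
>         CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>         OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
>         ts.reset();
>         while (ts.incrementToken()) {
>           byOffset
>               .computeIfAbsent(offsetAtt.startOffset(), k -> new ArrayList<>())
>               .add(new TermQuery(new Term(field, termAtt.toString())));
>         }
>         ts.end();
>       }
>     }
>
>     // One dismax per offset group; mm (minShouldMatch) would be applied to the
>     // SHOULD clauses of this outer boolean query, giving per-term semantics.
>     BooleanQuery.Builder builder = new BooleanQuery.Builder();
>     for (List<Query> group : byOffset.values()) {
>       builder.add(new DisjunctionMaxQuery(group, 0.0f), BooleanClause.Occur.SHOULD);
>     }
>     return builder.build();
>   }
> }
> {code}
> With the f1/f2 example above, the resulting query would be roughly {{((f1:foo | f2:f | f2:fo | f2:foo) (f1:bar | f2:b | f2:ba | f2:bar))}}, i.e. one dismax per input term, which is what gives {{mm}} its consistent per-term semantics.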