[
https://issues.apache.org/jira/browse/SOLR-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18034227#comment-18034227
]
Chris M. Hostetter commented on SOLR-17928:
-------------------------------------------
{quote}Previously, improving accuracy required increasing topK (which returns
more results), but efSearch enables exploring more candidates while still
receiving exactly topK results. And default efSearch is 2*topK.
{quote}
I understand how it works – my concern is that adding this new option with a
default of "topK * 2" means that for any _existing_ query (that isn't modified
in advance to specify an {{efSearch}} param) upgrading after this feature is
added is going to do twice as much "work" – effectively doubling the graph walk
time of the query. (correct?)
Based on my interaction with heavy KNN users, that is likely to
surprise/confuse/frustrate a lot of people – because they have already tuned
their knn queries to use a carefully chosen topK value (that they pass to Solr)
based on how it affects the relevancy of the "top N < K" results they
_actually_ look at in the response.
Example: they only care about the top N=100 results, but they pass topK=500 to
Solr because:
* when they tried topK=100 the results were faster, but not accurate enough to
be useful
* when they tried topK=1000 the results were even more accurate, but too slow
to be worth the added improvements
An {{efSearch}} default of "topK * 2" would likely improve the relevancy of
existing queries, but it would effectively be a "performance backcompat break"
for existing users who upgrade.
*That's what I'm hung up on: whether that is a good idea?*
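To make the concern concrete, here is a minimal sketch of the arithmetic (Python used purely for illustration). The helper name {{effective_ef_search}} is hypothetical, and it assumes that today's graph walk explores roughly topK candidates, which is my reading of the current behavior rather than something taken from the patch:

```python
def effective_ef_search(top_k, ef_search=None, default_multiplier=2):
    """Hypothetical helper: how many candidates the graph walk explores.

    Assumption: when efSearch is not passed, the proposed default
    of topK * 2 applies.
    """
    if ef_search is None:
        return top_k * default_multiplier
    return ef_search


top_k = 500                                  # value an existing user already tuned
before_upgrade = top_k                       # assumption: walk explores ~topK today
after_upgrade = effective_ef_search(top_k)   # same query, no efSearch param added

print(before_upgrade, after_upgrade)         # 500 1000
```

Under those assumptions the unmodified topK=500 query explores as many candidates as the topK=1000 query the user already rejected as too slow.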
----
In the latest patch, the user gets an error if {{efSearch < topK}} ; which
makes sense at a low level – but seems like it might be error prone if/when
users are tuning their query params -- especially combined with a default
{{efSearch}} that is _relative_ to the {{topK}} value:
* If I _only_ set the {{topK=K}} param, and I increase the value of "K", then I
not only increase the number of results matched, but also the amount of graph
walking done.
* If I set _both_ {{topK=K efSearch=X}}, and I increase the value of "K", then
I _only_ increase the number of results matched; the amount of graph walking
_stays the same_ (and my overall relevancy effectively decreases)
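The two cases above can be sketched as follows. This is only an illustration of the semantics as I understand them from the patch description (default of 2 * topK, error when {{efSearch < topK}}); the function names {{resolve_ef_search}} and {{overfetch_ratio}} are hypothetical, not Solr code:

```python
def resolve_ef_search(top_k, ef_search=None):
    """Hypothetical model of an absolute integer efSearch param."""
    if ef_search is None:
        return 2 * top_k            # default is relative to topK
    if ef_search < top_k:
        raise ValueError("efSearch must be >= topK")
    return ef_search                # explicit value is absolute


def overfetch_ratio(top_k, ef_search=None):
    """Graph-walk candidates explored per result returned."""
    return resolve_ef_search(top_k, ef_search) / top_k


# Only topK set: doubling K doubles the walk, the ratio stays at 2.0
print(overfetch_ratio(100), overfetch_ratio(200))

# topK and efSearch both set: doubling K halves the ratio (4.0 -> 2.0)
print(overfetch_ratio(100, 400), overfetch_ratio(200, 400))
```

In this model, raising K far enough with a fixed explicit {{efSearch}} eventually trips the {{efSearch < topK}} error, which is the "error prone while tuning" scenario.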
I'm wondering if (instead of an "integer" {{efSearch >= topK}} param) it would
make more sense to have a "float" {{efSearchFactor >= 1.0}} param that would be
multiplied by the {{topK}} param to determine the effective {{efSearch}} value
used internally in the code.
That way:
* Changing (only) {{topK}} impacts the number of docs matched by the query,
keeping the _relative_ amount of graph walking the same (regardless of whether
you have an explicit {{efSearchFactor}} param or use the default)
* If you do have an {{efSearchFactor}} param set, changing it can tune the
amount of graph walking done, _relative_ to the {{topK}} value you use, in a
way that will scale if/when you change {{topK}}
?
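A rough sketch of what I have in mind, with loud caveats: the name {{ef_search_from_factor}} is hypothetical, rounding up to a whole number of candidates is my assumption, and the default of 1.0 is only a placeholder here (the right default is exactly the open question above):

```python
import math


def ef_search_from_factor(top_k, ef_search_factor=1.0):
    """Hypothetical: derive the internal efSearch from a float factor.

    Assumptions: the factor must be >= 1.0, and the product is rounded
    up so the result is always an integer >= topK.
    """
    if ef_search_factor < 1.0:
        raise ValueError("efSearchFactor must be >= 1.0")
    return math.ceil(top_k * ef_search_factor)


# Changing only topK keeps the relative amount of graph walking the same
print(ef_search_from_factor(100, 1.5))   # 150
print(ef_search_from_factor(200, 1.5))   # 300

# A factor of 1.0 explores exactly topK candidates
print(ef_search_from_factor(500))        # 500
```

With a factor there is no combination of valid params that can produce an effective {{efSearch}} smaller than {{topK}}, so the cross-param error case goes away.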
----
{quote}ElasticSearch also has a similar parameter called num_candidates which
achieves something similar, and they default to 1.5*topK
{quote}
I really don't care what ES does or what their defaults are – I care what makes
the most sense for (existing & new) Solr users
> Add efSearch parameter to knn query
> -----------------------------------
>
> Key: SOLR-17928
> URL: https://issues.apache.org/jira/browse/SOLR-17928
> Project: Solr
> Issue Type: Improvement
> Components: vector-search
> Reporter: Ishan Chattopadhyaya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Right now, only topK can be requested. efSearch is a standard overfetch
> parameter.
> Proposing that we add it for better recall accuracy.
> (FYI, Elasticsearch calls it num_candidates. Commonly referred to as
> efSearch, similar to efConstruction that we call beamWidth)