[jira] [Created] (SOLR-16651) Optimize execution of KNN sub-query to apply it only on documents remaining after the main query

Gabriel Magno (Jira) Fri, 10 Feb 2023 08:02:10 -0800

Gabriel Magno created SOLR-16651:
------------------------------------

             Summary: Optimize execution of KNN sub-query to apply it only on 
documents remaining after the main query
                 Key: SOLR-16651
                 URL: https://issues.apache.org/jira/browse/SOLR-16651
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: query
    Affects Versions: 9.1.1
            Reporter: Gabriel Magno



Solr 9.1 introduced pre-filtering for KNN queries, which is great and is 
working fine when the KNN is the main query.

I was wondering whether it would be possible to do something similar for the 
case where KNN is a sub-query instead of the main query (q). Let me show an 
example use case with the films example.

I want to query for films with “the” in the name, keep only films with genre 
“Drama”, and then calculate the similarity of these films' vectors to my 
target vector. The idea is to run a simple lexical query and use the KNN 
sub-query only to calculate similarities (not necessarily to sort by them). 
Here is an example query:
 * URL: 
[http://localhost:8983/solr/#/films/query?q=name:the&fq=genre:Drama&my_similarity=%7B!knn%20f%3Dfilm_vector%20topK%3D10000%7D%5B0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0%5D&fl=*,$my_similarity]
 * Params:
 ** {*}q{*}=name:the
 ** {*}fq{*}=genre:Drama
 ** {*}my_similarity{*}=\{!knn f=film_vector 
topK=10000}[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
 ** {*}fl{*}=*,$my_similarity
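
For reference, the same parameters can be assembled and URL-encoded 
programmatically. This is only a minimal sketch: the helper name 
`knn_subquery` is made up for illustration, and the snippet just builds the 
query string without sending it to Solr.

```python
from urllib.parse import urlencode, parse_qs

# Target vector from the example above.
target_vector = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def knn_subquery(field, top_k, vector):
    """Hypothetical helper: format a {!knn} local-params query string."""
    vector_text = ",".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, vector_text)


params = {
    "q": "name:the",                 # main lexical query
    "fq": "genre:Drama",             # filter query
    "my_similarity": knn_subquery("film_vector", 10000, target_vector),
    "fl": "*,$my_similarity",        # return all fields plus the sub-query score
}

# Percent-encode the parameters into a query string.
query_string = urlencode(params)

# Decoding the query string gives back the original parameter values.
decoded = {key: values[0] for key, values in parse_qs(query_string).items()}
print(query_string)
```

The encoded string can then be appended to whichever request handler URL the 
collection exposes.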

This query works fine. The problem is that the `my_similarity` sub-query runs 
over all 1,100 film documents instead of only the 51 that match the query. 
For a small collection like this it makes no difference, but I have a 
collection with 12 million documents where queries like this run very slowly, 
even though the result set is small.
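
To make the desired behavior concrete, here is a toy sketch in plain Python 
(not Solr code). The documents are invented, and cosine similarity is only an 
assumption, since the actual similarity function depends on how the vector 
field is configured. The point is the order of operations: match and filter 
first, then score only what remains.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy documents (hypothetical data, not the real films collection).
docs = [
    {"name": "the pianist", "genre": "Drama", "vector": [1.0, 0.0]},
    {"name": "alien", "genre": "Horror", "vector": [0.0, 1.0]},
    {"name": "the matrix", "genre": "Action", "vector": [1.0, 1.0]},
    {"name": "the hours", "genre": "Drama", "vector": [0.0, 2.0]},
]
target = [1.0, 0.0]

# Step 1: main query (name contains the word "the") plus filter (genre is Drama).
matched = [
    d for d in docs
    if "the" in d["name"].split() and d["genre"] == "Drama"
]

# Step 2: compute similarity only for the documents that survived step 1.
scored = [(d["name"], cosine(d["vector"], target)) for d in matched]
print(len(docs), "documents total,", len(scored), "similarities computed")
```

In the films example, step 2 would touch 51 documents rather than 1,100.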

I tried using the cache and cost parameters to "force" the KNN sub-query to 
run after the main query (`\{!knn cache=false cost=101 f=film_vector 
topK=10000}[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]`), but it does not work (I 
guess PostFilter is not implemented for KNN).

This issue might be related to the fix for the stack overflow bug when 
combining frange with KNN (https://issues.apache.org/jira/browse/SOLR-16567).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

