[jira] [Updated] (SOLR-17679) Request for Documentation on Hybrid Lexical and Vector Search with Score Breakdown and Cutoff Logic

Khaled Alkhouli (Jira) Thu, 20 Feb 2025 07:03:39 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Khaled Alkhouli updated SOLR-17679:
-----------------------------------
    Description: 
Hello Apache Solr team,

I was able to implement a hybrid search engine that combines *lexical search 
(edismax)* and *vector search (KNN-based embeddings)* within a single request. 
The idea is simple:
 * *Lexical Search* retrieves results based on text relevance.
 * *Vector Search* retrieves results based on semantic similarity.
 * *Hybrid Scoring* sums both scores, where a missing score (if a document 
appears in only one search) should be treated as zero.
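
The intended combination rule can be sketched in plain Python, outside Solr. This is only an illustration of "sum both scores, missing counts as zero"; the function name, document ids, and score values are hypothetical.

```python
def combine_scores(lexical, vector):
    """Sum per-document scores; a document absent from one result set contributes 0 there."""
    doc_ids = set(lexical) | set(vector)
    return {
        doc_id: lexical.get(doc_id, 0.0) + vector.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical ids and scores, for illustration only.
lexical_scores = {"doc1": 0.75, "doc2": 0.5}
vector_scores = {"doc2": 0.25, "doc3": 0.5}
combined = combine_scores(lexical_scores, vector_scores)
```

Here doc2 appears in both result sets and receives the sum, while doc1 and doc3 keep the single score they have.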

This approach is working, but *there is a critical lack of documentation* on 
how to return the individual score components: the lexical search score 
(score1) and the vector search score (score2, from cosine similarity). Right 
now, Solr returns only the final combined score, with no clear way to see *how 
much of that score comes from lexical search vs. vector search*. This breakdown 
is essential for debugging and for fine-tuning ranking strategies.

 

I have implemented the following logic using Python:
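{code:python}
import numpy as np
import requests

# SOLR_URL, HEADERS and embed() are defined elsewhere.
def hybrid_search(query, top_k=10):
    # Embed the query; tolist() yields plain Python floats so the vector
    # renders as [0.1, 0.2, ...] inside the knn query string.
    embedding = np.array(embed([query]), dtype=np.float32)
    embedding = embedding[0].tolist()
    # Lexical (edismax) query with empty boost parameters.
    lxq = rf"""{{!type=edismax 
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        # Match the union of both result sets, score by the combined function.
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        # Scale lexical scores into [0, 1] before summing with the vector score.
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # NOTE: $cutoff is referenced here but never set in these params.
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    return response.json()
{code}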
h3. *Issues & Missing Documentation*
 # *No Way to Retrieve Individual Scores in a Hybrid Search*
There is no clear documentation on how to return:

 ** The *lexical search score* separately.
 ** The *vector search score* separately.
 ** The *final combined score* (which Solr already provides).

Right now, we’re left guessing whether the sum of these scores works as 
expected, making debugging and tuning unnecessarily difficult.
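
One avenue that may be worth documenting: Solr's fl parameter accepts aliased function-query pseudo-fields, so the per-component scores could possibly be requested alongside the combined score. The sketch below only builds the request parameters. The helper name is hypothetical, the aliases are the ones proposed in this issue, and whether query() behaves as hoped around a \{!knn} sub-query on 9.6.1 is an assumption that has not been verified here.

```python
def with_score_breakdown(params):
    """Return a copy of the request params whose fl also asks for per-component scores."""
    updated = dict(params)
    extra_fields = [
        "score",
        # Aliased function queries; the $... names refer to request params
        # that must already be defined in the same request.
        "lexical_score:query($normalisedLexicalQuery)",
        "vector_score:query($vectorQuery)",
    ]
    existing = [field for field in updated.get("fl", "").split(",") if field]
    updated["fl"] = ",".join(existing + extra_fields)
    return updated


params = with_score_breakdown({"fl": "text"})
```

As far as the function query documentation describes it, query() falls back to a default of 0 for documents the sub-query does not match, which would line up with treating a missing score as zero.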

 # *No Clear Way to Implement Cutoff Logic in Solr*
In a hybrid search, I need to filter out results that don’t meet a *minimum 
score threshold*. Right now, I have to implement this in Python, *which 
defeats the purpose of using Solr for ranking in the first place*.

 ** How can we enforce a *score-based cutoff directly in Solr*, without 
external filtering?
 ** The \{!frange} query parser is mentioned in the documentation but lacks 
*clear examples of how to apply it to hybrid search*.
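
A possible shape for such an example, offered as an untested sketch: apply \{!frange} as a filter query over the combined ranking function, so documents below the threshold are dropped before they are returned. The helper name and the threshold value are hypothetical, and the sketch assumes a rankingStage request parameter is defined as in the request above. Whether this is efficient, given that scale() has to look at all matching documents, is exactly the kind of guidance being requested.

```python
def with_cutoff(params, cutoff):
    """Return a copy of the request params with an frange filter on the combined score."""
    updated = dict(params)
    # Drop empty filter strings such as the [""] placeholder.
    filters = [fq for fq in updated.get("fq", []) if fq]
    # l= is the lower bound of the function range; $rankingStage must be
    # defined elsewhere in the same request.
    filters.append(f"{{!frange l={cutoff}}}query($rankingStage)")
    updated["fq"] = filters
    return updated


params = with_cutoff({"fq": [""]}, 0.5)
```

The original params dict is left untouched, so the same base request can be sent with and without the cutoff when comparing results.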

h3. *Feature Request / Documentation Improvement*
 * *Provide a way to return individual scores for lexical and vector search in 
the response.* This should be as simple as adding fields like 
{{fl=score,lexical_score,vector_score}}.
 * *Clarify how to apply cutoff logic in a hybrid search.* This is an essential 
ranking mechanism, and yet there is little guidance on how to do this 
efficiently within Solr itself.

Looking forward to a response.


> Request for Documentation on Hybrid Lexical and Vector Search with Score 
> Breakdown and Cutoff Logic
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17679
>                 URL: https://issues.apache.org/jira/browse/SOLR-17679
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 9.6.1
>            Reporter: Khaled Alkhouli
>            Priority: Major
>              Labels: hybrid-search, search, solr, vector-based-search
>         Attachments: Screenshot from 2025-02-20 16-31-48.png


