[jira] [Updated] (SOLR-17679) Request for Documentation on Hybrid Lexical and Vector Search with Score Breakdown and Cutoff Logic

Khaled Alkhouli (Jira) Thu, 20 Feb 2025 06:57:05 -0800


     [ https://issues.apache.org/jira/browse/SOLR-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Khaled Alkhouli updated SOLR-17679:
-----------------------------------
    Description: 
Hello Apache Solr team,

I was able to implement a hybrid search engine that combines *lexical search 
(edismax)* and *vector search (KNN-based embeddings)* within a single request. 
The idea is simple:
 * *Lexical Search* retrieves results based on text relevance.
 * *Vector Search* retrieves results based on semantic similarity.
 * *Hybrid Scoring* sums both scores, where a missing score (if a document 
appears in only one search) should be treated as zero.
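
The intended scoring rule can be reproduced outside Solr to sanity-check a response. The following is a minimal client-side sketch, not Solr's own implementation: the names {{min_max_scale}} and {{combine_scores}} are hypothetical, the inputs are assumed to be plain {{doc_id -> score}} mappings, and mapping a constant score set to 0.0 is an arbitrary choice made here.

```python
def min_max_scale(scores):
    """Scale a {doc_id: score} mapping into [0, 1].

    Assumption: if every score is identical, all documents map to 0.0.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_scores(lexical, vector):
    """Sum the normalized lexical score and the raw vector score per document.

    A document missing from one result set contributes 0.0 for that component.
    """
    lexical = min_max_scale(lexical)
    combined = {}
    for doc_id in set(lexical) | set(vector):
        lex = lexical.get(doc_id, 0.0)
        vec = vector.get(doc_id, 0.0)
        combined[doc_id] = {"lexical": lex, "vector": vec, "combined": lex + vec}
    return combined
```

Running the lexical and vector queries separately and feeding their scores through {{combine_scores}} gives a per-document breakdown to compare against the single combined score returned by the hybrid request. The numbers may not match exactly if Solr normalizes over a different set of documents than the ones passed in here.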

This approach is working, but *there is a critical lack of documentation* on 
how to properly return individual score components of lexical search (score1) 
vs. vector search (score2 from cosine similarity). Right now, Solr only returns 
the final combined score, but there is no clear way to see {*}how much of that 
score comes from lexical search vs. vector search{*}. This is essential for 
debugging and for fine-tuning ranking strategies.

 

I have implemented the following logic using Python:
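{code:python}
import numpy as np
import requests

# embed(), SOLR_URL and HEADERS are assumed to be defined elsewhere.
def hybrid_search(query, top_k=10):
    # Embed the query; tolist() yields plain Python floats so the vector
    # serializes as [0.1, 0.2, ...] (np.float32 items can render as np.float32(...)).
    embedding = np.array(embed([query]), dtype=np.float32)
    embedding = embedding[0].tolist()
    # Lexical (edismax) query with boosts explicitly cleared
    lxq = rf"""{{!type=edismax 
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        # Match the union of both queries, score by the summed function
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # NOTE: $cutoff is referenced here, but no "cutoff" parameter is set in this request
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    response = response.json()
    return response
{code}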
h3. *Issues & Missing Documentation*
 # *No Way to Retrieve Individual Scores in a Hybrid Search*
There is no clear documentation on how to return:

 ** The *lexical search score* separately.
 ** The *vector search score* separately.
 ** The *final combined score* (which Solr already provides).
Right now, we’re left guessing whether the sum of these scores works as 
expected, making debugging and tuning unnecessarily difficult.

 # *No Clear Way to Implement Cutoff Logic in Solr*
In a hybrid search, I need to filter out results that don’t meet a {*}minimum 
score threshold{*}. Right now, I have to implement this in Python, {*}which 
defeats the purpose of using Solr for ranking in the first place{*}.

 ** How can we enforce a {*}score-based cutoff directly in Solr{*}, without 
external filtering?
 ** The {{{!frange}}} query parser is mentioned in the documentation but lacks 
{*}clear examples on how to apply it to hybrid search{*}.
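
One approach that may already cover the score breakdown is Solr's support for function queries as aliased pseudo-fields in {{fl}}. The sketch below is a hedged illustration, not confirmed against this setup on 9.6.1: the helper name {{with_score_breakdown}} and the aliases {{lexical_score}} and {{vector_score}} are hypothetical, and it assumes the request already defines the {{normalisedLexicalQuery}} and {{vectorQuery}} parameters used in the code above.

```python
def with_score_breakdown(params):
    """Return a copy of the Solr params whose fl also requests per-component scores.

    Assumes params already defines normalisedLexicalQuery and vectorQuery.
    The aliases lexical_score and vector_score are illustrative names only.
    """
    updated = dict(params)
    base_fl = updated.get("fl", "*")
    updated["fl"] = ",".join([
        base_fl,
        "score",
        # query() evaluates a sub-query's score for the current document
        "lexical_score:query($normalisedLexicalQuery)",
        "vector_score:query($vectorQuery)",
    ])
    return updated
```

Applied to the {{"params"}} dictionary before sending the request, this would ask Solr to return both components next to {{score}} for each document. Documents that do not match a sub-query should get that function's default value, which is expected to be 0. The final {{score}} may still differ from the plain sum of the two components, because the {{rq}} stage can add its own weighted contribution.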

h3. *Feature Request / Documentation Improvement*
 * *Provide a way to return individual scores for lexical and vector search in 
the response.* This should be as simple as adding fields like 
{{{}fl=score,lexical_score,vector_score{}}}.
 * *Clarify how to apply cutoff logic in a hybrid search.* This is an essential 
ranking mechanism, and yet, there’s little guidance on how to do this 
efficiently within Solr itself.
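
For the cutoff, a filter query may be a better fit than {{rq}}: re-ranking appears to only re-score the top {{reRankDocs}} documents rather than remove any, whereas an {{fq}} using {{{!frange}}} restricts the result set. The following is a minimal sketch under that assumption. The helper name {{with_score_cutoff}} is hypothetical, and it relies on the {{rankingStage}} parameter from the code above being present in the request.

```python
def with_score_cutoff(params, cutoff):
    """Return a copy of the Solr params with a minimum-score filter added.

    Assumes params already defines rankingStage (the combined score function).
    """
    updated = dict(params)
    existing = updated.get("fq", [])
    if isinstance(existing, str):
        existing = [existing]
    # Drop empty filter strings, then append the frange filter
    filters = [fq for fq in existing if fq]
    filters.append("{!frange l=$cutoff}query($rankingStage)")
    updated["fq"] = filters
    updated["cutoff"] = str(cutoff)
    return updated
```

With this filter in place, only documents whose combined function value is at least {{cutoff}} should be returned. Evaluating {{scale()}} inside a filter may be expensive on large indexes, since it needs the minimum and maximum of the underlying function, so this is worth measuring before relying on it.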

Looking forward to a response.

  was:
Hello Apache Solr team,

I am building a hybrid search engine that combines lexical search (traditional 
keyword-based search) and vector search (semantic search using embeddings) in a 
single request. I’m aiming to achieve the following in one request:
 # *Lexical Search:* Using edismax with specified fields and weights.
 # *Vector Search:* Using K-Nearest Neighbors (KNN) based on embeddings.
 # *Hybrid Score Combination:* The final score is the sum of the normalized 
lexical score and the vector search score. If a document appears in only one 
search, the other score should be treated as zero.

I have implemented the following logic using Python:
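{code:python}
import numpy as np
import requests

# embed(), SOLR_URL and HEADERS are assumed to be defined elsewhere.
def hybrid_search(query, top_k=10):
    # Embed the query; tolist() yields plain Python floats so the vector
    # serializes as [0.1, 0.2, ...] (np.float32 items can render as np.float32(...)).
    embedding = np.array(embed([query]), dtype=np.float32)
    embedding = embedding[0].tolist()
    # Lexical (edismax) query with boosts explicitly cleared
    lxq = rf"""{{!type=edismax 
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        # Match the union of both queries, score by the summed function
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # NOTE: $cutoff is referenced here, but no "cutoff" parameter is set in this request
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    response = response.json()
    return response
{code}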
The response returns documents with a combined score, which I assume is the 
addition of:
 * *Lexical Search Score:* Normalized between 0 and 1.
 * *Vector Search Score:* Already bounded between 0 and 1.

If a document is present in one search but not the other, the score from the 
missing part is added as zero. Attached is an image of the current output.
h3. *Request:*

I would like documentation or guidance on the following:
 # *View and Return Individual Scores:* How can I retrieve the following scores in the same request?
 ** Lexical search score
 ** Vector search score
 ** Final combined score (already retrieved)
I would like to display all three scores in the response together for each 
document.
 # *Cutoff Logic:*
I am using a Python function to calculate a cutoff threshold based on the 
scores. Is it possible to implement this cutoff directly in Solr so that only 
documents that pass a certain threshold are returned? If so, how can I achieve 
this within Solr’s query syntax, without relying on external Python logic?

I appreciate any help or documentation that can assist with:
 * Returning separate scores for lexical and vector queries.
 * Implementing cutoff logic natively in Solr.

Thank you!

     Issue Type: Improvement  (was: Task)

> Request for Documentation on Hybrid Lexical and Vector Search with Score 
> Breakdown and Cutoff Logic
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17679
>                 URL: https://issues.apache.org/jira/browse/SOLR-17679
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 9.6.1
>            Reporter: Khaled Alkhouli
>            Priority: Major
>              Labels: hybrid-search, search, solr, vector-based-search
>         Attachments: Screenshot from 2025-02-20 16-31-48.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
