[ https://issues.apache.org/jira/browse/SOLR-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Khaled Alkhouli updated SOLR-17679:
-----------------------------------
    Description:
Hello Apache Solr team,

I was able to implement a hybrid search engine that combines *lexical search (edismax)* and *vector search (KNN-based embeddings)* within a single request. The idea is simple:
 * *Lexical Search* retrieves results based on text relevance.
 * *Vector Search* retrieves results based on semantic similarity.
 * *Hybrid Scoring* sums both scores, where a missing score (if a document appears in only one search) should be treated as zero.

This approach is working, but *there is a critical lack of documentation* on how to properly return the individual score components of lexical search (score1) vs. vector search (score2 from cosine similarity). Right now, Solr only returns the final combined score, but there is no clear way to see *how much of that score comes from lexical search vs. vector search*. This is essential for debugging and for fine-tuning ranking strategies.

I have implemented the following logic using Python:
{code:python}
import numpy as np
import requests

# embed(), SOLR_URL and HEADERS are defined elsewhere in the reporter's code.
def hybrid_search(query, top_k=10):
    embedding = np.array(embed([query]), dtype=np.float32)
    # Plain Python floats, so the f-string below renders a numeric list.
    embedding = embedding[0].tolist()
    lxq = rf"""{{!type=edismax
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # $cutoff is referenced here but not set in these params;
        # it has to be passed as a request parameter.
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    response = response.json()
    return response
{code}

h3. *Issues & Missing Documentation*
 # *No Way to Retrieve Individual Scores in a Hybrid Search*
There is no clear documentation on how to return:
 ** The *lexical search score* separately.
 ** The *vector search score* separately.
 ** The *final combined score* (which Solr already provides).
Right now, we’re left guessing whether the sum of these scores works as expected, making debugging and tuning unnecessarily difficult.
 # *No Clear Way to Implement Cutoff Logic in Solr*
In a hybrid search, I need to filter out results that don’t meet a *minimum score threshold*. Right now, I have to implement this in Python, *which defeats the purpose of using Solr for ranking in the first place*.
 ** How can we enforce a *score-based cutoff directly in Solr*, without external filtering?
 ** The {{{!frange}}} query parser is mentioned in the documentation but lacks *clear examples on how to apply it to hybrid search*.

h3. *Feature Request / Documentation Improvement*
 * *Provide a way to return individual scores for lexical and vector search in the response.* This should be as simple as adding fields like {{fl=score,lexical_score,vector_score}}.
 * *Clarify how to apply cutoff logic in a hybrid search.* This is an essential ranking mechanism, and yet there’s little guidance on how to do this efficiently within Solr itself.

Looking forward to a response.

  was:
Hello Apache Solr team,

I am building a hybrid search engine that combines lexical search (traditional keyword-based search) and vector search (semantic search using embeddings) in a single request. I’m aiming to achieve the following:
 # *Lexical Search:* Using edismax with specified fields and weights.
 # *Vector Search:* Using K-Nearest Neighbors (KNN) based on embeddings.
 # *Hybrid Score Combination:* The final score is the sum of the normalized lexical score and the vector search score. If a document appears in only one search, the other score should be treated as zero.
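
The score combination in item 3 can be sketched client-side, which may help when checking what Solr returns. This is only an illustration under stated assumptions: the function name {{combine_scores}} and the input dictionaries (document id to raw score) are hypothetical, the min-max step only approximates {{scale(...,0,1)}} because it sees just the scores passed in, and returning 0.0 when all lexical scores are equal is an arbitrary choice.

```python
from typing import Dict


def combine_scores(lexical: Dict[str, float],
                   vector: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Min-max normalise lexical scores to [0, 1], then add the vector score.

    A document missing from one result set contributes 0.0 for that part.
    """
    if lexical:
        lo, hi = min(lexical.values()), max(lexical.values())
    else:
        lo = hi = 0.0
    span = hi - lo

    combined = {}
    for doc_id in set(lexical) | set(vector):
        if doc_id in lexical and span > 0:
            lex = (lexical[doc_id] - lo) / span
        else:
            # Missing lexical score, or all lexical scores equal (assumption: use 0.0).
            lex = 0.0
        vec = vector.get(doc_id, 0.0)  # missing vector score counts as zero
        combined[doc_id] = {"lexical": lex, "vector": vec, "combined": lex + vec}
    return combined


if __name__ == "__main__":
    # Hypothetical scores for three documents.
    scores = combine_scores({"a": 2.0, "b": 6.0}, {"b": 0.5, "c": 0.9})
    for doc_id, parts in sorted(scores.items()):
        print(doc_id, parts)
```

Keeping the three components side by side per document is the same breakdown the issue asks Solr to return.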

I have implemented the following logic using Python:
{code:python}
import numpy as np
import requests

# embed(), SOLR_URL and HEADERS are defined elsewhere in the reporter's code.
def hybrid_search(query, top_k=10):
    embedding = np.array(embed([query]), dtype=np.float32)
    # Plain Python floats, so the f-string below renders a numeric list.
    embedding = embedding[0].tolist()
    lxq = rf"""{{!type=edismax
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # $cutoff is referenced here but not set in these params;
        # it has to be passed as a request parameter.
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    response = response.json()
    return response
{code}
The response returns documents with a combined score, which I assume is the sum of:
 * *Lexical Search Score:* Normalized between 0 and 1.
 * *Vector Search Score:* Already bounded between 0 and 1.

If a document is present in one search but not the other, the score from the missing part is added as zero. Attached is an image of the current output.

h3. *Request:*

I would like documentation or guidance on the following:
 # *View and Return Individual Scores:*
 ## Lexical search score
 ## Vector search score
 ## Final combined score (already retrieved)
I would like to display all three scores together in the response for each document.
 # *Cutoff Logic:* I am using a Python function to calculate a cutoff threshold based on the scores. Is it possible to implement this cutoff directly in Solr so that only documents that pass a certain threshold are returned? If so, how can I achieve this within Solr’s query syntax, without relying on external Python logic?
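
One possible direction for both requests, offered as an untested sketch rather than a confirmed answer: Solr's {{fl}} parameter accepts aliased function queries, and {{{!frange}}} can be used as a filter query, so the existing parameter names could in principle be reused for per-document score fields and for a cutoff. The helper name {{build_hybrid_params}}, the aliases {{lexical_score}} and {{vector_score}}, and the example cutoff value are assumptions; whether {{query($vectorQuery)}} yields the expected KNN score inside {{fl}} on 9.6.1 has not been verified here.

```python
def build_hybrid_params(lexical_query: str, vector_query: str,
                        cutoff: float, rows: int = 10) -> dict:
    """Build request params that expose score components and apply a cutoff.

    Only constructs the dictionary; nothing is sent to Solr.
    """
    return {
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lexical_query,
        "vectorQuery": vector_query,
        # Aliased function queries in fl (assumed aliases): one pseudo-field per component.
        "fl": ("text,score,"
               "lexical_score:query($normalisedLexicalQuery),"
               "vector_score:query($vectorQuery)"),
        # Lower-bound filter on the combined score; $cutoff is defined just below.
        "fq": "{!frange l=$cutoff}query($rankingStage)",
        "cutoff": str(cutoff),
        "rows": rows,
        "sort": "score desc",
    }


if __name__ == "__main__":
    params = build_hybrid_params(
        "{!edismax qf=text}solr",               # hypothetical lexical query
        "{!knn f=all_v512 topK=10}[0.1, 0.2]",  # hypothetical (truncated) vector
        cutoff=0.5,
    )
    print(params["fl"])
    print(params["fq"])
```

Note that {{scale()}} has to visit every function value to find the minimum and maximum, so filtering on it may be costly on a large index.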

How can I retrieve the following scores in the same request?

I appreciate any help or documentation that can assist with:
 * Returning separate scores for lexical and vector queries.
 * Implementing cutoff logic natively in Solr.

Thank you!

     Issue Type: Improvement  (was: Task)

> Request for Documentation on Hybrid Lexical and Vector Search with Score Breakdown and Cutoff Logic
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17679
>                 URL: https://issues.apache.org/jira/browse/SOLR-17679
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 9.6.1
>            Reporter: Khaled Alkhouli
>            Priority: Major
>              Labels: hybrid-search, search, solr, vector-based-search
>         Attachments: Screenshot from 2025-02-20 16-31-48.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)