[ https://issues.apache.org/jira/browse/SOLR-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Khaled Alkhouli updated SOLR-17679:
-----------------------------------
    Priority: Minor  (was: Major)

> Request for Documentation/Feature Improvement on Hybrid Lexical and Vector
> Search with Score Breakdown and Cutoff Logic
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17679
>                 URL: https://issues.apache.org/jira/browse/SOLR-17679
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 9.6.1
>            Reporter: Khaled Alkhouli
>            Priority: Minor
>              Labels: hybrid-search, search, solr, vector-based-search
>         Attachments: Screenshot from 2025-02-20 16-31-48.png
>
>
> Hello Apache Solr team,
> I was able to implement a hybrid search engine that combines *lexical search
> (edismax)* and *vector search (KNN-based embeddings)* within a single
> request. The idea is simple:
> * *Lexical Search* retrieves results based on text relevance.
> * *Vector Search* retrieves results based on semantic similarity.
> * *Hybrid Scoring* sums both scores, where a missing score (if a document
> appears in only one search) should be treated as zero.
> This approach is working, but *there is a critical lack of documentation* on
> how to properly return individual score components of lexical search (score1)
> vs. vector search (score2 from cosine similarity). Right now, Solr only
> returns the final combined score, but there is no clear way to see *how
> much of that score comes from lexical search vs. vector search*. This is
> essential for debugging and for fine-tuning ranking strategies.
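The scoring rule described above (sum both scores, treating a missing score as zero) can be sketched client-side as a reference when checking what Solr returns. This is only a minimal sketch, not Solr's implementation: the function name `combine_scores` and the input dictionaries mapping document id to score are assumptions made for illustration.

```python
def combine_scores(lexical, vector):
    """Sum per-document scores; a document missing from one side scores 0 there.

    `lexical` and `vector` are hypothetical dicts mapping doc id -> score.
    Returns a list of (doc_id, lexical, vector, combined), best first.
    """
    doc_ids = set(lexical) | set(vector)  # union of both result sets
    rows = []
    for doc_id in doc_ids:
        lex = lexical.get(doc_id, 0.0)  # missing lexical score counts as zero
        vec = vector.get(doc_id, 0.0)   # missing vector score counts as zero
        rows.append((doc_id, lex, vec, lex + vec))
    # Sort by combined score, descending; ties are broken by doc id.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows


example = combine_scores({"a": 0.5, "b": 1.0}, {"b": 0.25, "c": 0.75})
for doc_id, lex, vec, total in example:
    print(doc_id, lex, vec, total)
```

Keeping the two components next to the sum is the breakdown the issue asks Solr to expose directly.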
>
> I have implemented the following logic using Python:
> {code:python}
> import numpy as np
> import requests
>
> # embed(), SOLR_URL and HEADERS are defined elsewhere.
> def hybrid_search(query, top_k=10):
>     embedding = np.array(embed([query]), dtype=np.float32)
>     # tolist() yields plain Python floats, so the vector prints as [0.1, 0.2, ...]
>     embedding = embedding[0].tolist()
>     # Lexical (edismax) query over the 'text' field
>     lxq = rf"""{{!type=edismax
>                 qf='text'
>                 q.op=OR
>                 tie=0.1
>                 bq=''
>                 bf=''
>                 boost=''
>             }}({query})"""
>     solr_query = {"params": {
>         "q": "{!bool filter=$retrievalStage must=$rankingStage}",
>         "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
>         "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
>         "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
>         "lexicalQuery": lxq,
>         "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
>         "fl": "text",
>         "rows": top_k,
>         "fq": [""],
>         "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
>         # NOTE: $cutoff is referenced here, but no "cutoff" parameter is set in this request
>         "rqq": "{!frange l=$cutoff}query($rankingStage)",
>         "sort": "score desc",
>     }}
>     response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
>     return response.json()
> {code}
> h3. *Issues & Missing Documentation*
> # *No Way to Retrieve Individual Scores in a Hybrid Search*
> There is no clear documentation on how to return:
> ** The *lexical search score* separately.
> ** The *vector search score* separately.
> ** The *final combined score* (which Solr already provides).
> Right now, we’re left guessing whether the sum of these scores works as
> expected, making debugging and tuning unnecessarily difficult.
> # *No Clear Way to Implement Cutoff Logic in Solr*
> In a hybrid search, I need to filter out results that don’t meet a *minimum
> score threshold*. Right now, I have to implement this in Python, *which
> defeats the purpose of using Solr for ranking in the first place*.
> ** How can we enforce a *score-based cutoff directly in Solr*, without
> external filtering?
> ** The \{!frange} query parser is mentioned in the documentation but lacks
> *clear examples on how to apply it to hybrid search*.
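One possible direction for both points, sketched as a helper that rewrites the request parameters. It relies on two Solr features: function pseudo-fields in `fl` (written as `alias:function(...)`) and the `{!frange}` parser used as a filter query. Whether this combination behaves as wanted with the request above has not been verified against a live Solr, so treat it as an assumption. The helper name `with_breakdown_and_cutoff` and the aliases `lexical_score` and `vector_score` are hypothetical.

```python
def with_breakdown_and_cutoff(params, cutoff):
    """Return a copy of `params` that asks for per-component scores and
    filters out documents whose combined score is below `cutoff`.

    Assumes `params` already defines normalisedLexicalQuery, vectorQuery and
    rankingStage as in the request above, and that "fq" is a list of strings.
    """
    out = dict(params)
    # Function pseudo-fields: each alias:function(...) entry is evaluated per document.
    out["fl"] = ",".join([
        params.get("fl", "*"),
        "score",
        "lexical_score:query($normalisedLexicalQuery)",
        "vector_score:query($vectorQuery)",
    ])
    # Keep existing non-empty filters and append the frange lower bound.
    existing = [fq for fq in params.get("fq", []) if fq]
    out["fq"] = existing + [f"{{!frange l={cutoff}}}query($rankingStage)"]
    return out


base = {"fl": "text", "fq": [""], "rows": 10}
updated = with_breakdown_and_cutoff(base, 0.5)
print(updated["fl"])
print(updated["fq"])
```

The helper returns a new dictionary and leaves the input untouched, so the same base parameters can be reused with different cutoff values while tuning.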
> h3. *Feature Request / Documentation Improvement*
> * *Provide a way to return individual scores for lexical and vector search
> in the response.* This should be as simple as adding fields like
> {{fl=score,lexical_score,vector_score}}.
> * *Clarify how to apply cutoff logic in a hybrid search.* This is an
> essential ranking mechanism, and yet, there’s little guidance on how to do
> this efficiently within Solr itself.
> Looking forward to a response.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org