Hi everyone, I've been running OpenSolr.com (hosted Solr) for about 10 years, and recently implemented hybrid dense vector + lexical search. Wanted to share what worked and maybe get feedback from the community.
The problem:
When combining knn vector scores (0-1 range) with edismax scores (unbounded,
often 10-50+), naive addition doesn't work. Lexical dominates every time, even
when semantically wrong.
The solution:
Saturation normalization on lexical scores:
{!func}sum(
  product($vector_weight, query($vectorQuery)),
  product($lexical_weight,
          div(query($lexicalQuery), sum(query($lexicalQuery), $k)))
)
The div(score, sum(score, k)) term computes score / (score + k), which maps any non-negative lexical score into the 0-1 range. With k=10:
lexical 10 → 0.50
lexical 20 → 0.67
lexical 50 → 0.83
Now vectors and lexical compete fairly.
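To make the arithmetic concrete, here is a minimal Python sketch of the same formula outside Solr. The function names, the equal 0.5 weights, and the two example documents are made up for illustration; only the score / (score + k) shape and k=10 come from the formula above.

```python
def saturate(score, k=10.0):
    """Map a non-negative lexical score into [0, 1) via score / (score + k)."""
    if score < 0 or k <= 0:
        raise ValueError("score must be >= 0 and k must be > 0")
    return score / (score + k)


def hybrid_score(vector_score, lexical_score,
                 vector_weight=0.5, lexical_weight=0.5, k=10.0):
    """Weighted sum of the raw vector score and the saturated lexical score."""
    return (vector_weight * vector_score
            + lexical_weight * saturate(lexical_score, k))


# Reproduce the k=10 table from the post.
for s in (10, 20, 50):
    print(f"lexical {s} -> {saturate(s):.2f}")

# Hypothetical documents: A is the better semantic match,
# B only has the higher keyword score.
doc_a = {"vector": 0.9, "lexical": 12.0}
doc_b = {"vector": 0.2, "lexical": 30.0}

naive_a = doc_a["vector"] + doc_a["lexical"]  # naive addition: lexical dominates
naive_b = doc_b["vector"] + doc_b["lexical"]
fair_a = hybrid_score(doc_a["vector"], doc_a["lexical"])
fair_b = hybrid_score(doc_b["vector"], doc_b["lexical"])

print("naive winner:", "A" if naive_a > naive_b else "B")
print("hybrid winner:", "A" if fair_a > fair_b else "B")
```

With these made-up numbers, naive addition ranks B first purely on keyword score, while the saturated version ranks A first. Smaller k saturates sooner, so k is worth tuning against the score range your own index produces.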
Other pieces:
Using paraphrase-multilingual-MiniLM-L12-v2 for embeddings (CPU, no GPU)
mm="3<90% 5<75% 8<60% 12<50%" for minimum match tuning
Emoji queries via emoji.demojize() before embedding (🔥 → "fire")
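As a rough sketch of how the scoring function and the mm setting could be assembled into one request (not necessarily how the OpenSolr setup is wired), the Python below builds the request parameters. The field names (embedding, title, content), the 0.6/0.4 weights, topK=100, and the helper parameter names lexical_qf, lexical_mm and user_query are all assumptions for illustration; the function query, the vectorQuery/lexicalQuery/k parameter names and the mm string come from this post.

```python
from urllib.parse import urlencode

# The scoring function from this post, written on one line.
HYBRID_FUNC = (
    "{!func}sum("
    "product($vector_weight,query($vectorQuery)),"
    "product($lexical_weight,"
    "div(query($lexicalQuery),sum(query($lexicalQuery),$k)))"
    ")"
)


def build_hybrid_params(user_query, embedding,
                        vector_weight=0.6, lexical_weight=0.4,
                        k=10, top_k=100):
    """Build Solr request params for the hybrid query (illustrative only)."""
    # The knn query parser takes the query vector as a bracketed float list.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
    return {
        "q": HYBRID_FUNC,
        # "embedding" is a hypothetical dense vector field name.
        "vectorQuery": "{!knn f=embedding topK=%d}%s" % (top_k, vector_literal),
        # Local params dereference the request params defined below.
        "lexicalQuery": "{!edismax qf=$lexical_qf mm=$lexical_mm v=$user_query}",
        "lexical_qf": "title content",  # hypothetical field list
        "lexical_mm": "3<90% 5<75% 8<60% 12<50%",
        "user_query": user_query,
        "vector_weight": str(vector_weight),
        "lexical_weight": str(lexical_weight),
        "k": str(k),
        "debugQuery": "true",
    }


# A 3-dimensional vector stands in for a real embedding here.
params = build_hybrid_params("pellet heater", [0.12, -0.03, 0.88])
query_string = urlencode(params)
print(query_string)
```

Keeping the user text and the mm string in their own parameters means they never need escaping inside the local-params braces.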
Live demos with debug inspector:
I exposed the full debugQuery output so you can see exactly what's happening:
Cross-lingual (EN→RO): https://opensolr.com/search/dedeman?q=pellet+heater
Emoji search: https://opensolr.com/search/vector?q=%F0%9F%94%A5
Semantic matching:
https://opensolr.com/search/peilishop?q=stuff+to+wear+around+my+neck
Click the Debug button on any search to see params, parsed query, and explain
output.
Questions for the community:
Anyone else doing hybrid scoring differently? Curious about other normalization
approaches.
Is there interest in a more detailed write-up on the mm tuning for hybrid
scenarios?
Any gotchas with knn + function queries I should watch for at scale?
Happy to share more implementation details if useful.
Cheers,
Chip
OpenSolr.com
