Hi everyone,

I've been running OpenSolr.com (hosted Solr) for about 10 years, and recently 
implemented hybrid dense vector + lexical search. Wanted to share what worked 
and maybe get feedback from the community.

The problem:

When combining knn vector scores (0-1 range) with edismax scores (unbounded, 
often 10-50+), naive addition doesn't work. The lexical score dominates every 
time, even when the match is semantically wrong.

The solution:

Saturation normalization on lexical scores:

   {!func}sum(
     product($vector_weight, query($vectorQuery)),
     product($lexical_weight,
             div(query($lexicalQuery), sum(query($lexicalQuery), $k)))
   )

The div(score, sum(score, k)) term computes score / (score + k), which maps any 
non-negative lexical score into the 0-1 range. With k=10:

lexical 10 → 0.50
lexical 20 → 0.67
lexical 50 → 0.83

Now vector and lexical scores compete fairly.
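
To make the arithmetic concrete, here is a minimal Python sketch of the same formula outside Solr. The function names, the 0.5/0.5 weights, and the two example documents are assumptions for illustration only, not values taken from the setup above.

```python
def saturate(score: float, k: float = 10.0) -> float:
    """Map a non-negative lexical score into [0, 1) via score / (score + k)."""
    if score <= 0:
        # Guard added for this sketch: treat missing or non-positive scores as 0.
        return 0.0
    return score / (score + k)


def hybrid_score(vector_score: float, lexical_score: float,
                 vector_weight: float = 0.5, lexical_weight: float = 0.5,
                 k: float = 10.0) -> float:
    """Weighted sum of the vector score and the saturated lexical score."""
    return vector_weight * vector_score + lexical_weight * saturate(lexical_score, k)


if __name__ == "__main__":
    # Reproduces the k=10 table.
    for s in (10, 20, 50):
        print(f"lexical {s} -> {saturate(s):.2f}")

    # Hypothetical documents as (vector score, lexical score):
    # A is a strong semantic match, B a strong keyword match.
    doc_a, doc_b = (0.9, 5.0), (0.2, 40.0)
    print("naive  A, B:", sum(doc_a), sum(doc_b))
    print("hybrid A, B:", hybrid_score(*doc_a), hybrid_score(*doc_b))
```

With plain addition B outranks A purely on lexical magnitude. After saturation A comes out ahead, because its vector score is no longer drowned out.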

Other pieces:

Using paraphrase-multilingual-MiniLM-L12-v2 for embeddings (CPU, no GPU)
mm="3<90% 5<75% 8<60% 12<50%" for minimum match tuning
Emoji queries via emoji.demojize() before embedding (🔥 → "fire")
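
For anyone unfamiliar with conditional mm specs, the sketch below shows how a spec like the one above is evaluated. It follows my reading of the documented eDisMax mm semantics, handles only non-negative values, and uses integer arithmetic, so treat it as an approximation and not as Solr's actual code. The function names are made up for the example.

```python
def _apply(clause_count: int, value: str) -> int:
    """Resolve one mm value: 'N%' is a percentage rounded down, 'N' is absolute."""
    if value.endswith("%"):
        percent = int(value[:-1])
        return clause_count * percent // 100
    return int(value)


def min_should_match(clause_count: int, spec: str) -> int:
    """How many optional clauses must match for a query with clause_count clauses."""
    result = clause_count  # at or below the first bound, every clause is required
    for part in spec.split():
        bound, _, value = part.partition("<")
        if clause_count <= int(bound):
            return result
        # More clauses than this bound: this rule applies unless a later one does.
        result = _apply(clause_count, value)
    return result


if __name__ == "__main__":
    spec = "3<90% 5<75% 8<60% 12<50%"
    for n in (2, 3, 4, 6, 9, 13):
        print(n, "clauses ->", min_should_match(n, spec), "must match")
```

Under this reading, queries of up to 3 terms require every term, and the requirement loosens gradually for longer queries.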

Live demos with debug inspector:

I exposed the full debugQuery output so you can see exactly what's happening:

Cross-lingual (EN→RO): https://opensolr.com/search/dedeman?q=pellet+heater
Emoji search: https://opensolr.com/search/vector?q=%F0%9F%94%A5
Semantic matching: https://opensolr.com/search/peilishop?q=stuff+to+wear+around+my+neck

Click the Debug button on any search to see the params, parsed query, and 
explain output.

Questions for the community:

Anyone else doing hybrid scoring differently? Curious about other normalization approaches.
Is there interest in a more detailed write-up on the mm tuning for hybrid scenarios?
Any gotchas with knn + function queries I should watch for at scale?

Happy to share more implementation details if useful.

Cheers, 

Chip 

OpenSolr.com
