Hi Solr community,

We are looking to add hybrid search capabilities (dense vector KNN +
lexical) to our existing production Solr setup and would like guidance from
people who have done a similar migration.

*OUR CURRENT SETUP*

   - Apache Solr 9.6.1, SolrCloud
   - Large product catalogue (~2-2.5M documents per shard, 65 shards in
   total)
   - Fully lexical search today: edismax on text fields (product title,
   attributes, location etc.) with custom scoring, boosting, and faceting
   - Search has been in production for years and is well-tuned


*WHAT WE WANT TO ADD*

We want to introduce product image embeddings (SigLIP2, 768-dim) alongside
our existing lexical search to enable hybrid retrieval. The idea is:

   - At index time: generate a 768-dim embedding per product and store it
   as a DenseVectorField (img_emb)
   - At query time: encode the text query into a vector, then combine the
   KNN result with the existing edismax lexical result
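
For illustration, the two steps above could be sketched roughly as follows.
This is a minimal sketch, not a validated configuration: the field type name
`knn_vector_768`, the cosine similarity choice, and the `topK` value are
placeholders. The payload shape follows the Solr Schema API commands
(`add-field-type`, `add-field`) and the query string follows the `{!knn}`
query parser syntax.

```python
def dense_vector_schema_commands(field="img_emb", dims=768):
    """Build Schema API commands that add a dense vector field type and field."""
    type_name = f"knn_vector_{dims}"  # hypothetical type name
    return {
        "add-field-type": {
            "name": type_name,
            "class": "solr.DenseVectorField",
            "vectorDimension": dims,
            # Assumption: cosine similarity; not yet validated for our embeddings.
            "similarityFunction": "cosine",
        },
        "add-field": {
            "name": field,
            "type": type_name,
            "indexed": True,
            "stored": True,
        },
    }


def knn_query(vector, field="img_emb", top_k=100):
    """Format a {!knn} query string for an already-encoded query vector."""
    values = ", ".join(f"{v:.6f}" for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"
```

The dictionary would be POSTed as JSON to the collection's schema endpoint,
and the query string would be passed wherever a Solr query is accepted.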


*SPECIFIC QUESTIONS*

1. SCHEMA MIGRATION
   What is the safest way to add a DenseVectorField to an existing
production schema without downtime? Do we need a full re-index, or can we
add the field and backfill embeddings incrementally on existing
documents? (We have roughly 200 million documents.)
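
To make the incremental option concrete, here is a minimal sketch of how a
backfill via atomic updates might be batched. It assumes the unique key is
`id` and that the batch size of 500 is only a starting point. Two caveats we
are aware of: atomic updates rebuild the whole document from stored/docValues
fields, so this is only safe if all our other source fields are stored or have
docValues; and we have not verified that an atomic `set` on a
DenseVectorField behaves correctly in 9.6.1.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def atomic_vector_updates(
    embeddings: Iterable[Tuple[str, Sequence[float]]],
    field: str = "img_emb",
    batch_size: int = 500,
) -> Iterator[List[Dict]]:
    """Group (doc_id, vector) pairs into batches of atomic 'set' updates."""
    batch: List[Dict] = []
    for doc_id, vector in embeddings:
        # Atomic update syntax: only the named field is modified.
        batch.append({"id": doc_id, field: {"set": list(vector)}})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

Each yielded batch would be sent as one JSON body to the collection's
`/update` handler.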

2. SHOULD WE UPGRADE TO SOLR 10 FIRST?
   We are currently on Solr 9.6.1. Before investing in hybrid search, we
want to know whether it is worth upgrading to Solr 10 first.

   - Does Solr 10 offer significantly better hybrid search support compared
   to 9.6 (e.g. better KNN performance, new ranking primitives, RRF
   support, cleaner query syntax)?
   - Are there any known breaking changes in Solr 10 that would affect an
   existing large-scale edismax-based setup?
   - Is Solr 9.6.1 good enough to build a solid hybrid search system on, so
   that the upgrade can be done independently later?


3. INDEXING EMBEDDINGS AT SCALE
   How should we handle full indexing (200 million documents) and
   incremental indexing?

   - Are there known performance or memory issues when indexing large
   DenseVectorFields alongside existing text fields?
   - Any guidance on HNSW index build time, segment merge behaviour, or RAM
   requirements when adding a dense vector field to an existing collection of
   this size?
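
For context on the RAM question, this is the back-of-the-envelope arithmetic
we are working from. It assumes float32 storage (4 bytes per dimension) and
counts raw vector data only; HNSW graph links and any temporary overhead
during segment merges are not included, which is part of what we are asking
about.

```python
def raw_vector_bytes(num_docs: int, dims: int = 768, bytes_per_value: int = 4) -> int:
    """Size of the raw vector data alone, excluding HNSW graph overhead."""
    return num_docs * dims * bytes_per_value


# Upper end of our per-shard document count.
per_shard = raw_vector_bytes(2_500_000)    # 7_680_000_000 bytes, about 7.7 GB
# Whole catalogue, before replication.
total = raw_vector_bytes(200_000_000)      # 614_400_000_000 bytes, about 614 GB
```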


4. KEEPING EXISTING LEXICAL BEHAVIOUR INTACT
   Our current lexical ranking uses custom boosting, function queries, and
   faceting that have been tuned over years. We are concerned that
   introducing hybrid scoring will disturb this.

   - Is it possible to run hybrid search on a subset of queries (e.g. only
   when a vector is available) while falling back cleanly to pure lexical for
   others, all within the same request handler?
   - What is the recommended way to weight the KNN signal relative to an
   already-complex lexical score, without breaking the existing tuning?
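
One pattern we have seen described, sketched here only to frame the question,
is to build the request parameters conditionally and combine the two clauses
with the `{!bool}` query parser. The field names in `qf`, the parameter names
`lexq`, `vecq`, and `userq`, and the `topK` value are all hypothetical. As far
as we understand, this sums the raw scores of both clauses and adds the KNN
hits to the result set, which is exactly the behaviour we are unsure about
with respect to our tuning and facet counts.

```python
def build_params(user_query, query_vector=None, qf="title attributes", top_k=100):
    """Return Solr request params: pure lexical, or hybrid if a vector is given."""
    if query_vector is None:
        # Fallback: the existing edismax request, unchanged.
        return {"defType": "edismax", "qf": qf, "q": user_query}

    values = ", ".join(str(v) for v in query_vector)
    return {
        # Either clause may match; scores of matching clauses are summed.
        "q": "{!bool should=$lexq should=$vecq}",
        "lexq": "{!edismax v=$userq}",  # picks up qf from the request params
        "vecq": f"{{!knn f=img_emb topK={top_k}}}[{values}]",
        "userq": user_query,
        "qf": qf,
    }
```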


5. SCORE COMBINATION

   - What is the recommended approach to normalise the two scores before
   combining them in Solr? Is there a Solr function query that
   handles this, or should normalisation happen in the application layer
   before the query is sent?
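
As a point of comparison, the application-layer option we are considering is
reciprocal rank fusion, which avoids score normalisation by using only the
rank positions from each result list. The sketch below is a generic
implementation, not a Solr feature; the constant `k=60` is the commonly cited
default, and the function name is ours.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical_hits = ["a", "b", "c"]   # ids from the edismax query
knn_hits = ["b", "d", "a"]       # ids from the KNN query
fused = reciprocal_rank_fusion([lexical_hits, knn_hits])  # ["b", "a", "d", "c"]
```

The cost of this approach is two round trips to Solr per search and losing
the ability to facet over the fused result set in a single request.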


*SUMMARY OF WHAT WE ARE TRYING TO DECIDE*

Essentially we are trying to figure out the right order of operations:

  Option A: Add hybrid search on Solr 9.6 now, upgrade to Solr 10 later
  Option B: Upgrade to Solr 10 first, then build hybrid search on it
  Option C: Something else entirely that we have not considered

If Solr 10 brings meaningful improvements specifically for hybrid search
(RRF, better KNN, cleaner query API), we would rather do the upgrade first.
If 9.6 is production-ready for this use case and 10 does not change the
hybrid search story significantly, we will proceed on 9.6 and upgrade
independently later.

Any guidance, production experiences, or documentation pointers from people
who have done this migration in production would be hugely appreciated.

Thank you!
Sunny Singh
IndiaMART
