Hi Solr community,

We are looking to add hybrid search capabilities (dense vector KNN + lexical) to our existing production Solr setup and would like guidance from people who have done a similar migration.

*OUR CURRENT SETUP*

- Apache Solr 9.6.1, SolrCloud
- Large product catalogue (~2-2.5M documents per shard, 65 shards in total)
- Fully lexical search today: edismax on text fields (product title, attributes, location, etc.) with custom scoring, boosting, and faceting
- Search has been in production for years and is well-tuned

*WHAT WE WANT TO ADD*

We want to introduce product image embeddings (SigLIP2, 768-dim) alongside our existing lexical search to enable hybrid retrieval. The idea is:

- At index time: generate a 768-dim embedding per product and store it in a DenseVectorField (img_emb)
- At query time: encode the text query into a vector, then combine the KNN result with the existing edismax lexical result

*SPECIFIC QUESTIONS*

1. SCHEMA MIGRATION

What is the safest way to add a DenseVectorField to an existing production schema without downtime? Do we need a full re-index, or can we add the field and backfill embeddings incrementally on existing documents? (We have roughly 200 million documents.)

2. SHOULD WE UPGRADE TO SOLR 10 FIRST?

We are currently on Solr 9.6.1. Before investing in hybrid search, we want to know whether it is worth upgrading to Solr 10 first.

- Does Solr 10 offer significantly better hybrid search support than 9.6 (e.g. better KNN performance, new ranking primitives, RRF support, cleaner query syntax)?
- Are there any known breaking changes in Solr 10 that would affect an existing large-scale edismax-based setup?
- Is Solr 9.6.1 good enough to build a solid hybrid search system on, so that the upgrade can be done later, independently?

3. INDEXING EMBEDDINGS AT SCALE

How should we handle both the full indexing run (200 million documents) and incremental indexing?

- Are there known performance or memory issues when indexing large DenseVectorFields alongside existing text fields?
- Any guidance on HNSW index build time, segment merge behaviour, or RAM requirements when adding a dense vector field to an existing collection of this size?

4. KEEPING EXISTING LEXICAL BEHAVIOUR INTACT

Our current lexical ranking uses custom boosting, function queries, and faceting that have been tuned over years. We are concerned that introducing hybrid scoring will disturb this.

- Is it possible to run hybrid search on a subset of queries (e.g. only when a vector is available) while falling back cleanly to pure lexical for the others, all within the same request handler?
- What is the recommended way to weight the KNN signal relative to an already-complex lexical score without breaking the existing tuning?

5. SCORE COMBINATION

- What is the recommended approach to normalise the two scores before combining them in Solr? Is there a Solr function query that handles this, or should normalisation happen in the application layer before the query is sent?

*SUMMARY OF WHAT WE ARE TRYING TO DECIDE*

Essentially, we are trying to figure out the right order of operations:

- Option A: Add hybrid search on Solr 9.6 now, upgrade to Solr 10 later
- Option B: Upgrade to Solr 10 first, then build hybrid search on it
- Option C: Something else entirely that we have not considered

If Solr 10 brings meaningful improvements specifically for hybrid search (RRF, better KNN, cleaner query API), we would rather do the upgrade first. If 9.6 is production-ready for this use case and 10 does not change the hybrid search story significantly, we will proceed on 9.6 and upgrade independently later.

Any guidance, production experiences, or documentation pointers from people who have done this migration in production would be hugely appreciated.

Thank you!

Sunny Singh
IndiaMART
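
P.S. To make questions 4 and 5 more concrete, here is a rough sketch of what application-layer fusion could look like if the two result lists (edismax and KNN) were fetched separately and merged with Reciprocal Rank Fusion. This is only an illustration of the idea, not something we run today: the function name `rrf_fuse`, the constant `k = 60` (a commonly used default for RRF), and the sample document IDs are all assumptions made for the example. Because RRF works on ranks rather than raw scores, it sidesteps score normalisation, and passing only the lexical list leaves the lexical order untouched.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Raw Solr scores are not used at all.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties on doc ID so the
    # output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical IDs standing in for the top hits of each retriever.
lexical_hits = ["p1", "p2", "p3", "p4"]   # edismax order
knn_hits = ["p3", "p5", "p1", "p6"]       # KNN order on img_emb

hybrid = rrf_fuse([lexical_hits, knn_hits])

# Fallback when no query vector is available: lexical order is preserved.
lexical_only = rrf_fuse([lexical_hits])
```

The open question for us is whether something equivalent can be done inside Solr itself, so that faceting and paging keep working on the fused list.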
