Just to let you know that in the next few weeks I'll resume my work on nested vectors in Solr and include Hoss's observations in my contribution.
Stay tuned!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr Chair of PMC*

e-mail: [email protected]

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd>
| Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Thu, 23 Oct 2025 at 00:11, Rahul Goswami <[email protected]> wrote:

> Guillaume Hoss, Tom,
> Thank you for your inputs.
>
> Tom,
> Thanks for the detailed explanation. I am also going over your talk as we
> speak. A follow-up on your index design: what advantage does the nested
> doc design provide in this case?
>
> If my understanding is correct, had the parent and child docs been
> unrelated docs connected by a "secondary key" in the child docs (say
> "ParentId"), you could still have used the "join" parser and achieved the
> same result as the "parent" parser, no?
>
> Especially since the JIRA for getting topK parent hits is still in progress
> (https://issues.apache.org/jira/browse/SOLR-17736).
>
> How are you handling any changes to your child docs? (Since you'd need to
> reindex the whole block, I assume?)
>
> Thanks,
> Rahul
>
> On Mon, Oct 6, 2025 at 2:49 PM Burgmans, Tom
> <[email protected]> wrote:
>
> > We managed to get our required flavor of hybrid search working in Solr
> > via a nested index.
> >
> > The required flavor: applying both lexical search and vector search in a
> > single search call with a logical OR (a document could match purely
> > lexically, purely as a vector match, or both). Our documents are large
> > enough that chunking is needed, and of course no duplicate results are
> > allowed.
> >
> > The nested index is a construction where the parent documents are the
> > large original documents and the children are the chunks that are
> > vectorized:
> >
> > <doc>
> >   <field name="id">doc-1</field>
> >   <field name="type_s">parent</field>
> >   <field name="full_doc_title">This is the title text</field>
> >   <field name="full_doc_body">This is the full body text</field>
> >   <field name="metadata1_s">some metadata</field>
> >   <doc>
> >     <field name="id">doc-1.1</field>
> >     <field name="parentDoc">doc-1</field>
> >     <field name="type_s">child</field>
> >     <field name="chunk_body">This is the chunk body text</field>
> >     <field name="chunkoffsets_s">8123-12123</field>
> >     <field name="vector_field"><![CDATA[-0.0037859276]]></field>
> >     <field name="vector_field"><![CDATA[-0.012503299]]></field>
> >     <field name="vector_field"><![CDATA[0.018080892]]></field>
> >     <field name="vector_field"><![CDATA[0.0024048693]]></field>
> >     ...
> >   </doc>
> >   <doc>
> >     <field name="id">doc-1.2</field>
> >     <field name="parentDoc">doc-1</field>
> >     <field name="type_s">child</field>
> >     <field name="chunk_body">This is the body text of another chunk</field>
> >     <field name="chunkoffsets_s">12200-12788</field>
> >     <field name="vector_field"><![CDATA[-0.0034859276]]></field>
> >     <field name="vector_field"><![CDATA[0.0024048693]]></field>
> >     <field name="vector_field"><![CDATA[-0.016224038]]></field>
> >     <field name="vector_field"><![CDATA[0.025224038]]></field>
> >     ...
> >   </doc>
> >   <doc>
> >     ...
> >   </doc>
> > </doc>
> >
> > This query construction searches the parents lexically and the children
> > via ANN search. The result set contains full documents only. Balancing the
> > impact of lexical vs. vector happens via kwweight and vectorweight (these
> > values may change per query, depending on its nature).
> > Note that this construction doesn't include score normalization, because
> > this is an expensive operation when there are many results and, moreover,
> > normalization doesn't guarantee proper blending of relevant lexical and
> > vector results.
> >
> > params:{
> >   uf:"* _query_",
> >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> >   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> >   kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
> >   qq:"What is the income tax in New York?",
> >   vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
> >   childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
> >   kwweight:1,
> >   vectorweight:4
> > }
> >
> > This nested index is multi-purpose: for hybrid searching full documents
> > (the construction above) and for hybrid searching the chunks only (see
> > below).
> >
> > The following query construction searches the chunks both lexically and
> > via ANN search. The result set contains chunks only. This is meant for RAG
> > use cases where we're only interested in document chunks as context for
> > the LLM.
> >
> > params:{
> >   uf:"* _query_",
> >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> >   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> >   kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
> >   qq:"What is the income tax in New York?",
> >   vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
> >   kwweight:1,
> >   vectorweight:4
> > }
> >
> > We recently gave a presentation about this and other things at the
> > Haystack EU 2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI
> >
> > Regards, Tom
> >
> > -----Original Message-----
> > From: Rahul Goswami <[email protected]>
> > Sent: Sunday, August 31, 2025 2:08 PM
> > To: [email protected]
> > Subject: Re: TopK strategy for vectorized chunks in Solr
> >
> > Hello,
> > Floating this up again in case anyone has any insights. Thanks.
> >
> > Rahul
> >
> > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> > wrote:
> >
> > > Hello,
> > > A question for folks using Solr as the vector db in their solutions.
> > > As of now, since Solr doesn't support parent/child or multi-valued
> > > vector fields for vector search, what are some strategies that
> > > can be used to avoid duplicates in top K results when you have
> > > vectorized chunks for the same (large) document?
> > >
> > > It would also be helpful to know how folks are doing this when storing
> > > vectors in the same docs as the lexical index vs. when having the
> > > vectorized chunks in a separate index.
> > >
> > > Thanks.
> > > Rahul
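
On the original question of avoiding duplicate documents in the top K: the full-document query in this thread handles it with score=max in the {!parent} parser and a weighted sum in hybridscore. The arithmetic can be sketched client-side in Python. This is only an illustration of the blending, not how Solr executes the query. The function name blend_parent_scores, the input shapes, and the sample scores are all hypothetical, and a document that misses one side is assumed to contribute 0 for that side.

```python
def blend_parent_scores(lexical_hits, chunk_hits, kwweight=1.0, vectorweight=4.0):
    """Collapse chunk hits to their parents and blend with lexical hits.

    lexical_hits: {parent_id: lexical_score}             (hypothetical shape)
    chunk_hits:   [(chunk_id, parent_id, vector_score)]  (hypothetical shape)
    """
    # Mirror score=max: each parent keeps only the score of its best chunk,
    # so several matching chunks never produce several results.
    best_chunk = {}
    for _chunk_id, parent_id, score in chunk_hits:
        if score > best_chunk.get(parent_id, float("-inf")):
            best_chunk[parent_id] = score

    # Mirror the logical OR: a parent qualifies if either side matched it.
    candidates = set(lexical_hits) | set(best_chunk)

    # Mirror the weighted sum; a missing side is assumed to count as 0.
    blended = {
        pid: kwweight * lexical_hits.get(pid, 0.0)
        + vectorweight * best_chunk.get(pid, 0.0)
        for pid in candidates
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Made-up scores, for illustration only.
lexical = {"doc-1": 2.0, "doc-2": 3.0}
chunks = [
    ("doc-1.1", "doc-1", 0.5),
    ("doc-1.2", "doc-1", 0.75),
    ("doc-3.1", "doc-3", 0.25),
]
ranked = blend_parent_scores(lexical, chunks)
```

Here doc-1 appears once even though two of its chunks matched, and doc-3 still qualifies with a vector match alone.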

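
The full-document query parameters quoted in this thread can also be assembled programmatically. A minimal Python sketch follows. build_hybrid_request is a hypothetical helper, the three-value vector is a placeholder (a real one would have to match the dimension of vector_field), and sending the body to Solr is left out because the endpoint and collection are not given in the thread.

```python
import json


def build_hybrid_request(question, query_vector, kwweight=1, vectorweight=4, top_k=10):
    """Build a request body in the params:{...} shape quoted in the thread."""
    # Render the vector as the bracketed literal the knn parser is given above.
    vector_literal = "[" + ",".join(repr(float(x)) for x in query_vector) + "]"
    return {
        "params": {
            "uf": "* _query_",
            "q": "{!bool filter=$hybridlogic must=$hybridscore}",
            # Logical OR: match lexically, by vector, or both.
            "hybridlogic": "{!bool should=$kwq should=$vectorq}",
            # Weighted sum of the lexical score and the best-chunk vector score.
            "hybridscore": "{!func}sum(product($kwweight,$kwq),"
                           "product($vectorweight,query($vectorq)))",
            "kwq": '{!type=edismax qf="full_doc_body full_doc_title^3" v=$qq}',
            "qq": question,
            # Lift chunk matches to their parent, keeping the best chunk score.
            "vectorq": '{!parent which="type_s:parent" score=max v=$childq}',
            "childq": "{!knn f=vector_field topK=%d}%s" % (top_k, vector_literal),
            "kwweight": kwweight,
            "vectorweight": vectorweight,
        }
    }


body = build_hybrid_request(
    "What is the income tax in New York?",
    [-0.0034859276, -0.028224038, 0.0024048693],  # placeholder vector
)
payload = json.dumps(body, indent=2)
```

Keeping kwweight and vectorweight as arguments matches the note in the thread that these values may change per query.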