Just to let you know that in the next few weeks I'll resume my work on
nested vectors in Solr and include Hoss's observations in my contribution.

Stay tuned!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr Chair of PMC*

e-mail: [email protected]


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Thu, 23 Oct 2025 at 00:11, Rahul Goswami <[email protected]> wrote:

> Guillaume, Hoss, Tom,
> Thank you for your inputs.
>
> Tom,
> Thanks for the detailed explanation. I am also going over your talk as we
> speak. A follow-up on your index design: curious to know what advantage
> the nested doc design provides in this case?
>
> If my understanding is correct, had the parent and child docs been indexed
> as separate (non-nested) docs connected by a "secondary key" in the child
> docs (say "ParentId"), you could still have used the "join" parser and
> achieved the same result as the "parent" parser, no?
>
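> For illustration (a rough sketch on my part, not tested, and assuming the
> flat child docs carry a "ParentId" field), the vector leg could presumably
> have been written with the join parser instead:
>
>   vectorq:"{!join from=ParentId to=id score=max v=$childq}",
>   childq:"{!knn f=vector_field topK=10}[...]",
>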
> Especially since the JIRA for getting topK parent hits is still in progress
> (https://issues.apache.org/jira/browse/SOLR-17736).
>
> How are you handling any changes to your child docs? (Since I assume you'd
> need to reindex the whole block?)
>
> Thanks,
> Rahul
>
> On Mon, Oct 6, 2025 at 2:49 PM Burgmans, Tom <[email protected]>
> wrote:
>
> > We managed to get our required flavor of hybrid search working in Solr
> > via a nested index.
> >
> > The required flavor: applying both lexical search and vector search in a
> > single search call with a logical OR (a document could match purely
> > lexically, purely as a vector match, or both). Our documents are large
> > enough that chunking is needed, and of course no duplication of results
> > is allowed.
> >
> > The nested index is a construction where the parent documents are the
> > large original documents and the children are the chunks that are
> > vectorized:
> >
> > <doc>
> >         <field name="id">doc-1</field>
> >         <field name="type_s">parent</field>
> >         <field name="full_doc_title">This is the title text</field>
> >         <field name="full_doc_body">This is the full body text</field>
> >         <field name="metadata1_s">some metadata</field>
> >         <doc>
> >                 <field name="id">doc-1.1</field>
> >                 <field name="parentDoc">doc-1</field>
> >                 <field name="type_s">child</field>
> >                 <field name="chunk_body">This is the chunk body text</field>
> >                 <field name="chunkoffsets_s">8123-12123</field>
> >                 <field name="vector_field"><![CDATA[-0.0037859276]]></field>
> >                 <field name="vector_field"><![CDATA[-0.012503299]]></field>
> >                 <field name="vector_field"><![CDATA[0.018080892]]></field>
> >                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
> >                 ...
> >         </doc>
> >         <doc>
> >                 <field name="id">doc-1.2</field>
> >                 <field name="parentDoc">doc-1</field>
> >                 <field name="type_s">child</field>
> >                 <field name="chunk_body">This is the body text of another chunk</field>
> >                 <field name="chunkoffsets_s">12200-12788</field>
> >                 <field name="vector_field"><![CDATA[-0.0034859276]]></field>
> >                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
> >                 <field name="vector_field"><![CDATA[-0.016224038]]></field>
> >                 <field name="vector_field"><![CDATA[0.025224038]]></field>
> >                 ...
> >         </doc>
> >         <doc>
> >         ...
> >         </doc>
> > </doc>
> >
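> > For completeness, the same block expressed in Solr's JSON update format
> > would look roughly like the sketch below (the vector becomes a plain
> > float array; "_childDocuments_" is the anonymous-children syntax, field
> > names as in the XML above):
> >
> > {
> >   "id": "doc-1",
> >   "type_s": "parent",
> >   "full_doc_title": "This is the title text",
> >   "full_doc_body": "This is the full body text",
> >   "metadata1_s": "some metadata",
> >   "_childDocuments_": [
> >     {
> >       "id": "doc-1.1",
> >       "parentDoc": "doc-1",
> >       "type_s": "child",
> >       "chunk_body": "This is the chunk body text",
> >       "chunkoffsets_s": "8123-12123",
> >       "vector_field": [-0.0037859276, -0.012503299, 0.018080892, ...]
> >     },
> >     ...
> >   ]
> > }
> >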
> > This query construction searches the parents lexically and the children
> > via ANN search. The result set contains full documents only. Balancing the
> > impact of lexical vs. vector search happens via kwweight and vectorweight
> > (these values may change per query, depending on its nature). Note that
> > this construction doesn't include score normalization, because that is an
> > expensive operation when there are many results, and moreover
> > normalization doesn't guarantee proper blending of relevant lexical and
> > vector results.
> >
> > params:{
> >   uf:"* _query_",
> >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> >
> >
> hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> >   kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
> >   qq:"What is the income tax in New York?",
> >   vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
> >   childq:"{!knn f=vector_field
> > topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
> >   kwweight:1,
> >   vectorweight:4
> > }
> >
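> > To make the blending concrete with the weights above (kwweight=1,
> > vectorweight=4): a parent that scores, say, 2.0 on the lexical leg and
> > whose best-matching child scores 0.85 on the vector leg ends up with
> > 1*2.0 + 4*0.85 = 5.4 (hypothetical numbers, purely to illustrate the
> > hybridscore function).
> >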
> > This nested index is multi-purpose: it is used both for hybrid search
> > over full documents (the construction above) and for hybrid search over
> > the chunks only (see below).
> >
> > The following query construction searches the chunks both lexically and
> > via ANN search. The result set contains chunks only. This is meant for RAG
> > use cases where we're only interested in document chunks as context for
> > the LLM.
> >
> > params:{
> >   uf:"* _query_",
> >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> >   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> >   kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
> >   qq:"What is the income tax in New York?",
> >   vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
> >   kwweight:1,
> >   vectorweight:4
> > }
> >
> > We recently gave a presentation about this and other things at the
> > Haystack EU 2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI
> >
> >
> > Regards, Tom
> >
> >
> >
> >
> > -----Original Message-----
> > From: Rahul Goswami <[email protected]>
> > Sent: Sunday, August 31, 2025 2:08 PM
> > To: [email protected]
> > Subject: Re: TopK strategy for vectorized chunks in Solr
> >
> > Hello,
> > Floating this up again in case anyone has any insights. Thanks.
> >
> > Rahul
> >
> > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> > wrote:
> >
> > > Hello,
> > > A question for folks using Solr as the vector db in their solutions.
> > > Since Solr doesn't currently support parent/child or multi-valued vector
> > > fields for vector search, what are some strategies that can be used to
> > > avoid duplicates in the top K results when you have vectorized chunks of
> > > the same (large) document?
> > >
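> > > One workaround that comes to mind (just a sketch, assuming the chunk
> > > docs carry a "ParentId" field) would be to collapse the chunk hits on
> > > that field, e.g.:
> > >
> > >   q={!knn f=vector_field topK=10}[...]
> > >   fq={!collapse field=ParentId}
> > >
> > > though since topK is applied before collapsing, this can return fewer
> > > than K distinct parents. Curious whether folks do this or something
> > > better.
> > >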
> > > It would also be helpful to know how folks are doing this when storing
> > > vectors in the same docs as the lexical index vs. when keeping the
> > > vectorized chunks in a separate index.
> > >
> > > Thanks.
> > > Rahul
> > >
> >
>
