That's great news, Alessandro!

I can't wait to try out these future developments!

Thanks again for your work on this area of Solr.

Guillaume

On Fri, Oct 24, 2025 at 11:04, Alessandro Benedetti <[email protected]>
wrote:

> Just to let you know that in the next few weeks I'll resume my work on
> nested vectors in Solr and include Hoss's observations in my contribution.
>
> Stay tuned!
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr Chair of PMC*
>
> e-mail: [email protected]
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Thu, 23 Oct 2025 at 00:11, Rahul Goswami <[email protected]> wrote:
>
> > Guillaume, Hoss, Tom,
> > Thank you for your inputs.
> >
> > Tom,
> > Thanks for the detailed explanation. I am also going over your talk as we
> > speak. A follow-up to your index design... Curious to know what advantage
> > the nested doc design provides in this case?
> >
> > If my understanding is correct, had the parent and child docs been
> > unrelated docs connected by a "secondary key" in the child docs (say
> > "ParentId"), you could still have used the "join" parser and achieved the
> > same result as the "parent" parser, no?
> >
> > Especially since the JIRA for getting topK parent hits is still in
> > progress (https://issues.apache.org/jira/browse/SOLR-17736).
> >
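> > (Purely to illustrate what I mean, reusing your param names, I was
> > picturing a score join over a flat index, along the lines of:
> >
> >   vectorq:"{!join from=ParentId to=id score=max v=$childq}",
> >   childq:"{!knn f=vector_field topK=10}[...]"
> >
> > instead of the {!parent} parser, so please correct me if the two wouldn't
> > be effectively equivalent here.)
> >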
> > How are you handling any changes to your child docs? (Since you'd need to
> > reindex the whole block, I assume?)
> >
> > Thanks,
> > Rahul
> >
> > On Mon, Oct 6, 2025 at 2:49 PM Burgmans, Tom
> > <[email protected]> wrote:
> >
> > > We managed to get our required flavor of hybrid search working in Solr
> > > via a nested index.
> > >
> > > The required flavor: applying both lexical search and vector search in a
> > > single search call with a logical OR (a document could match purely
> > > lexically, purely as a vector match, or both). Our documents are large
> > > enough that chunking is needed and of course no duplication of results
> > > is allowed.
> > >
> > > The nested index is a construction where the parent documents are the
> > > large original documents and the children are the chunks that are
> > > vectorized:
> > >
> > > <doc>
> > >         <field name="id">doc-1</field>
> > >         <field name="type_s">parent</field>
> > >         <field name="full_doc_title">This is the title text</field>
> > >         <field name="full_doc_body">This is the full body text</field>
> > >         <field name="metadata1_s">some metadata</field>
> > >         <doc>
> > >                 <field name="id">doc-1.1</field>
> > >                 <field name="parentDoc">doc-1</field>
> > >                 <field name="type_s">child</field>
> > >                 <field name="chunk_body">This is the chunk body text</field>
> > >                 <field name="chunkoffsets_s">8123-12123</field>
> > >                 <field name="vector_field"><![CDATA[-0.0037859276]]></field>
> > >                 <field name="vector_field"><![CDATA[-0.012503299]]></field>
> > >                 <field name="vector_field"><![CDATA[0.018080892]]></field>
> > >                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
> > >                 ...
> > >         </doc>
> > >         <doc>
> > >                 <field name="id">doc-1.2</field>
> > >                 <field name="parentDoc">doc-1</field>
> > >                 <field name="type_s">child</field>
> > >                 <field name="chunk_body">This is the body text of another chunk</field>
> > >                 <field name="chunkoffsets_s">12200-12788</field>
> > >                 <field name="vector_field"><![CDATA[-0.0034859276]]></field>
> > >                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
> > >                 <field name="vector_field"><![CDATA[-0.016224038]]></field>
> > >                 <field name="vector_field"><![CDATA[0.025224038]]></field>
> > >                 ...
> > >         </doc>
> > >         <doc>
> > >         ...
> > >         </doc>
> > > </doc>
> > >
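> > > In case it helps, roughly the same block in Solr's JSON update format
> > > (an untested sketch; host, collection name and the truncated vectors are
> > > placeholders):
> > >
> > > curl -X POST 'http://localhost:8983/solr/<collection>/update?commit=true' \
> > >   -H 'Content-Type: application/json' -d '[
> > >   {
> > >     "id": "doc-1",
> > >     "type_s": "parent",
> > >     "full_doc_title": "This is the title text",
> > >     "full_doc_body": "This is the full body text",
> > >     "metadata1_s": "some metadata",
> > >     "_childDocuments_": [
> > >       {
> > >         "id": "doc-1.1",
> > >         "parentDoc": "doc-1",
> > >         "type_s": "child",
> > >         "chunk_body": "This is the chunk body text",
> > >         "chunkoffsets_s": "8123-12123",
> > >         "vector_field": [-0.0037859276, -0.012503299, 0.018080892, ...]
> > >       }
> > >     ]
> > >   }
> > > ]'
> > >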
> > > This query construction searches the parents lexically and the children
> > > via ANN search. The result set contains full documents only. Balancing
> > > the impact of lexical vs vector happens via kwweight and vectorweight
> > > (these values may change per query, depending on its nature). Note that
> > > this construction doesn't include score normalization, because this is
> > > an expensive operation when there are many results and, moreover,
> > > normalization doesn't guarantee proper blending of relevant lexical and
> > > vector results.
> > >
> > > params:{
> > >   uf:"* _query_",
> > >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> > >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> > >   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> > >   kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
> > >   qq:"What is the income tax in New York?",
> > >   vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
> > >   childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
> > >   kwweight:1,
> > >   vectorweight:4
> > > }
> > >
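> > > To make the weighting concrete (made-up numbers, purely to show how the
> > > sum behaves without normalization): with kwweight=1 and vectorweight=4, a
> > > parent whose edismax score is 2.0 and whose best child similarity is 0.8
> > > scores 1*2.0 + 4*0.8 = 5.2, while a pure vector match with similarity 0.8
> > > and no lexical hit still passes the $hybridlogic filter and scores
> > > 4*0.8 = 3.2.
> > >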
> > > This nested index is multi-purpose: for hybrid searching full documents
> > > (the construction above) and for hybrid searching the chunks only (see
> > > below).
> > >
> > > The following query construction searches the chunks both lexically and
> > > via ANN search. The result set contains chunks only. This is meant for
> > > RAG use cases where we're only interested in document chunks as context
> > > for the LLM.
> > >
> > > params:{
> > >   uf:"* _query_",
> > >   q:"{!bool filter=$hybridlogic must=$hybridscore}",
> > >   hybridlogic:"{!bool should=$kwq should=$vectorq}",
> > >   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
> > >   kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
> > >   qq:"What is the income tax in New York?",
> > >   vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
> > >   kwweight:1,
> > >   vectorweight:4
> > > }
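> > >
> > > Either set of params can be sent as-is through the JSON Request API,
> > > along the lines of this untested sketch (host and collection name are
> > > placeholders; the params object is just the block above with JSON-quoted
> > > keys and values):
> > >
> > > curl 'http://localhost:8983/solr/<collection>/query' \
> > >   -H 'Content-Type: application/json' \
> > >   -d '{ "params": { ...the params block above... } }'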
> > >
> > > We recently gave a presentation about this and other things at the
> > > Haystack EU 2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI
> > >
> > >
> > > Regards, Tom
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Rahul Goswami <[email protected]>
> > > Sent: Sunday, August 31, 2025 2:08 PM
> > > To: [email protected]
> > > Subject: Re: TopK strategy for vectorized chunks in Solr
> > >
> > > Hello,
> > > Floating this up again in case anyone has any insights. Thanks.
> > >
> > > Rahul
> > >
> > > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> > > wrote:
> > >
> > > > Hello,
> > > > A question for folks using Solr as the vector db in their solutions.
> > > > As of now, since Solr doesn't have parent/child or multi-valued vector
> > > > field support for vector search, what are some strategies that can be
> > > > used to avoid duplicates in top K results when you have vectorized
> > > > chunks for the same (large) document?
> > > >
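> > > > (One strategy I've been toying with, assuming each chunk doc carries a
> > > > parent key like "ParentId", is to collapse the knn results on that key,
> > > > roughly:
> > > >
> > > >   q={!knn f=vector_field topK=100}[...]&fq={!collapse field=ParentId}
> > > >
> > > > but I'm unsure how collapsing interacts with topK, hence the question.)
> > > >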
> > > > It would also be helpful to know how folks are doing this when storing
> > > > vectors in the same docs as the lexical index vs. when having the
> > > > vectorized chunks in a separate index.
> > > >
> > > > Thanks.
> > > > Rahul
> > > >
> > >
> >
>
