Re: TopK strategy for vectorized chunks in Solr

Rahul Goswami Wed, 22 Oct 2025 15:12:08 -0700

Guillaume Hoss, Tom,
Thank you for your inputs.

Tom,
Thanks for the detailed explanation. I am also going over your talk as we
speak. A follow-up on your index design: what advantage does the nested doc
design provide in this case?


If my understanding is correct, had the parent and child docs been
unrelated docs connected by a "secondary key" in the child docs (say
"ParentId"), you could still have used the "join" parser and achieved the
same result as the "parent" parser, no?

Especially since the JIRA for getting topK parent hits is still in progress
(https://issues.apache.org/jira/browse/SOLR-17736).
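
To make the comparison concrete, here is a minimal Python sketch of the two
vectorq clauses I have in mind. The first is the block-join form from Tom's
params. The second is only my assumption of how a flat index, with a ParentId
field on each chunk doc, would be queried through the join parser. The helper
names are hypothetical, and I have not run the join variant against a real
index, so whether knn topK behaves the same way there is untested.

```python
def parent_clause(child_param: str = "childq") -> str:
    """Block-join form: chunks are nested children, rolled up to their parent."""
    return f'{{!parent which="type_s:parent" score=max v=${child_param}}}'


def join_clause(child_param: str = "childq") -> str:
    """Assumed flat-index form: chunks are standalone docs whose ParentId
    field holds the id of the parent doc (hypothetical schema)."""
    return f"{{!join from=ParentId to=id score=max v=${child_param}}}"


def build_params(vector_clause: str) -> dict:
    """Assemble only the vector-related request params; the vector is elided."""
    return {
        "vectorq": vector_clause,
        "childq": "{!knn f=vector_field topK=10}[...]",
    }


nested_params = build_params(parent_clause())
flat_params = build_params(join_clause())
```

Only the vectorq clause differs between the two dicts, which is why I am
wondering what the nested layout adds.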

How are you handling changes to your child docs? (I assume you'd need to
reindex the whole block?)

Thanks,
Rahul

On Mon, Oct 6, 2025 at 2:49 PM Burgmans, Tom
<[email protected]> wrote:

> We managed to get our required flavor of hybrid search working in Solr via a
> nested index.
>
> The required flavor: applying both lexical search and vector search in a
> single search call with a logical OR (a document could match purely
> lexically, purely as a vector match, or both). Our documents are large enough
> that chunking is needed, and of course no duplication of results is allowed.
>
> The nested index is a construction where the parent documents are the
> large original documents and the children are the chunks that are
> vectorized:
>
> <doc>
>         <field name="id">doc-1</field>
>         <field name="type_s">parent</field>
>         <field name="full_doc_title">This is the title text</field>
>         <field name="full_doc_body">This is the full body text</field>
>         <field name="metadata1_s">some metadata</field>
>         <doc>
>                 <field name="id">doc-1.1</field>
>                 <field name="parentDoc">doc-1</field>
>                 <field name="type_s">child</field>
>                 <field name="chunk_body">This is the chunk body text</field>
>                 <field name="chunkoffsets_s">8123-12123</field>
>                 <field name="vector_field"><![CDATA[-0.0037859276]]></field>
>                 <field name="vector_field"><![CDATA[-0.012503299]]></field>
>                 <field name="vector_field"><![CDATA[0.018080892]]></field>
>                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
>                 ...
>         </doc>
>         <doc>
>                 <field name="id">doc-1.2</field>
>                 <field name="parentDoc">doc-1</field>
>                 <field name="type_s">child</field>
>                 <field name="chunk_body">This is the body text of another chunk</field>
>                 <field name="chunkoffsets_s">12200-12788</field>
>                 <field name="vector_field"><![CDATA[-0.0034859276]]></field>
>                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
>                 <field name="vector_field"><![CDATA[-0.016224038]]></field>
>                 <field name="vector_field"><![CDATA[0.025224038]]></field>
>                 ...
>         </doc>
>         <doc>
>         ...
>         </doc>
> </doc>
>
> This query construction searches the parents lexically and the children
> via ANN search. The result set contains full documents only. Balancing the
> impact of lexical vs vector happens via kwweight and vectorweight (these
> values may change per query, depending on its nature). Note that this
> construction doesn't include score normalization, because this is an
> expensive operation when there are many results and moreover normalization
> doesn't guarantee proper blending of relevant lexical and vector results.
>
> params:{
>   uf:"* _query_",
>   q:"{!bool filter=$hybridlogic must=$hybridscore}",
>   hybridlogic:"{!bool should=$kwq should=$vectorq}",
>   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
>   kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
>   qq:"What is the income tax in New York?",
>   vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
>   childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
>   kwweight:1,
>   vectorweight:4
> }
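
Reading the hybridscore expression above as plain arithmetic, a matching
parent appears to be scored roughly as sketched below. This is only a sketch
under assumptions: hybrid_score is a hypothetical helper and not part of Solr,
a clause that does not match is taken to contribute 0, and score=max is taken
to mean the parent receives the score of its best-matching chunk. The default
weights are the kwweight and vectorweight values from the params.

```python
def hybrid_score(kw_score, child_vector_scores, kwweight=1.0, vectorweight=4.0):
    """Weighted sum of the lexical score and the best child (chunk) vector score.

    kw_score: lexical score of the parent, 0.0 if it did not match lexically.
    child_vector_scores: knn scores of this parent's chunks that made the topK;
        an empty list means no chunk matched, so the vector part is 0.0.
    """
    best_child = max(child_vector_scores, default=0.0)  # mirrors score=max
    return kwweight * kw_score + vectorweight * best_child


both = hybrid_score(2.0, [0.5, 0.75])    # lexical and vector match
lexical_only = hybrid_score(2.0, [])     # no chunk in the topK
vector_only = hybrid_score(0.0, [0.75])  # no lexical match
```

Because no normalization is applied, the two weights are the only control over
how raw lexical scores and vector similarities trade off.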
>
> This nested index is multi-purpose: for hybrid searching full documents
> (the construction above) and for hybrid searching the chunks only (see
> below).
>
> The following query construction searches the chunks both lexically and
> via ANN search. The result set contains chunks only. This is meant for RAG
> use cases where we're only interested in document chunks as context for the
> LLM.
>
> params:{
>   uf:"* _query_",
>   q:"{!bool filter=$hybridlogic must=$hybridscore}",
>   hybridlogic:"{!bool should=$kwq should=$vectorq}",
>   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
>   kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
>   qq:"What is the income tax in New York?",
>   vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
>   kwweight:1,
>   vectorweight:4
> }
>
> We recently gave a presentation about this and other things at the
> Haystack EU 2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI
>
>
> Regards, Tom
>
>
>
>
> -----Original Message-----
> From: Rahul Goswami <[email protected]>
> Sent: Sunday, August 31, 2025 2:08 PM
> To: [email protected]
> Subject: Re: TopK strategy for vectorized chunks in Solr
>
> Hello,
> Floating this up again in case anyone has any insights. Thanks.
>
> Rahul
>
> On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> wrote:
>
> > Hello,
> > A question for folks using Solr as the vector db in their solutions.
> > As of now since Solr doesn't support parent/child or multi-valued
> > vector field support for vector search, what are some strategies that
> > can be used to avoid duplicates in top K results when you have
> > vectorized chunks for the same (large) document?
> >
> > Would be also helpful to know how folks are doing this when storing
> > vectors in the same docs as the lexical index vs when having the
> > vectorized chunks in a separate index.
> >
> > Thanks.
> > Rahul
> >
>
