RE: TopK strategy for vectorized chunks in Solr

Burgmans, Tom Fri, 17 Oct 2025 13:31:55 -0700

We managed to get our required flavor of hybrid search working in Solr via a 
nested index.

The required flavor: applying both lexical search and vector search in a single 
search call with a logical OR (a document can match purely lexically, purely as 
a vector match, or both). Our documents are large enough that chunking is 
needed, and of course no duplicate results are allowed.

The nested index is a construction where the parent documents are the large 
original documents and the children are the chunks that are vectorized:

<doc>
        <field name="id">doc-1</field>
        <field name="type_s">parent</field>
        <field name="full_doc_title">This is the title text</field>
        <field name="full_doc_body">This is the full body text</field>
        <field name="metadata1_s">some metadata</field>
        <doc>
                <field name="id">doc-1.1</field>    
                <field name="parentDoc">doc-1</field>    
                <field name="type_s">child</field>    
                <field name="chunk_body">This is the chunk body text</field>    
                <field name="chunkoffsets_s">8123-12123</field>    
                <field name="vector_field"><![CDATA[-0.0037859276]]></field>
                <field name="vector_field"><![CDATA[-0.012503299]]></field>
                <field name="vector_field"><![CDATA[0.018080892]]></field>
                <field name="vector_field"><![CDATA[0.0024048693]]></field>
                ...
        </doc>
        <doc>
                <field name="id">doc-1.2</field>
                <field name="parentDoc">doc-1</field>
                <field name="type_s">child</field>
                <field name="chunk_body">This is the body text of another chunk</field>
                <field name="chunkoffsets_s">12200-12788</field>
                <field name="vector_field"><![CDATA[-0.0034859276]]></field>
                <field name="vector_field"><![CDATA[0.0024048693]]></field>
                <field name="vector_field"><![CDATA[-0.016224038]]></field>
                <field name="vector_field"><![CDATA[0.025224038]]></field>
                ...
        </doc>
        <doc>
        ...
        </doc>
</doc>
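
As an illustration of how such a nested document could be assembled from pre-computed chunks, here is a minimal Python sketch using only the standard library. The function names and the shape of the `chunks` input are hypothetical and not part of our actual indexing pipeline; the field names are taken from the example above (metadata fields are left out for brevity).

```python
import xml.etree.ElementTree as ET


def add_field(doc, name, value):
    """Append <field name="...">value</field> to a <doc> element."""
    field = ET.SubElement(doc, "field", {"name": name})
    field.text = str(value)


def build_nested_doc(doc_id, title, body, chunks):
    """Build a parent <doc> with one child <doc> per chunk.

    chunks is a list of (text, start_offset, end_offset, vector) tuples
    (a hypothetical input shape chosen for this sketch).
    """
    parent = ET.Element("doc")
    add_field(parent, "id", doc_id)
    add_field(parent, "type_s", "parent")
    add_field(parent, "full_doc_title", title)
    add_field(parent, "full_doc_body", body)
    for n, (text, start, end, vector) in enumerate(chunks, start=1):
        child = ET.SubElement(parent, "doc")
        add_field(child, "id", f"{doc_id}.{n}")
        add_field(child, "parentDoc", doc_id)
        add_field(child, "type_s", "child")
        add_field(child, "chunk_body", text)
        add_field(child, "chunkoffsets_s", f"{start}-{end}")
        # One <field> per vector component, as in the example document.
        for component in vector:
            add_field(child, "vector_field", component)
    return parent


doc = build_nested_doc(
    "doc-1",
    "This is the title text",
    "This is the full body text",
    [("This is the chunk body text", 8123, 12123, [-0.0037859276, -0.012503299])],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

The resulting `xml_text` would still need to be wrapped in an `<add>` element before being posted to Solr's update handler.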

The following query construction searches the parents lexically and the 
children via ANN search. The result set contains full documents only. The 
balance between lexical and vector matches is set via kwweight and vectorweight 
(these values may change per query, depending on its nature). Note that this 
construction doesn't include score normalization, because normalization is 
expensive when there are many results and, moreover, doesn't guarantee proper 
blending of relevant lexical and vector results.

params:{
  uf:"* _query_",
  q:"{!bool filter=$hybridlogic must=$hybridscore}",
  hybridlogic:"{!bool should=$kwq should=$vectorq}",
  hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
  kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
  qq:"What is the income tax in New York?",  
  vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
  childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
  kwweight:1,
  vectorweight:4
}
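
To make the scoring concrete, the following Python sketch imitates what the query above does to the scores, outside of Solr. It is a simplified model under stated assumptions: a clause that does not match a document contributes 0 to the sum, `score=max` collapses all chunk hits of one parent into a single value, and the input dictionaries and the function name are hypothetical stand-ins for the scores Solr would compute.

```python
def hybrid_parent_scores(lexical, chunk_hits, parent_of,
                         kwweight=1.0, vectorweight=4.0):
    """Blend lexical parent scores with ANN chunk scores.

    lexical:    {parent_id: lexical score}
    chunk_hits: {chunk_id: similarity score of a top-K ANN hit}
    parent_of:  {chunk_id: parent_id} (the parentDoc relation)
    """
    # Mimic {!parent ... score=max}: each parent appears once and gets
    # the best score among its matching chunks, so no duplicates.
    vector = {}
    for chunk_id, score in chunk_hits.items():
        parent_id = parent_of[chunk_id]
        vector[parent_id] = max(score, vector.get(parent_id, 0.0))

    # Logical OR: a parent qualifies if either side matched.
    combined = {}
    for doc_id in lexical.keys() | vector.keys():
        combined[doc_id] = (kwweight * lexical.get(doc_id, 0.0)
                            + vectorweight * vector.get(doc_id, 0.0))
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


ranking = hybrid_parent_scores(
    lexical={"doc-1": 2.0, "doc-2": 4.0},
    chunk_hits={"doc-1.1": 0.5, "doc-1.2": 0.75, "doc-3.1": 0.25},
    parent_of={"doc-1.1": "doc-1", "doc-1.2": "doc-1", "doc-3.1": "doc-3"},
)
# ranking == [("doc-1", 5.0), ("doc-2", 4.0), ("doc-3", 1.0)]
```

Here doc-1 matches both ways (2.0 + 4 x 0.75), doc-2 only lexically and doc-3 only as a vector match, and each document appears exactly once even though doc-1 has two chunk hits.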

This nested index is multi-purpose: it supports hybrid search over full 
documents (the construction above) and hybrid search over the chunks only (see 
below).

The following query construction searches the chunks both lexically and via 
ANN search. The result set contains chunks only. This is meant for RAG use 
cases where we're only interested in document chunks as context for the LLM.

params:{
  uf:"* _query_",
  q:"{!bool filter=$hybridlogic must=$hybridscore}",
  hybridlogic:"{!bool should=$kwq should=$vectorq}",
  hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
  kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
  qq:"What is the income tax in New York?",
  vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
  kwweight:1,
  vectorweight:4
}
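
Since the query text, the vector and the two weights change per request, the params above lend themselves to being generated. Below is a minimal Python sketch of that for the chunk-only construction. The helper names are hypothetical, and how the request is sent to Solr is not shown; the sketch only builds a body with a `params` key as in the examples above.

```python
import json


def knn_query(vector, field="vector_field", top_k=10):
    """Format a {!knn} query string for the given query vector."""
    components = ",".join(repr(float(x)) for x in vector)
    return f"{{!knn f={field} topK={top_k}}}[{components}]"


def chunk_hybrid_params(query_text, vector, kwweight=1, vectorweight=4):
    """Build the chunk-only hybrid params shown above."""
    return {
        "uf": "* _query_",
        "q": "{!bool filter=$hybridlogic must=$hybridscore}",
        "hybridlogic": "{!bool should=$kwq should=$vectorq}",
        "hybridscore": "{!func}sum(product($kwweight,$kwq),"
                       "product($vectorweight,query($vectorq)))",
        "kwq": '{!type=edismax qf="chunk_body" v=$qq}',
        "qq": query_text,
        "vectorq": knn_query(vector),
        "kwweight": kwweight,
        "vectorweight": vectorweight,
    }


params = chunk_hybrid_params(
    "What is the income tax in New York?",
    [-0.002503299, -0.001550957, 0.018080892],  # truncated vector, for illustration
)
body = json.dumps({"params": params}, indent=2)
```

A real query vector must of course have the full dimension configured for vector_field; the three components here only keep the example short.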

We recently gave a presentation about this and other things at the Haystack EU 
2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI

Regards, Tom

-----Original Message-----
From: Rahul Goswami <[email protected]> 
Sent: Sunday, August 31, 2025 2:08 PM
To: [email protected]
Subject: Re: TopK strategy for vectorized chunks in Solr

Hello,
Floating this up again in case anyone has any insights. Thanks.

Rahul

On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
wrote:

> Hello,
> A question for folks using Solr as the vector db in their solutions. 
> As of now, since Solr doesn't support parent/child or multi-valued 
> vector fields for vector search, what are some strategies that 
> can be used to avoid duplicates in top K results when you have 
> vectorized chunks for the same (large) document?
>
> It would also be helpful to know how folks are doing this when storing 
> vectors in the same docs as the lexical index vs when having the 
> vectorized chunks in a separate index.
>
> Thanks.
> Rahul
>
