We managed to get our required flavor of hybrid search working in Solr via a
nested index.
The required flavor: applying both lexical search and vector search in a single
search call with a logical OR (a document can match purely lexically, purely as
a vector match, or both). Our documents are large enough that chunking is
needed, and of course no duplicate results are allowed.
The nested index is a construction where the parent documents are the large
original documents and the children are the chunks that are vectorized:
<doc>
<field name="id">doc-1</field>
<field name="type_s">parent</field>
<field name="full_doc_title">This is the title text</field>
<field name="full_doc_body">This is the full body text</field>
<field name="metadata1_s">some metadata</field>
<doc>
<field name="id">doc-1.1</field>
<field name="parentDoc">doc-1</field>
<field name="type_s">child</field>
<field name="chunk_body">This is the chunk body text</field>
<field name="chunkoffsets_s">8123-12123</field>
<field name="vector_field"><![CDATA[-0.0037859276]]></field>
<field name="vector_field"><![CDATA[-0.012503299]]></field>
<field name="vector_field"><![CDATA[0.018080892]]></field>
<field name="vector_field"><![CDATA[0.0024048693]]></field>
...
</doc>
<doc>
<field name="id">doc-1.2</field>
<field name="parentDoc">doc-1</field>
<field name="type_s">child</field>
<field name="chunk_body">This is the body text of another chunk</field>
<field name="chunkoffsets_s">12200-12788</field>
<field name="vector_field"><![CDATA[-0.0034859276]]></field>
<field name="vector_field"><![CDATA[0.0024048693]]></field>
<field name="vector_field"><![CDATA[-0.016224038]]></field>
<field name="vector_field"><![CDATA[0.025224038]]></field>
...
</doc>
<doc>
...
</doc>
</doc>
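As an illustration of the structure above, here is a minimal Python sketch that builds such a nested document with the standard library. The field names are taken from the example; the function names, the (start, end, vector) chunk format, and the reading of chunkoffsets_s as character offsets are assumptions made for this sketch. Chunking and embedding are left out.

```python
import xml.etree.ElementTree as ET


def add_field(doc, name, value):
    """Append <field name="...">value</field> to a <doc> element."""
    field = ET.SubElement(doc, "field", {"name": name})
    field.text = str(value)
    return field


def build_nested_doc(doc_id, title, body, chunks):
    """Build a parent <doc> with one child <doc> per chunk.

    `chunks` is a list of (start, end, vector) tuples. start/end are assumed
    to be character offsets into `body`; vector is a list of floats. How the
    chunks and vectors are produced is outside this sketch.
    """
    parent = ET.Element("doc")
    add_field(parent, "id", doc_id)
    add_field(parent, "type_s", "parent")
    add_field(parent, "full_doc_title", title)
    add_field(parent, "full_doc_body", body)
    for i, (start, end, vector) in enumerate(chunks, start=1):
        child = ET.SubElement(parent, "doc")
        add_field(child, "id", f"{doc_id}.{i}")
        add_field(child, "parentDoc", doc_id)
        add_field(child, "type_s", "child")
        add_field(child, "chunk_body", body[start:end])
        add_field(child, "chunkoffsets_s", f"{start}-{end}")
        # One <field> element per vector component, as in the example above.
        for component in vector:
            add_field(child, "vector_field", component)
    return parent
```

Calling ET.tostring(build_nested_doc(...), encoding="unicode") gives the XML as a string.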
This query construction searches the parents lexically and the children via ANN
search. The result set contains full documents only. The balance between the
lexical and the vector score is set via kwweight and vectorweight (these values
may change per query, depending on its nature). Note that this construction
doesn't include score normalization, because normalization is expensive when
there are many results and, moreover, doesn't guarantee proper blending of
relevant lexical and vector results.
params:{
uf:"* _query_",
q:"{!bool filter=$hybridlogic must=$hybridscore}",
hybridlogic:"{!bool should=$kwq should=$vectorq}",
hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
qq:"What is the income tax in New York?",
vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
kwweight:1,
vectorweight:4
}
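To make the scoring in these params concrete, here is a small Python sketch of the same arithmetic outside Solr. It is an illustration only: the function names and the input format are made up for this sketch, and it assumes that a sub-query which doesn't match a document contributes a score of 0 to the sum.

```python
def parent_vector_scores(chunk_hits):
    """Collapse (parent_id, chunk_score) ANN hits to one score per parent.

    Mirrors score=max in the {!parent} query: each parent appears once,
    with the score of its best-matching chunk.
    """
    best = {}
    for parent_id, score in chunk_hits:
        best[parent_id] = max(score, best.get(parent_id, score))
    return best


def hybrid_score(kw_score, vector_score, kwweight=1.0, vectorweight=4.0):
    """Weighted sum of both scores; None means "no match" and counts as 0."""
    return kwweight * (kw_score or 0.0) + vectorweight * (vector_score or 0.0)


def hybrid_rank(kw_scores, vector_scores, kwweight=1.0, vectorweight=4.0):
    """Rank parent ids that match lexically, by vector, or both (logical OR).

    kw_scores and vector_scores map parent id -> raw score of each sub-query.
    """
    matched = set(kw_scores) | set(vector_scores)  # the two `should` clauses
    scored = {
        doc_id: hybrid_score(kw_scores.get(doc_id), vector_scores.get(doc_id),
                             kwweight, vectorweight)
        for doc_id in matched
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

With kwweight=1 and vectorweight=4, a parent that matches only through one of its chunks can still outrank a parent that matches only lexically, which is the reason the weights are tuned per query.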
This nested index is multi-purpose: for hybrid searching full documents (the
construction above) and for hybrid searching the chunks only (see below).
The following query construction searches the chunks both lexically and via
ANN search. The result set contains chunks only. This is meant for RAG use cases
where we're only interested in document chunks as context for the LLM.
params:{
uf:"* _query_",
q:"{!bool filter=$hybridlogic must=$hybridscore}",
hybridlogic:"{!bool should=$kwq should=$vectorq}",
hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
qq:"What is the income tax in New York?",
vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
kwweight:1,
vectorweight:4
}
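For completeness, a minimal Python sketch of how the chunk-only params above could be assembled per request. The function names are hypothetical, and the question text and query vector are expected to come from the caller; computing the embedding is not shown.

```python
import json


def knn_query(vector, field="vector_field", top_k=10):
    """Format a list of floats as a {!knn} query string."""
    components = ",".join(repr(float(x)) for x in vector)
    return f"{{!knn f={field} topK={top_k}}}[{components}]"


def chunk_hybrid_params(question, vector, kwweight=1, vectorweight=4, top_k=10):
    """Return the params for the chunk-only hybrid query as a dict."""
    return {
        "uf": "* _query_",
        "q": "{!bool filter=$hybridlogic must=$hybridscore}",
        "hybridlogic": "{!bool should=$kwq should=$vectorq}",
        "hybridscore": "{!func}sum(product($kwweight,$kwq),"
                       "product($vectorweight,query($vectorq)))",
        "kwq": '{!type=edismax qf="chunk_body" v=$qq}',
        "qq": question,
        "vectorq": knn_query(vector, top_k=top_k),
        "kwweight": kwweight,
        "vectorweight": vectorweight,
    }


def request_body(question, vector, **kwargs):
    """Serialize the params as a JSON body with a top-level "params" key."""
    return json.dumps({"params": chunk_hybrid_params(question, vector, **kwargs)})
```

Keeping the user text in qq and the vector in vectorq means only those two values and the weights change between requests; the rest of the construction stays fixed.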
We recently gave a presentation about this and other things at the Haystack EU
2025 conference: https://www.youtube.com/watch?v=3CPa1MpnLlI
Regards, Tom
-----Original Message-----
From: Rahul Goswami <[email protected]>
Sent: Sunday, August 31, 2025 2:08 PM
To: [email protected]
Subject: Re: TopK strategy for vectorized chunks in Solr
Hello,
Floating this up again in case anyone has any insights. Thanks.
Rahul
On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
wrote:
> Hello,
> A question for folks using Solr as the vector db in their solutions.
> Since Solr currently doesn't support parent/child or multi-valued
> vector fields for vector search, what are some strategies that
> can be used to avoid duplicates in top K results when you have
> vectorized chunks for the same (large) document?
>
> It would also be helpful to know how folks are doing this when storing
> vectors in the same docs as the lexical index vs. when having the
> vectorized chunks in a separate index.
>
> Thanks.
> Rahul
>