Hi Guillaume,

Thank you for the detailed and thoughtfully crafted response with examples, and apologies for not being able to respond sooner.
I do have a few follow-up questions:

1) How many docs (parent + chunks) does your index hold?
2) Is the query-time join scaling well?
3) For pre-filtering, did you consider duplicating minimal metadata in the chunk docs so you can pre-filter directly on the chunks instead of joining back to the parent?
4) Do you have a use case for fetching the top-k parents instead? If so, how are you reliably achieving this, since multiple chunks could correspond to the same parent, shrinking your result set?
5) As a follow-up to #4, do you have a use case for pagination based on vector search? If so, how are you achieving it (given the same constraint as in #4)?

Thanks in advance!
-Rahul

On Tue, Sep 2, 2025 at 1:16 PM Guillaume <[email protected]> wrote:

> Hello Rahul,
>
> Currently, I’m using the following topology:
>
> * I index my document records in the usual way.
> * I index the chunk records by referencing their parent record id.
>
> Concretely, this looks like (simplified version):
>
> Document 1
> -id: DOC_1
> -title: 2025 Annual Report
> -document_type: PDF
>
> Chunk 1 of document 1
> -id: CHUNK_1_1
> -text: <text of the first chunk>
> -vector: <embedding of the first chunk>
> -parent_id: DOC_1
> -position: 0
>
> Chunk 2 of document 1
> -id: CHUNK_1_2
> -text: <text of the second chunk>
> -vector: <embedding of the second chunk>
> -parent_id: DOC_1
> -position: 1
> …
>
> When I want to retrieve documents via a semantic search on the chunks, I
> use a join, like this:
>
> q={!join from=parent_id to=id score=max}{!knn f=vector topK=100}[0.255,0.36,…]
>
> Using this aggregation guarantees that I won’t get duplicate documents in
> the result set. However, even though I request 100 chunks (topK), I’ll
> probably get fewer documents because several chunks may belong to the same
> document. I use the “max” aggregation to rank documents by their best
> chunk.
> If I need to apply a filter on the **documents** (e.g., restrict the
> semantic search to PDF documents), things get a bit more complicated
> because the filtering must happen in the `preFilter` of the KNN search.
> Here’s an example:
>
> q={!join from=parent_id to=id score=max}{!knn f=vector topK=100 preFilter=$type_prefilter}[0.255,0.36,…]&type_prefilter={!join from=id to=parent_id score=none} document_type:PDF
>
> The pre-filtering is performed on the **documents**. Then the join fetches
> the chunks associated with the documents that satisfy the constraint
> (`document_type:PDF`). Those resulting chunks become the corpus for the
> main semantic search (through `preFilter`).
>
> This indexing scheme works great for me because it lets me manage document
> indexing and chunk indexing in a completely decoupled way. Solutions based
> on “partial updates” or “nested documents” are problematic for me because I
> can’t guarantee that all fields are `stored`, and I don’t want to have to
> rebuild the documents when I index chunks.
>
> I'm sure a better way to do this must exist, especially because *joins
> always end up becoming a problem as the number of documents grows* (even
> with docValues).
>
> Hope this helps you!
>
> By the way, here’s an excellent video by Alessandro Benedetti that I
> thought you might like:
> https://youtu.be/9KJTbgtFWOU?si=YAUPNvfDhlX3NmJc&t=1450
>
> Guillaume
>
> On Sun, Aug 31, 2025 at 16:08, Sergio García Maroto <[email protected]> wrote:
>
> > Hi Rahul.
> >
> > Have you explored the possibility of using streaming expressions? You
> > can get tuples back and group them.
> >
> > Regards,
> > Sergio
> >
> > On Sun, Aug 31, 2025 at 14:09, Rahul Goswami <[email protected]> wrote:
> >
> > > Hello,
> > > Floating this up again in case anyone has any insights. Thanks.
> > > Rahul
> > >
> > > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]> wrote:
> > >
> > > > Hello,
> > > > A question for folks using Solr as the vector db in their solutions.
> > > > As of now, since Solr doesn't have parent/child or multi-valued
> > > > vector field support for vector search, what are some strategies
> > > > that can be used to avoid duplicates in the top-K results when you
> > > > have vectorized chunks for the same (large) document?
> > > >
> > > > It would also be helpful to know how folks are doing this when
> > > > storing vectors in the same docs as the lexical index vs. when
> > > > having the vectorized chunks in a separate index.
> > > >
> > > > Thanks.
> > > > Rahul
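For anyone following the thread, the collapse step behind questions #4 and #5 (multiple chunk hits per parent shrinking the result set, which Guillaume's `score=max` join resolves server-side) can be sketched client-side in a few lines. This is only an illustration of the technique under discussion, not Solr code: the function names, the `search(n)` callable, and the sample scores are all hypothetical.

```python
# Sketch of what {!join from=parent_id to=id score=max} effectively does:
# collapse chunk hits to unique parents, each scored by its best chunk.
# All names and data below are hypothetical illustrations, not Solr APIs.

def collapse_to_parents(chunk_hits, k):
    """chunk_hits: iterable of (parent_id, score). Returns the top-k
    parents, each scored by its best chunk (the 'max' aggregation)."""
    best = {}
    for parent_id, score in chunk_hits:
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    # Rank parents by their best chunk score.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]

def topk_parents(search, k, overfetch=4):
    """Re: question #4 -- over-fetch chunks until k distinct parents
    surface. `search(n)` stands in for a {!knn ... topK=n} query that
    returns the n best (parent_id, score) chunk hits."""
    n = k * overfetch
    while True:
        hits = search(n)
        parents = collapse_to_parents(hits, k)
        if len(parents) >= k or len(hits) < n:
            # Enough distinct parents, or the index is exhausted.
            return parents
        n *= 2  # chunks kept collapsing onto the same parents; ask for more

# Hypothetical chunk hits, best-first; two chunks of DOC_1 collapse to one.
hits = [("DOC_1", 0.91), ("DOC_2", 0.88), ("DOC_1", 0.83),
        ("DOC_3", 0.70), ("DOC_2", 0.65)]

print(collapse_to_parents(hits, 2))  # [('DOC_1', 0.91), ('DOC_2', 0.88)]
print(topk_parents(lambda n: hits[:n], 3))
```

Re: question #5, paginating by slicing the collapsed list is only safe if the over-fetch already surfaced every parent up to the end of the requested page; otherwise a later, larger fetch can reshuffle earlier pages, which is the same constraint Rahul points out in #4.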
