Hi Guillaume,

Thank you for the detailed and thoughtfully crafted response with examples, and apologies for not being able to respond sooner.
I do have a few follow-up questions:

1) How many docs (parent + chunks) does your index hold?
2) Is the query-time join scaling well?
3) For pre-filtering, did you consider duplicating minimal metadata in the chunk docs so you can pre-filter directly on the chunks instead of joining back to the parent?
4) Do you have a use case for fetching the top-k parents instead? If so, how are you reliably achieving this, since multiple chunks could correspond to the same parent, shrinking your result set?
5) As a follow-up to #4, do you have a use case for pagination based on vector search? If so, how are you achieving it (given the same constraint as in #4)?

Thanks in advance!
-Rahul

On Tue, Sep 2, 2025 at 1:16 PM Guillaume <[email protected]> wrote:

> Hello Rahul,
>
> Currently, I’m using the following topology:
>
> * I index my document records in the usual way.
> * I index the chunk records by referencing their parent record id.
>
> Concretely, this looks like (simplified version):
>
> Document 1
> -id: DOC_1
> -title: 2025 Annual Report
> -document_type: PDF
>
> Chunk 1 of document 1
> -id: CHUNK_1_1
> -text: <text of the first chunk>
> -vector: <embedding of the first chunk>
> -parent_id: DOC_1
> -position: 0
>
> Chunk 2 of document 1
> -id: CHUNK_1_2
> -text: <text of the second chunk>
> -vector: <embedding of the second chunk>
> -parent_id: DOC_1
> -position: 1
> …
>
> When I want to retrieve documents via a semantic search on the chunks, I
> use a join, like this:
>
> q={!join from=parent_id to=id score=max}{!knn f=vector topK=100}[0.255,0.36,…]
>
> Using this aggregation guarantees that I won’t get duplicate documents in
> the result set. However, even though I request 100 chunks (topK), I’ll
> probably get fewer documents because several chunks may belong to the same
> document. I use the “max” aggregation to rank documents by their best
> chunk.
> If I need to apply a filter on the **documents** (e.g., restrict the
> semantic search to PDF documents), things get a bit more complicated
> because the filtering must happen in the `preFilter` of the KNN search.
> Here’s an example:
>
> q={!join from=parent_id to=id score=max}{!knn f=vector topK=100 preFilter=$type_prefilter}[0.255,0.36,…]&type_prefilter={!join from=id to=parent_id score=none} document_type:PDF
>
> The pre-filtering is performed on the **documents**. Then the join fetches
> the chunks associated with the documents that satisfy the constraint
> (`document_type:PDF`). Those resulting chunks become the corpus for the
> main semantic search (through `preFilter`).
>
> This indexing scheme works great for me because it lets me manage document
> indexing and chunk indexing in a completely decoupled way. Solutions based
> on “partial updates” or “nested documents” are problematic for me because I
> can’t guarantee that all fields are `stored`, and I don’t want to have to
> rebuild the documents when I index chunks.
>
> I'm sure a better way to do this must exist, especially because *joins
> always end up becoming a problem as the number of documents grows* (even
> with docValues).
>
> Hope this helps you!
>
> By the way, here’s an excellent video by Alessandro Benedetti that I
> thought you might like:
> https://youtu.be/9KJTbgtFWOU?si=YAUPNvfDhlX3NmJc&t=1450
>
> Guillaume
>
> On Sun, Aug 31, 2025 at 16:08, Sergio García Maroto <[email protected]> wrote:
>
> > Hi Rahul.
> >
> > Have you explored the possibility of using streaming expressions? You
> > can get tuples back and group them.
> >
> > Regards,
> > Sergio
> >
> > On Sun, Aug 31, 2025 at 14:09, Rahul Goswami <[email protected]> wrote:
> >
> > > Hello,
> > > Floating this up again in case anyone has any insights. Thanks.
> > > Rahul
> > >
> > > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]> wrote:
> > >
> > > > Hello,
> > > > A question for folks using Solr as the vector db in their solutions.
> > > > As of now, since Solr doesn't have parent/child or multi-valued
> > > > vector field support for vector search, what are some strategies
> > > > that can be used to avoid duplicates in the top-K results when you
> > > > have vectorized chunks for the same (large) document?
> > > >
> > > > It would also be helpful to know how folks are doing this when
> > > > storing vectors in the same docs as the lexical index vs. when
> > > > having the vectorized chunks in a separate index.
> > > >
> > > > Thanks.
> > > > Rahul
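For anyone following the thread, the collapse step behind questions #4 and #5 (multiple chunk hits per parent shrinking the result set, which Guillaume's `score=max` join resolves server-side) can be sketched client-side in a few lines. This is only an illustration of the technique under discussion, not Solr code: the function names, the `search(n)` callable, and the sample scores are all hypothetical.

```python
# Sketch of what {!join from=parent_id to=id score=max} effectively does:
# collapse chunk hits to unique parents, each scored by its best chunk.
# All names and data below are hypothetical illustrations, not Solr APIs.

def collapse_to_parents(chunk_hits, k):
    """chunk_hits: iterable of (parent_id, score). Returns the top-k
    parents, each scored by its best chunk (the 'max' aggregation)."""
    best = {}
    for parent_id, score in chunk_hits:
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    # Rank parents by their best chunk score.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]

def topk_parents(search, k, overfetch=4):
    """Re: question #4 -- over-fetch chunks until k distinct parents
    surface. `search(n)` stands in for a {!knn ... topK=n} query that
    returns the n best (parent_id, score) chunk hits."""
    n = k * overfetch
    while True:
        hits = search(n)
        parents = collapse_to_parents(hits, k)
        if len(parents) >= k or len(hits) < n:
            # Enough distinct parents, or the index is exhausted.
            return parents
        n *= 2  # chunks kept collapsing onto the same parents; ask for more

# Hypothetical chunk hits, best-first; two chunks of DOC_1 collapse to one.
hits = [("DOC_1", 0.91), ("DOC_2", 0.88), ("DOC_1", 0.83),
        ("DOC_3", 0.70), ("DOC_2", 0.65)]

print(collapse_to_parents(hits, 2))  # [('DOC_1', 0.91), ('DOC_2', 0.88)]
print(topk_parents(lambda n: hits[:n], 3))
```

Re: question #5, paginating by slicing the collapsed list is only safe if the over-fetch already surfaced every parent up to the end of the requested page; otherwise a later, larger fetch can reshuffle earlier pages, which is the same constraint Rahul points out in #4.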
