Hi Shiva,

Thanks for your reply.

Somehow I got confused about TOP-K. I thought each partition could have an
overlapping portion from the static part. So, will each partition be
processed on a single node?

Regarding the memory part, I'm glad to know that the size is not that huge.
:-)

Best,
Taewoo


On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]> wrote:

> Hi Taewoo,
>
> Thanks for the great questions.
>
> With regard to the distributed top-k, each data partition will return its
> local top-k result (if it has at least k records), and then we compute the
> global top-k from these local results (somewhat similar to a group-by).
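>
> To make the merge step concrete, here is a minimal sketch of the idea in
> Python (purely illustrative; the function names and the dist() callback are
> assumptions, not the actual AsterixDB operators):
>
>     import heapq
>
>     def local_topk(partition, query, k, dist):
>         # Each partition independently returns its k closest records
>         # (or all of them if it holds fewer than k).
>         return heapq.nsmallest(k, partition, key=lambda rec: dist(rec, query))
>
>     def global_topk(partitions, query, k, dist):
>         # Gather the per-partition candidates and keep the k closest overall,
>         # similar to how a distributed group-by combines local aggregates.
>         candidates = [rec for p in partitions
>                       for rec in local_topk(p, query, k, dist)]
>         return heapq.nsmallest(k, candidates, key=lambda rec: dist(rec, query))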
>
> With regard to the memory usage, the static part has to remain in memory,
> and our experiments have shown that its size is not that large compared to
> the size of the data (for 18GB of data stored in one data partition, i.e.,
> 1 million records each with a 960-dimensional embedding, the static part
> takes 11MB of memory). The reason our index is not memory-hungry is that
> only the static part holds embeddings; the data pages in the dynamic part,
> where the records are inserted, do not store each record's embedding but
> instead store its distance to the cluster's centroid. We will later explore
> storing a quantized vector for each record (which helps reduce execution
> time by sending fewer records to the primary index for distance
> calculations), and that might change the size of the dynamic part. It is
> important to note that each time a new memory component is created, the
> static structure is copied into it and the dynamic part is filled with the
> incoming data.
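>
> To illustrate the layout described above, a rough sketch in Python
> (hypothetical names, made up for readability; only the relationships
> between the static part, the dynamic part, and the memory component are
> meant to be accurate):
>
>     from dataclasses import dataclass, field
>
>     @dataclass(frozen=True)
>     class StaticPart:
>         # Read-only structure holding the embeddings (e.g., cluster
>         # centroids); the only place where full vectors are kept.
>         centroid_embeddings: list
>
>     @dataclass
>     class DynamicEntry:
>         primary_key: object
>         centroid_id: int
>         dist_to_centroid: float   # stored instead of the record's embedding
>
>     @dataclass
>     class MemoryComponent:
>         # The static part is copied in whenever a new memory component is
>         # created; the dynamic part fills up with the incoming records.
>         static_part: StaticPart
>         dynamic_part: list = field(default_factory=list)
>
>         def insert(self, primary_key, centroid_id, dist_to_centroid):
>             self.dynamic_part.append(
>                 DynamicEntry(primary_key, centroid_id, dist_to_centroid))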
>
>
>
> Best,
> Shiva
>
> Shiva Jahangiri
> Assistant Professor in Computer Science and Engineering Department
> Santa Clara University
>
>
>
> On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
>
> > Hi Shiva,
> >
> > This proposal looks good.
> >
> > I have two questions (sorry if I missed these):
> >
> > How are we going to handle distributed execution when dealing with top-k
> > ANN?
> > What does the memory component look like in terms of configurable size? My
> > naive understanding is that a vector index itself is memory-hungry.
> >
> > Best,
> > Taewoo
> >
> >
> > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Initiating discussion to add vector index in AsterixDB to support
> > > approximate nearest neighbor (ANN) queries.
> > >
> > >  Feature: Adding Vector Index
> > >
> > > Details: Currently AsterixDB does not support approximate nearest
> > > neighbor queries or similarity search on vector embeddings. This
> > > proposal presents a first design of a tree-based vector index supporting
> > > top-k ANN queries that is fully compatible with the LSM structure of
> > > AsterixDB's storage. As part of this proposal we provide support for:
> > >
> > > * Adding vector distance functions to support K-Nearest Neighbor (KNN)
> > > queries
> > > * Adding a vector index to support ANN queries
> > > * Adding support for INCLUDE fields in the vector index to better
> > > support filtered similarity search.
> > >
> > > APE:
> > >
> > > https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+31%3A+Vector+Index
> > >
> > > Thanks,
> > > Shiva
> > >
> > > --
> > > Shiva Jahangiri
> > > Assistant Professor in Computer Science and Engineering Department
> > > Santa Clara University
> > >
> >
>
