Hi Taewoo,

Thanks for the great questions.

With regard to the distributed top-k, each data partition returns its
local top-k result (or all of its records, if it has fewer than k), and we
then compute the global top-k from these local results (somewhat similar
to group by).
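
A minimal Python sketch of this two-phase pattern, for illustration only.
It is not AsterixDB code: the function names (squared_l2, local_top_k,
global_top_k) are hypothetical, and squared Euclidean distance is an
assumed metric.

```python
import heapq
import itertools


def squared_l2(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_top_k(records, query, k, distance=squared_l2):
    """Per-partition step: return up to k (distance, record_id) pairs,
    nearest first. A partition with fewer than k records returns all of them."""
    scored = ((distance(vec, query), rid) for rid, vec in records)
    return heapq.nsmallest(k, scored)


def global_top_k(local_results, k):
    """Merge step: each local list is already sorted by distance, so
    heapq.merge streams them in order and we keep the first k."""
    return list(itertools.islice(heapq.merge(*local_results), k))
```

Each partition sends at most k candidates, so the merge step sees at most
(number of partitions * k) entries regardless of the data size.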

With regard to memory usage, the static part has to remain in memory, and
our experiments have shown that its size is small compared to the size of
the data: for 18 GB of data stored in one data partition (1 million
records, each with a 960-dimensional embedding), the static part takes
11 MB of memory. The reason our index is not memory hungry is that only
the static part holds embeddings. The data pages in the dynamic part,
where records are inserted, do not store each record's embedding; instead
they store its distance to the cluster’s centroid. We will later explore
storing quantized vectors for each record (which helps reduce execution
time by sending fewer records to the primary index for distance
calculations), and that might change the size of the dynamic section. It
is important to note that each time a new memory component is created, the
static structure is copied into the memory component and the dynamic part
is then filled with the incoming data.
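
As a rough back-of-the-envelope check on these numbers, here is a short
Python sketch. Two things in it are assumptions, not measurements: each
vector component is taken to be a 4-byte float, and each stored distance
is taken to be a single 4-byte float. The per-record key and page overhead
are ignored.

```python
DIMS = 960                  # embedding dimensions (from the experiment above)
NUM_RECORDS = 1_000_000     # records in the partition (from the experiment above)
STATIC_BYTES = 11 * 10**6   # 11 MB static part, in decimal megabytes
BYTES_PER_FLOAT = 4         # assumption: float32 components and distances

# Size of one full embedding under the float32 assumption.
embedding_bytes = DIMS * BYTES_PER_FLOAT

# Cost of keeping every record's full embedding in the dynamic part.
raw_embeddings_bytes = NUM_RECORDS * embedding_bytes

# Cost of keeping only one distance value per record instead.
distance_only_bytes = NUM_RECORDS * BYTES_PER_FLOAT

# Upper bound on how many full embeddings fit in the 11 MB static part.
max_static_vectors = STATIC_BYTES // embedding_bytes

print(f"full embeddings:   {raw_embeddings_bytes / 10**9:.2f} GB")
print(f"distances only:    {distance_only_bytes / 10**6:.2f} MB")
print(f"static capacity:   {max_static_vectors} vectors at most")
```

Under these assumptions, storing one distance per record is smaller than
storing the full embedding by a factor equal to the number of dimensions.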



Best,
Shiva

Shiva Jahangiri
Assistant Professor in Computer Science and Engineering Department
Santa Clara University



On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:

> Hi Shiva,
>
> This proposal looks good.
>
> I have two questions (sorry if I missed something):
>
> How are we going to handle distributed execution when dealing with top-K
> ANN?
> What does the memory component look like in terms of configurable size? My
> naive understanding is that a vector index itself is memory-hungry.
>
> Best,
> Taewoo
>
>
> On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]>
> wrote:
>
> > Hi all,
> >
> > Initiating a discussion on adding a vector index to AsterixDB to support
> > approximate nearest neighbor (ANN) queries.
> >
> > Feature: Adding Vector Index
> >
> > Details: Currently AsterixDB does not support approximate nearest
> > neighbor queries and similarity search on vector embeddings. This
> > proposal suggests a first design of a tree-based vector index supporting
> > top-k ANN queries, which is fully compatible with the LSM structure of
> > AsterixDB's storage. As part of this proposal we provide support for:
> >
> > * Adding vector distance functions to support K-Nearest Neighbor (KNN)
> > queries
> > * Adding vector index to support ANN queries
> > * Adding support for INCLUDE fields in vector index to better support
> > filtered similarity search.
> >
> > APE:
> >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> >
> > Thanks,
> > Shiva
> >
> > --
> > Shiva Jahangiri
> > Assistant Professor in Computer Science and Engineering Department
> > Santa Clara University
> >
>
