Hi Shiva,

Thanks for your reply.
Somehow I got confused about top-K. I thought each partition could have an
overlapping portion of the static part. So, will each partition be processed
on a single node?

Regarding the memory part, I'm glad to know that the size is not that huge. :-)

Best,
Taewoo

On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]> wrote:

> Hi Taewoo,
>
> Thanks for the great questions.
>
> With regard to the distributed top-k, each data partition will return its
> top-k result (if it has at least k records), and then we get the global
> top-k based on these local ones (somewhat similar to group by).
>
> With regard to the memory usage, the static part has to remain in memory,
> and our experiments have shown that its size is not that large compared to
> the size of the data (for 18 GB of data stored in one data partition,
> 1 million records each with a 960-dimensional embedding, the static part
> takes 11 MB of memory). The reason our index is not memory-hungry is that
> we only keep embeddings in the static part; the data pages in the dynamic
> part, where the records are inserted, do not store the embedding of the
> record but instead store its distance to the cluster’s centroid. We will
> later explore storing quantized vectors for each record (which helps
> reduce execution time by sending fewer records to the primary index for
> distance calculations), and that might change the size of the dynamic
> section. It is important to note that each time a new memory component is
> created, the static structure is copied into the memory component and the
> dynamic part is filled with the incoming data.
>
> Best,
> Shiva
>
> Shiva Jahangiri
> Assistant Professor in Computer Science and Engineering Department
> Santa Clara University
>
>
> On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
>
> > Hi Shiva,
> >
> > This proposal looks good.
> >
> > I have two questions (sorry if I missed them):
> >
> > How are we going to handle distributed execution when dealing with top-K
> > ANN?
> > What does the memory component look like in terms of configurable size?
> > My naive understanding is that a vector index itself is memory-hungry.
> >
> > Best,
> > Taewoo
> >
> >
> > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Initiating a discussion on adding a vector index to AsterixDB to
> > > support approximate nearest neighbor (ANN) queries.
> > >
> > > Feature: Adding Vector Index
> > >
> > > Details: Currently AsterixDB does not support approximate nearest
> > > neighbor queries or similarity search on vector embeddings. This
> > > proposal presents a first design of a tree-based vector index
> > > supporting top-k ANN queries that is fully compatible with the LSM
> > > structure of AsterixDB's storage. As part of this proposal we provide
> > > support for:
> > >
> > > * Adding vector distance functions to support K-Nearest Neighbor (KNN)
> > > queries
> > > * Adding a vector index to support ANN queries
> > > * Adding support for INCLUDE fields in the vector index to better
> > > support filtered similarity search.
> > >
> > > APE:
> > >
> > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> > >
> > > Thanks,
> > > Shiva
> > >
> > > --
> > > Shiva Jahangiri
> > > Assistant Professor in Computer Science and Engineering Department
> > > Santa Clara University
> > >
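
For concreteness, the local/global top-k evaluation Shiva describes above
(each partition returns its own top-k, then the global top-k is taken over
those local results) can be sketched roughly as follows. This is only an
illustrative sketch in plain Java under that description; the class and
method names (TopKMergeSketch, Candidate, mergeLocalTopK) and the
single-coordinator merge are assumptions for the example, not AsterixDB
internals.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Illustrative sketch of merging per-partition top-k results into a global
 * top-k. Each partition independently returns its k closest candidates; the
 * merge keeps the k smallest distances overall. Names here are hypothetical,
 * not AsterixDB code.
 */
public class TopKMergeSketch {

    /** A candidate result: a primary key plus its distance to the query vector. */
    record Candidate(String primaryKey, double distance) {}

    /** Global top-k = the k smallest-distance candidates across all local top-k lists. */
    static List<Candidate> mergeLocalTopK(List<List<Candidate>> localTopKs, int k) {
        // Max-heap of size <= k keyed on distance: the root is the "worst" of the
        // current best k and is evicted whenever a closer candidate arrives.
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>(Comparator.comparingDouble(Candidate::distance).reversed());
        for (List<Candidate> local : localTopKs) {
            for (Candidate c : local) {
                if (heap.size() < k) {
                    heap.offer(c);
                } else if (c.distance() < heap.peek().distance()) {
                    heap.poll();
                    heap.offer(c);
                }
            }
        }
        List<Candidate> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(Candidate::distance));
        return result;
    }

    public static void main(String[] args) {
        // Two partitions, k = 2: each returns its local top-2; the merge keeps the
        // two globally closest candidates (p2-r9 at 0.05 and p1-r7 at 0.12).
        List<List<Candidate>> locals = List.of(
                List.of(new Candidate("p1-r7", 0.12), new Candidate("p1-r3", 0.40)),
                List.of(new Candidate("p2-r9", 0.05), new Candidate("p2-r1", 0.33)));
        System.out.println(mergeLocalTopK(locals, 2));
    }
}

Merging only the local top-k lists is enough to recover the exact global
top-k, because any record missing from its partition's local top-k already
has at least k closer records in that same partition and so cannot appear in
the global result.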

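As a rough sanity check on the memory figures quoted above (an 11 MB static
part for 18 GB of data, i.e. 1 million records with 960-dimensional
embeddings), here is a back-of-the-envelope sketch. It assumes the static
part holds only cluster-centroid embeddings stored as 4-byte floats; the
thread only says that the static part contains embeddings and that the
dynamic part stores per-record distances to centroids, so the exact layout
and float width are assumptions here.

/**
 * Back-of-the-envelope check of the static-part size quoted in the thread,
 * assuming it stores only cluster-centroid embeddings as 4-byte floats.
 */
public class StaticPartSizeSketch {
    public static void main(String[] args) {
        int dims = 960;                              // embedding dimensionality from the example
        int bytesPerFloat = 4;                       // ASSUMPTION: single-precision floats
        long staticPartBytes = 11L * 1024 * 1024;    // ~11 MB reported for the static part

        long bytesPerCentroid = (long) dims * bytesPerFloat;        // 3,840 bytes per centroid
        long approxCentroids = staticPartBytes / bytesPerCentroid;  // roughly 3,000 centroids

        System.out.printf("~%d bytes per centroid => roughly %d centroid vectors in 11 MB%n",
                bytesPerCentroid, approxCentroids);

        // For comparison, the raw embeddings alone would be about
        // 1,000,000 records * 960 dims * 4 bytes ~= 3.84 GB, which is why keeping
        // only centroid embeddings in memory (and per-record distances in the
        // dynamic part) keeps the static structure small.
    }
}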