Re: [DISCUSS][APE] Adding Vector Index

Shiva Jahangiri Wed, 14 Jan 2026 11:36:06 -0800

Hi Taewoo,

Thanks! So each data partition will have its own vector index as a
secondary index, and each data partition does its own sampling, builds its
own static structure, etc. using only its own data. That means there is no
overlap or connection between the data or static structures of the vector
indexes of different data partitions.


For top-k, we basically ask each data partition to return its local top-k
results. With N data partitions we get at most N*k results, and a global
step then selects the final top-k from them.
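
A minimal sketch of this two-step top-k, assuming each partition produces
(distance, record id) pairs where a smaller distance means a closer match.
The names local_top_k and global_top_k are hypothetical and are not part of
AsterixDB:

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

# Hypothetical result shape: (distance, record id); smaller distance = closer.
Hit = Tuple[float, str]


def local_top_k(candidates: Iterable[Hit], k: int) -> List[Hit]:
    """Per-partition step: keep the k closest hits, sorted by distance."""
    return heapq.nsmallest(k, candidates)


def global_top_k(per_partition: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Global step: merge the sorted local lists and keep the k closest."""
    merged = heapq.merge(*per_partition)  # inputs are already sorted ascending
    return list(islice(merged, k))


if __name__ == "__main__":
    partitions = [
        [(0.9, "a"), (0.1, "b"), (0.5, "c")],
        [(0.2, "d"), (0.8, "e")],
        [(0.05, "f")],  # a partition may hold fewer than k records
    ]
    local_results = [local_top_k(p, 2) for p in partitions]
    print(global_top_k(local_results, 2))
```

Because the partitions share no data, keeping only k hits per partition
loses nothing with respect to the candidates each partition produced: any
hit in the global top-k is also in the local top-k of its own partition.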

Each node can host multiple data partitions, which keeps this simple in the
shared-nothing architecture. In cloud mode, the code is modified to make
sure that the search goes through all data partitions even if multiple of
them are managed by a single compute node.

Best,
Shiva






On Wed, Jan 14, 2026 at 10:39 AM Taewoo Kim <[email protected]> wrote:

> Hi Shiva,
>
> Thanks for your reply.
>
> Somehow I got confused about TOP-K. I thought each partition could have an
> overlapping portion from the static part. So, will each partition be
> processed on a single node?
>
> Regarding the memory part, I'm glad to know that the size is not that huge.
> :-)
>
> Best,
> Taewoo
>
>
> On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]>
> wrote:
>
> > Hi Taewoo,
> >
> > Thanks for the great questions.
> >
> > With regard to the distributed top-k, each data partition will return its
> > top-k result (if it has at least k records) and then we get the global
> > top-k based on these local ones (somewhat similar to group by).
> >
> > With regard to the memory usage, the static part has to remain in
> > memory, and our experiments have shown that its size is not that large
> > compared to the size of the data (for 18GB of data stored in one data
> > partition, 1 million records each with a 960-dimensional embedding, the
> > static part takes 11MB of memory). The reason our index is not memory
> > hungry is that we only keep embeddings in the static part; the data
> > pages in the dynamic part, where the records are inserted, do not store
> > the embedding of a record but its distance to the cluster’s centroid.
> > We will later explore storing the quantized vectors for each record
> > (which helps reduce execution time by sending fewer records to the
> > primary index for distance calculations), and that might change the
> > size of the dynamic section. It is important to note that each time a
> > new memory component is created, the static structure is copied into
> > the memory component and the dynamic part is filled with the incoming
> > data.
> >
> >
> >
> > Best,
> > Shiva
> >
> > Shiva Jahangiri
> > Assistant Professor in Computer Science and Engineering Department
> > Santa Clara University
> >
> >
> >
> > On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
> >
> > > Hi Shiva,
> > >
> > > This proposal looks good.
> > >
> > > I have two questions (sorry if I missed these in the proposal).
> > >
> > > How are we going to handle distributed execution when dealing with
> > > top-K ANN?
> > > What does the memory component look like in terms of configurable
> > > size? My naive understanding is that a vector index itself is
> > > memory-hungry.
> > >
> > > Best,
> > > Taewoo
> > >
> > >
> > > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Initiating discussion to add vector index in AsterixDB to support
> > > > approximate nearest neighbor (ANN) queries.
> > > >
> > > >  Feature: Adding Vector Index
> > > >
> > > > Details: Currently AsterixDB does not support approximate nearest
> > > > neighbor queries and similarity search on vector embeddings. This
> > > > proposal suggests the first design of a tree-based vector index
> > > > supporting top-k ANN queries that is fully compatible with the LSM
> > > > structure of AsterixDB's storage. As part of this proposal we
> > > > provide support for:
> > > >
> > > > * Adding vector distance functions to support K-Nearest Neighbor
> > > >   (KNN) queries
> > > > * Adding vector index to support ANN queries
> > > > * Adding support for INCLUDE fields in vector index to better support
> > > >   filtered similarity search.
> > > >
> > > > APE:
> > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> > > >
> > > > Thanks,
> > > > Shiva
> > > >
> > > > --
> > > > Shiva Jahangiri
> > > > Assistant Professor in Computer Science and Engineering Department
> > > > Santa Clara University
> > > >
> > >
> >
>


-- 
Shiva Jahangiri
Assistant Professor in Computer Science and Engineering Department
Santa Clara University
