+1 to this APE. Supporting ANN queries using a vector index,
especially a novel one like this, is awesome.
I think there are still some loose ends about the distance functions
and the exact form of some of the WITH clauses, but these are minor
details; I don't think they need to block acceptance.

On Wed, Jan 14, 2026 at 11:57 AM Taewoo Kim <[email protected]> wrote:
>
> Thanks for the clarification!
>
> Best,
> Taewoo
>
>
> On Wed, Jan 14, 2026 at 11:36 AM Shiva Jahangiri <[email protected]> wrote:
>
> > Hi Taewoo,
> >
> > Thanks! So each data partition will have its own vector index as a
> > secondary index, and each data partition does its own sampling, creates
> > its own static structure, etc., using only its own data. That means there
> > is no overlap or connection between the data or static structures of the
> > vector indexes of different data partitions.
> >
> > For top-k, we basically ask each data partition to give us its top-k
> > results. If we have N data partitions, we will get up to N*k results, and
> > a global step then selects the final top-k from them.
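
The per-partition top-k followed by a global merge described above can be sketched in a few lines of Python. This is only an illustrative sketch, not AsterixDB code: the `Hit` tuple shape and the function names `local_top_k` and `global_top_k` are hypothetical, and a smaller distance is assumed to mean a closer match.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical result shape: (distance, record_id); smaller distance = closer.
Hit = Tuple[float, str]


def local_top_k(hits: Iterable[Hit], k: int) -> List[Hit]:
    """Return the k closest hits of one data partition (fewer if it has < k)."""
    return heapq.nsmallest(k, hits)


def global_top_k(per_partition: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge up to N*k local candidates and keep the k closest overall."""
    candidates = (hit for part in per_partition for hit in part)
    return heapq.nsmallest(k, candidates)


# Three partitions; the last one holds fewer than k records.
partitions = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.1, "b1"), (0.8, "b2")],
    [(0.3, "c1")],
]
k = 2
local_results = [local_top_k(p, k) for p in partitions]
result = global_top_k(local_results, k)
print(result)
```

Because each partition only sees its own data, the global step never needs more than k candidates from any one partition to produce the final answer.
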
> >
> > Each node can have multiple data partitions which makes it simple in the
> > shared-nothing architecture. In the cloud mode, the code is modified to
> > make sure that the search goes through all data partitions even if multiple
> > of them are managed by a single compute node.
> >
> > Best,
> > Shiva
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 14, 2026 at 10:39 AM Taewoo Kim <[email protected]> wrote:
> >
> > > Hi Shiva,
> > >
> > > Thanks for your reply.
> > >
> > > Somehow I got confused about TOP-K. I thought each partition could have
> > > an overlapping portion from the static part. So, will each partition be
> > > processed on a single node?
> > >
> > > Regarding the memory part, I'm glad to know that the size is not that
> > > huge. :-)
> > >
> > > Best,
> > > Taewoo
> > >
> > >
> > > On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]>
> > > wrote:
> > >
> > > > Hi Taewoo,
> > > >
> > > > Thanks for the great questions.
> > > >
> > > > With regard to the distributed top-k, each data partition will return
> > > > its top-k results (if it has at least k records), and then we get the
> > > > global top-k from these local ones (somewhat similar to group by).
> > > >
> > > > With regard to the memory usage, the static part has to remain in
> > > > memory, and our experiments have shown that its size is not that large
> > > > compared to the size of the data (for 18GB of data stored in one data
> > > > partition, 1 million records each with a 960-dimensional embedding, the
> > > > static part takes 11MB of memory). The reason that our index is not
> > > > memory hungry is that we only have embeddings in the static part; the
> > > > data pages in the dynamic part, where the records are inserted, do not
> > > > store the embedding of each record but instead store its distance to
> > > > the cluster’s centroid. We will later explore storing the quantized
> > > > vectors for each record (this helps reduce execution time by sending
> > > > fewer records to the primary index for distance calculations), and that
> > > > might change the size of the dynamic section. It is important to note
> > > > that each time a new memory component is created, the static structure
> > > > is copied into the memory component and the dynamic part is filled with
> > > > the incoming data.
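
As a rough back-of-the-envelope check on why storing one distance per record instead of the full embedding keeps the index small, the figures quoted above can be plugged into a few lines of Python. The record count, dimensionality, and 11MB static size come from the message; the 4-byte float32 width for both embedding components and stored distances is an assumption, and keys and page overhead are ignored.

```python
# Figures quoted in the thread.
num_records = 1_000_000
dimensions = 960
static_part_bytes = 11 * 10**6  # "11MB", read here as decimal megabytes (assumption)

# Assumptions, not stated in the thread.
bytes_per_component = 4  # float32 embedding components
bytes_per_distance = 4   # one float32 distance-to-centroid per record

# Size if every record kept its full embedding vs. only a single distance.
raw_embedding_bytes = num_records * dimensions * bytes_per_component
distance_only_bytes = num_records * bytes_per_distance
saving_factor = raw_embedding_bytes // distance_only_bytes

print(f"full embeddings: {raw_embedding_bytes / 1e9:.2f} GB")
print(f"distances only:  {distance_only_bytes / 1e6:.1f} MB")
print(f"static part:     {static_part_bytes / 1e6:.1f} MB")
print(f"per-record saving factor: {saving_factor}x")
```

Under these assumptions the full embeddings alone would take about 3.84 GB, while one distance per record takes about 4 MB, which is in the same range as the reported static part.
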
> > > >
> > > >
> > > >
> > > > Best,
> > > > Shiva
> > > >
> > > > Shiva Jahangiri
> > > > Assistant Professor in Computer Science and Engineering Department
> > > > Santa Clara University
> > > >
> > > >
> > > >
> > > > On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
> > > >
> > > > > Hi Shiva,
> > > > >
> > > > > This proposal looks good.
> > > > >
> > > > > I have two questions (sorry if I missed these in the proposal):
> > > > >
> > > > > How are we going to handle distributed execution when dealing with
> > > > > top-K ANN?
> > > > > What does the memory component look like in terms of configurable
> > > > > size? My naive understanding is that a vector index itself is
> > > > > memory-hungry.
> > > > >
> > > > > Best,
> > > > > Taewoo
> > > > >
> > > > >
> > > > > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Initiating discussion to add a vector index to AsterixDB to support
> > > > > > approximate nearest neighbor (ANN) queries.
> > > > > >
> > > > > >  Feature: Adding Vector Index
> > > > > >
> > > > > > Details: Currently AsterixDB does not support approximate nearest
> > > > > > neighbor queries or similarity search on vector embeddings. This
> > > > > > proposal suggests the first design of a tree-based vector index
> > > > > > supporting top-k ANN queries, which is fully compatible with the LSM
> > > > > > structure of AsterixDB's storage. As part of this proposal we
> > > > > > provide support for:
> > > > > >
> > > > > > * Adding vector distance functions to support K-Nearest Neighbor
> > > > > >   (KNN) queries
> > > > > > * Adding a vector index to support ANN queries
> > > > > > * Adding support for INCLUDE fields in the vector index to better
> > > > > >   support filtered similarity search.
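
To make the first bullet concrete, here is a minimal Python sketch of what vector distance functions and an exact (brute-force) KNN query look like. Euclidean and cosine distance are used only as common examples; the thread does not say which distance functions the APE defines, and the names `euclidean_distance`, `cosine_distance`, and `knn` are hypothetical helpers, not AsterixDB or SQL++ functions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query: Vector, vectors: List[Vector], k: int,
        distance: Callable[[Vector, Vector], float] = euclidean_distance
        ) -> List[Tuple[float, int]]:
    """Exact KNN by scanning every vector; returns (distance, position) pairs."""
    scored = [(distance(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]


vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
nearest = knn([0.0, 0.0], vectors, k=2)
print(nearest)
```

An exact scan like this compares the query against every record; the point of a vector index is to answer the same kind of query approximately (ANN) without touching all of the data.
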
> > > > > >
> > > > > > APE:
> > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> > > > > >
> > > > > > Thanks,
> > > > > > Shiva
> > > > > >
> > > > > > --
> > > > > > Shiva Jahangiri
> > > > > > Assistant Professor in Computer Science and Engineering Department
> > > > > > Santa Clara University
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Shiva Jahangiri
> > Assistant Professor in Computer Science and Engineering Department
> > Santa Clara University
> >
