Re: [DISCUSS][APE] Adding Vector Index

Mike Carey Fri, 30 Jan 2026 12:13:23 -0800

Likewise!  +1

On Fri, Jan 30, 2026 at 9:58 AM Ian Maxon <[email protected]> wrote:


> +1 to this APE. Supporting ANN queries using a vector index,
> especially a novel one like this, is awesome.
> I think there are still some loose ends about the distance functions
> and some of the exact form of the WITH clauses but these are minor
> details; I don't think they need to block acceptance.
>
> On Wed, Jan 14, 2026 at 11:57 AM Taewoo Kim <[email protected]> wrote:
> >
> > Thanks for the clarification!
> >
> > Best,
> > Taewoo
> >
> >
> > On Wed, Jan 14, 2026 at 11:36 AM Shiva Jahangiri <[email protected]> wrote:
> >
> > > Hi Taewoo,
> > >
> > > Thanks! So each data partition will have its own vector index as a
> > > secondary index, and each data partition does its own sampling of data,
> > > creates its own static structure, etc., using its own data. That means
> > > there is no overlap or connection between the data or static structures
> > > of the vector indexes of different data partitions.
> > >
> > > For top-k, we basically ask each data partition to give us its top-k
> > > results. If we have N data partitions, we will get up to N*k results,
> > > from which a global step then selects the final top-k.
> > >
> > > Each node can have multiple data partitions, which makes it simple in
> > > the shared-nothing architecture. In the cloud mode, the code is modified
> > > to make sure that the search goes through all data partitions even if
> > > multiple of them are managed by a single compute node.
> > >
> > > Best,
> > > Shiva
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jan 14, 2026 at 10:39 AM Taewoo Kim <[email protected]> wrote:
> > >
> > > > Hi Shiva,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Somehow I got confused about TOP-K. I thought each partition could
> > > > have an overlapping portion from the static part. So, will each
> > > > partition be processed on a single node?
> > > >
> > > > Regarding the memory part, I'm glad to know that the size is not that
> > > > huge. :-)
> > > >
> > > > Best,
> > > > Taewoo
> > > >
> > > >
> > > > On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Taewoo,
> > > > >
> > > > > Thanks for the great questions.
> > > > >
> > > > > With regard to the distributed top-k, each data partition will return
> > > > > its top-k result (if it has at least k records) and then we get the
> > > > > global top-k based on these local ones (somewhat similar to group by).
> > > > >
> > > > > With regard to memory usage, the static part has to remain in memory,
> > > > > and our experiments have shown that its size is not that large
> > > > > compared to the size of the data: for 18 GB of data stored in one data
> > > > > partition (1 million records, each with a 960-dimensional embedding),
> > > > > the static part takes 11 MB of memory. The reason our index is not
> > > > > memory hungry is that we only keep embeddings in the static part. The
> > > > > data pages in the dynamic part, where records are inserted, do not
> > > > > store each record's embedding; instead they store its distance to the
> > > > > cluster's centroid. We will later explore storing the quantized vector
> > > > > for each record (which helps reduce execution time by sending fewer
> > > > > records to the primary index for distance calculations), and that
> > > > > might change the size of the dynamic section. It is important to note
> > > > > that each time a new memory component is created, the static structure
> > > > > is copied into the memory component, and the dynamic part is then
> > > > > filled with the incoming data.
> > > > >
> > > > >
> > > > >
> > > > > Best,
> > > > > Shiva
> > > > >
> > > > > Shiva Jahangiri
> > > > > Assistant Professor in Computer Science and Engineering Department
> > > > > Santa Clara University
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
> > > > >
> > > > > > Hi Shiva,
> > > > > >
> > > > > > This proposal looks good.
> > > > > >
> > > > > > I have two questions (sorry if I missed them):
> > > > > >
> > > > > > How are we going to handle distributed execution when dealing with
> > > > > > top-K ANN?
> > > > > > What does the memory component look like in terms of configurable
> > > > > > size? My naive understanding is that a vector index itself is
> > > > > > memory-hungry.
> > > > > >
> > > > > > Best,
> > > > > > Taewoo
> > > > > >
> > > > > >
> > > > > > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Initiating discussion to add a vector index in AsterixDB to support
> > > > > > > approximate nearest neighbor (ANN) queries.
> > > > > > >
> > > > > > > Feature: Adding Vector Index
> > > > > > >
> > > > > > > Details: Currently AsterixDB does not support approximate nearest
> > > > > > > neighbor queries or similarity search on vector embeddings. This
> > > > > > > proposal suggests the first design of a tree-based vector index
> > > > > > > supporting top-k ANN queries, which is fully compatible with the LSM
> > > > > > > structure of AsterixDB's storage. As part of this proposal we provide
> > > > > > > support for:
> > > > > > >
> > > > > > > * Adding vector distance functions to support K-Nearest Neighbor (KNN)
> > > > > > >   queries
> > > > > > > * Adding a vector index to support ANN queries
> > > > > > > * Adding support for INCLUDE fields in the vector index to better
> > > > > > >   support filtered similarity search.
> > > > > > >
> > > > > > > APE:
> > > > > > >
> > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shiva
> > > > > > >
> > > > > > > --
> > > > > > > Shiva Jahangiri
> > > > > > > Assistant Professor in Computer Science and Engineering Department
> > > > > > > Santa Clara University
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Shiva Jahangiri
> > > Assistant Professor in Computer Science and Engineering Department
> > > Santa Clara University
> > >
>
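
As an illustration of the distributed top-k step discussed in this thread (each data partition returns its local top-k, then a global step picks the final top-k from at most N*k candidates), here is a minimal Python sketch. It is not AsterixDB code: the `Hit` tuple layout, the function names `local_top_k` and `global_top_k`, and the sample distances are all assumptions made for the example.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical result shape: (distance to the query vector, record id).
# A smaller distance means a closer match.
Hit = Tuple[float, str]


def local_top_k(partition_hits: Iterable[Hit], k: int) -> List[Hit]:
    """Top-k of a single data partition.

    Returns fewer than k hits when the partition holds fewer than k records.
    """
    return heapq.nsmallest(k, partition_hits)


def global_top_k(per_partition: List[List[Hit]], k: int) -> List[Hit]:
    """Select the final top-k from the (at most N*k) local candidates."""
    candidates = [hit for hits in per_partition for hit in hits]
    return heapq.nsmallest(k, candidates)


# Three made-up partitions; the last one has fewer than k records.
partitions = [
    [(0.42, "a1"), (0.10, "a2"), (0.77, "a3")],
    [(0.05, "b1"), (0.61, "b2")],
    [(0.30, "c1")],
]
k = 2
local_results = [local_top_k(p, k) for p in partitions]
result = global_top_k(local_results, k)
print(result)
```

Because the partitions share no data or static structure, the local step needs no coordination; only the small candidate lists have to reach the global step.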

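To make the memory argument from the thread more concrete (the dynamic part stores each record's distance to its cluster centroid instead of the full embedding), the following back-of-the-envelope Python sketch compares the two options for the quoted workload of 1 million records with 960 dimensions. The 4-byte float32 width for both vector components and distances is an assumption that the thread does not state, and page overhead, keys, and other per-record fields are ignored.

```python
# Figures taken from the thread.
NUM_RECORDS = 1_000_000
DIMENSIONS = 960

# Assumption (not from the thread): float32 components and float32 distances.
BYTES_PER_FLOAT32 = 4

# Storing the full embedding for every record.
full_embedding_bytes = NUM_RECORDS * DIMENSIONS * BYTES_PER_FLOAT32

# Storing only one distance-to-centroid value per record.
distance_only_bytes = NUM_RECORDS * BYTES_PER_FLOAT32


def to_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes."""
    return num_bytes / 1_000_000


print(f"full embeddings: {to_mb(full_embedding_bytes):,.0f} MB")
print(f"distance only:   {to_mb(distance_only_bytes):,.0f} MB")
```

Under these assumptions the full embeddings come to 3,840 MB against 4 MB for the per-record distances, a factor equal to the number of dimensions.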