*Hi all,

I have a branch of vector search based on cep-7-sai at
https://github.com/datastax/cassandra/tree/cep-vsearch. Compared to the
original POC branch, this one is based on the SAI code that will be
mainline soon, and handles distributed scatter/gather. Updates and deletes
to vector values are still not supported.

I also put together a demo that
uses this branch to provide context to OpenAI’s GPT, available here:
https://github.com/jbellis/cassgpt. Here is the query that gets
executed:

    SELECT id, start, end, text FROM {self.keyspace}.{self.table} WHERE embedding ANN OF %s LIMIT %s

The
more I used the proposed `ANN OF` syntax, the less I liked it. This is
because we don’t want an actual boolean predicate; we just want to order
results. Put another way, `ANN OF` will include all rows of the table
given a high enough `LIMIT`, and that makes it a bad fit for expression
processing that expects to be able to filter out rows before it starts
LIMIT-ing. And in fact the code to support executing the query looks
suspiciously like what you’d want for `ORDER BY`.

I propose that we adopt
`ORDER BY` syntax, supporting it for vector indexes first and eventually
for all SAI indexes. So this query would become

    SELECT id, start, end, text FROM {self.keyspace}.{self.table} ORDER BY embedding ANN OF %s LIMIT %s

And it would compose with other SAI indexes with syntax
like

    SELECT id, start, end, text FROM {self.keyspace}.{self.table} WHERE publish_date > %s ORDER BY embedding ANN OF %s LIMIT %s

Related work:

This is similar to the
approach used by pgvector, except they invented the symbolic operator `<->`
that has the same semantics as `ANN OF`. I am okay with adopting their
operator, but I think `ANN OF` is more readable.*
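
To make the ordering-versus-filtering distinction concrete, here is a minimal Python sketch. It is not code from the branch: it assumes an in-memory list of rows, exact brute-force cosine similarity in place of a real approximate index, and hypothetical names (`ann_order_by`, `ROWS`, `publish_year`). It only illustrates that a `WHERE` predicate removes rows, while `ANN OF` merely ranks whatever rows remain before `LIMIT` truncates them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ann_order_by(rows, query_vector, limit, predicate=None):
    """Hypothetical model of: WHERE <predicate> ORDER BY embedding ANN OF ? LIMIT ?"""
    # WHERE: a genuine boolean predicate drops rows before any ordering happens.
    candidates = [r for r in rows if predicate is None or predicate(r)]
    # ORDER BY ... ANN OF: ranks the remaining rows; it never rejects one.
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query_vector),
        reverse=True,
    )
    # LIMIT: truncate only after ordering.
    return ranked[:limit]


# Made-up sample data, for illustration only.
ROWS = [
    {"id": 1, "publish_year": 2021, "embedding": [1.0, 0.0]},
    {"id": 2, "publish_year": 2023, "embedding": [0.8, 0.6]},
    {"id": 3, "publish_year": 2022, "embedding": [0.0, 1.0]},
]

query = [1.0, 0.1]

# Ordering only: the two nearest rows.
top_two = ann_order_by(ROWS, query, limit=2)

# A high enough LIMIT returns every row, so ANN OF filters nothing.
everything = ann_order_by(ROWS, query, limit=100)

# Composed with a real predicate: filter first, then order the survivors.
recent = ann_order_by(ROWS, query, limit=1, predicate=lambda r: r["publish_year"] > 2021)
```

In this toy model `everything` contains all three rows, which is the behavior that makes `ANN OF` awkward as a `WHERE` predicate, while `recent` shows the filter-then-order composition that the `ORDER BY` form expresses directly.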
--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced