Hi all,

I have a branch of vector search based on cep-7-sai at
https://github.com/datastax/cassandra/tree/cep-vsearch. Compared to the
original POC branch, this one is based on the SAI code that will be
mainline soon, and handles distributed scatter/gather. Updates and deletes
to vector values are still not supported.

I also put together a demo that uses this branch to provide context to
OpenAI’s GPT, available here: https://github.com/jbellis/cassgpt. Here is
the query that gets executed:

    SELECT id, start, end, text
    FROM {self.keyspace}.{self.table}
    WHERE embedding ANN OF %s
    LIMIT %s

The more I used the proposed `ANN OF` syntax, the less I liked it. This is
because we don’t want an actual boolean predicate; we just want to order
results. Put another way, `ANN OF` will include all rows of the table
given a high enough `LIMIT`, and that makes it a bad fit for expression
processing that expects to be able to filter out rows before it starts
LIMIT-ing. And in fact the code to support executing the query looks
suspiciously like what you’d want for `ORDER BY`.

I propose that we adopt `ORDER BY` syntax, supporting it for vector indexes
first and eventually for all SAI indexes. So this query would become

    SELECT id, start, end, text
    FROM {self.keyspace}.{self.table}
    ORDER BY embedding ANN OF %s
    LIMIT %s

And it would compose with other SAI indexes with syntax like

    SELECT id, start, end, text
    FROM {self.keyspace}.{self.table}
    WHERE publish_date > %s
    ORDER BY embedding ANN OF %s
    LIMIT %s

Related work:

This is similar to the approach used by pgvector, except they invented the
symbolic operator `<->` that has the same semantics as `ANN OF`. I am okay
with adopting their operator, but I think ANN OF is more readable.
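
To make the ordering argument concrete, here is a minimal Python sketch of the semantics being proposed. It is not how the branch executes queries: it ranks rows by brute force rather than through an approximate index, the choice of cosine similarity is an assumption (the similarity function is not specified above), and the names `cosine_similarity`, `order_by_ann`, and the sample rows are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def order_by_ann(rows, query, limit, where=None):
    """Model of `WHERE ... ORDER BY embedding ANN OF ? LIMIT ?`.

    The optional `where` predicate filters rows first; the vector
    comparison only ranks the survivors and never rejects a row.
    """
    candidates = [r for r in rows if where is None or where(r)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query),
        reverse=True,  # most similar first
    )
    return ranked[:limit]


# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "publish_date": 2019, "embedding": [1.0, 0.0]},
    {"id": 2, "publish_date": 2021, "embedding": [0.8, 0.6]},
    {"id": 3, "publish_date": 2023, "embedding": [0.0, 1.0]},
]
query = [1.0, 0.1]

top_two = [r["id"] for r in order_by_ann(rows, query, limit=2)]
everything = [r["id"] for r in order_by_ann(rows, query, limit=10)]
recent = [
    r["id"]
    for r in order_by_ann(
        rows, query, limit=1, where=lambda r: r["publish_date"] > 2020
    )
]
```

With a `LIMIT` at least as large as the table, `everything` contains every row, which is why the vector clause behaves like an ordering rather than a filter. The `recent` query shows the composition case: the `publish_date` predicate removes rows, and the ANN clause only decides which of the remaining rows come first.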
-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
