+1 to the flow of:
1: ORDER BY?
2: Oh. Yeah. That *does* make sense. ;)

(sending from fastmail in the hopes the image doesn't get stripped. Thanks ASF smtp server...)

~Josh

On Wed, May 24, 2023, at 1:00 AM, Jeremiah D Jordan wrote:
> At first I wasn’t sure about using ORDER BY, but the more I think about what
> is actually going on, I think it does make sense.
>
> This also matches up with some ideas that have been floating around about
> being able to ORDER BY a sorted SAI index.
>
> -Jeremiah
>
>> On May 22, 2023, at 2:28 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I have a branch of vector search based on cep-7-sai at
>> https://github.com/datastax/cassandra/tree/cep-vsearch. Compared to the
>> original POC branch, this one is based on the SAI code that will be mainline
>> soon, and handles distributed scatter/gather. Updates and deletes to vector
>> values are still not supported.
>>
>> I also put together a demo that uses this branch to provide context to
>> OpenAI’s GPT, available here: https://github.com/jbellis/cassgpt.
>>
>> Here is the query that gets executed:
>>
>>     SELECT id, start, end, text
>>     FROM {self.keyspace}.{self.table}
>>     WHERE embedding ANN OF %s
>>     LIMIT %s
>>
>> The more I used the proposed `ANN OF` syntax, the less I liked it. This is
>> because we don’t want an actual boolean predicate; we just want to order
>> results. Put another way, `ANN OF` will include all rows of the table given
>> a high enough `LIMIT`, and that makes it a bad fit for expression processing
>> that expects to be able to filter out rows before it starts LIMIT-ing. And
>> in fact the code to support executing the query looks suspiciously like what
>> you’d want for `ORDER BY`.
>>
>> I propose that we adopt `ORDER BY` syntax, supporting it for vector indexes
>> first and eventually for all SAI indexes.
>> So this query would become
>>
>>     SELECT id, start, end, text
>>     FROM {self.keyspace}.{self.table}
>>     ORDER BY embedding ANN OF %s
>>     LIMIT %s
>>
>> And it would compose with other SAI indexes with syntax like
>>
>>     SELECT id, start, end, text
>>     FROM {self.keyspace}.{self.table}
>>     WHERE publish_date > %s
>>     ORDER BY embedding ANN OF %s
>>     LIMIT %s
>>
>> Related work:
>>
>> This is similar to the approach used by pgvector, except they invented the
>> symbolic operator `<->` that has the same semantics as `ANN OF`. I am okay
>> with adopting their operator, but I think ANN OF is more readable.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
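
For readers following the thread, the point that `ANN OF` orders rows rather than filtering them can be sketched in a few lines of Python. This is only an illustration, not the code in the cep-vsearch branch: it uses an exact brute-force sort in place of an approximate index, assumes cosine similarity as the metric (the thread does not name one), and the function names and sample rows are hypothetical.

```python
import math


def cosine_similarity(a, b):
    # Assumed metric for illustration; the thread does not specify one.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def order_by_ann(rows, query, limit, predicate=None):
    # WHERE: an ordinary boolean predicate removes rows up front.
    candidates = [r for r in rows if predicate is None or predicate(r)]
    # ORDER BY ... ANN OF: every remaining row gets a rank; none is rejected.
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query),
        reverse=True,
    )
    # LIMIT: the only thing that cuts the ranked list short.
    return ranked[:limit]


# Hypothetical sample data.
rows = [
    {"id": 1, "publish_date": 2019, "embedding": [1.0, 0.0]},
    {"id": 2, "publish_date": 2021, "embedding": [0.8, 0.6]},
    {"id": 3, "publish_date": 2022, "embedding": [0.0, 1.0]},
    {"id": 4, "publish_date": 2023, "embedding": [-1.0, 0.0]},
]
query = [1.0, 0.0]

top_two = order_by_ann(rows, query, limit=2)
everything = order_by_ann(rows, query, limit=10)
recent_top_two = order_by_ann(
    rows, query, limit=2, predicate=lambda r: r["publish_date"] > 2020
)
```

With a `limit` at least as large as the table, `everything` contains all four rows, which is why the operation behaves like an ordering and not like a predicate. The `predicate` argument plays the role of the `WHERE publish_date > %s` clause in the composed query.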