My guess is that I will be able to get this ready to upstream before the rest of CEP-7 goes in, so it would make sense to me to roll it into that.
On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:

Interesting proposal Jonathan. Will grok it over the weekend and play around with the branch.

Do you intend to make this part of CEP-7 or as an incremental update to SAI once it is committed?

On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

Happy Friday, everyone! Rich text formatting ahead; I've attached a PDF for those who prefer that.

I propose adding approximate nearest neighbor (ANN) vector search capability to Apache Cassandra via storage-attached indexes (SAI). This is a medium-sized effort that will significantly enhance Cassandra's functionality, particularly for AI use cases. This addition will not only provide a new and important feature for existing Cassandra users, but also attract new users to the community from the AI space, further expanding Cassandra's reach and relevance.

Introduction

Vector search is a powerful document search technique that enables developers to quickly find relevant content within an extensive collection of documents. It is useful as a standalone technique, but it is particularly hot now because it significantly enhances the performance of LLMs.

Vector search uses ML models to match the semantics of a question rather than just the words it contains, avoiding the classic false positives and false negatives associated with term-based search. Alessandro Benedetti gives some good examples in his excellent talk: https://www.youtube.com/watch?v=z-i8mOHAhlU

You can search across any set of vectors, which are just ordered sets of numbers. In the context of natural language queries and document search, we are specifically concerned with a type of vector called an embedding.
An embedding is a high-dimensional vector that captures the underlying semantic relationships and contextual information of words or phrases. Embeddings are generated by ML models trained for this purpose; OpenAI provides an API to do this, but open-source and self-hostable models like BERT are also popular. Creating more accurate and smaller embeddings are active research areas in ML.

Large language models (LLMs) can be described as a mile wide and an inch deep. They are not experts on any narrow domain (although they will hallucinate that they are, sometimes convincingly). You can remedy this by giving the LLM additional context for your query, but the context window is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very selective about giving the LLM the most relevant possible information.

Vector search is red-hot now because it allows us to easily answer the question "what are the most relevant documents to provide as context?" by performing a similarity search between the embedding vector of the query and those of your document universe. Doing exact search is prohibitively expensive, since you necessarily have to compare against each and every document; this is intractable when you have billions or trillions of docs. However, there are well-understood algorithms for turning this into a logarithmic problem if you are willing to accept approximately the most similar documents. This is the "approximate nearest neighbor" problem. (You will see this referred to as kNN, for k nearest neighbors, or ANN.)

Pinecone DB has a good example of what this looks like in Python code: https://docs.pinecone.io/docs/gen-qa-openai

Vector search is the foundation underlying effectively all of the AI applications that are launching now. This is particularly relevant to Apache Cassandra users, who tend to manage the types of large datasets that benefit the most from fast similarity search.
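To make the similarity-search step above concrete, here is a toy sketch (not from the proposal): brute-force exact kNN over a handful of made-up 3-dimensional "embeddings". Real embeddings have hundreds or thousands of dimensions, and a real system would use an ANN index rather than this linear scan; the document names and vectors here are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, docs, k=1):
    """Exact k-nearest-neighbor search: scores the query against every
    document embedding (one similarity computation per document), which
    is the linear cost that ANN algorithms avoid."""
    scored = sorted(docs.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy "embeddings" -- in practice these come from a model like BERT
# or the OpenAI embeddings API.
docs = {
    "doc-cassandra": [0.9, 0.1, 0.0],
    "doc-cooking":   [0.0, 0.2, 0.9],
    "doc-databases": [0.8, 0.3, 0.1],
}
query = [0.85, 0.2, 0.05]           # embedding of the user's question
print(exact_knn(query, docs, k=2))  # ['doc-cassandra', 'doc-databases']
```

The two documents returned are the ones semantically closest to the query vector, which is exactly the "most relevant context" question described above; ANN answers the same question approximately, without touching every document.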
Adding vector search to Cassandra's unique strengths of scale, reliability, and low latency will further enhance its appeal and effectiveness for these users, while also making it more attractive to newcomers looking to harness AI's potential. The faster we deliver vector search, the more valuable it will be for this expanding user base.

Requirements

1. Perform vector search as outlined in the Pinecone example above
   1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
      1. This is also useful for "classic" ML applications that derive and serve their own feature vectors
   2. Add ANN (approximate nearest neighbor) search
2. Work with normal Cassandra data flow
   1. Inserting one row at a time is fine; cannot require batch ingest
   2. Updating and deleting rows is also fine
3. Must compose with other SAI predicates as well as partition keys

Not requirements

1. Other datatypes besides Float32
   1. Pinecone supports only Float32, and it's hugely ahead in mindshare, so let's make things easy on ourselves and follow their precedent.
2. I don't want to scope creep beyond ANN. In particular, I don't want to wait for ORDER BY to get exact search in as well.

How we can do this

There is exactly one production-quality, actively maintained, state-of-the-art ANN library for Java: Lucene's HNSW. Elastic has been GA with ANN search backed by this library for just over a year; Solr added it not long after. As usual with Lucene code, it is not super concerned with being usable by third parties like Cassandra, but HNSW offers public APIs that are low-level enough that we don't have to bring in Lucene baggage that we don't want.

The SAI framework is flexible enough to allow plugging in an HNSW vector index while enjoying the baseline benefits of SAI and playing nice with the other SAI index types.

What I have today

I have a pre-alpha branch that demonstrates:

1. A new data type: DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
2. A new SAI index, VectorMemtableIndex
3. A new CQL operator, ANN

So this works:

cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> use test;
cqlsh:test> create table foo(i int primary key, j dense float32);
cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;

 i | j
---+---------------------------------------------------------
 5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

For anything else you can imagine asking "does it do X?", the answer is no, it doesn't. It doesn't save to disk, it doesn't compose with other predicates, and it doesn't have driver support. That last is why the answer comes back as raw hex bytes. (The CQL parser handling the inserts is server-side and doesn't need driver support, which is why we can do inserts and queries at all.)

If you're tempted to try this out, remember that "it doesn't save to disk" means that if you do inserts first and then create the index, instead of the other way around, it will break, because we flush existing data as part of adding a new index to a table. So it's quite fragile at the moment!

The branch is public at https://github.com/datastax/cassandra/tree/vsearch <https://github.com/jbellis/cassandra/tree/vsearch>. I based it on the DataStax converged Cassandra repo because that's the only place to get fully working SAI for now. We plan to upstream vector search with the rest of SAI as soon as it's ready.

What's next

I think we can get to alpha pretty quickly:
1. Add support for flushing, and for serving queries from disk without reading the whole graph into memory. (Lucene does this already; we will probably have to rewrite a bunch of the code for use in Cassandra, but the design is vetted and works.)
2. Make sure that scatter/gather works across nodes. This should Just Work, but I've already needed to tweak a couple of things that I thought would just work, so I'm calling this out.

And then to be feature complete we would want to add:

1. CQL support for specifying the dimensionality of a column up front, e.g. DENSE FLOAT32(1500)
   1. Pinecone accepts the first vector as the Correct dimensions and throws an error if you give it one that doesn't match, but this doesn't work in a distributed system, where the second vector might go to a different node than the first. Elastic requires specifying it up front, like I propose here.
2. Driver support
3. Support for updating the vector in a row that hasn't been flushed yet, either by updating HNSW to support deletes, or by adding some kind of invalidation marker to the overwritten vector
4. More performant inserts. Currently I have a big synchronized lock around the HNSW graph, so we either need to shard it, like we do for the other SAI indexes, or add fine-grained locking like ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I prefer the second option, since it allows us to avoid building the graph once in memory and then a second time on flush, but it might be more work than it appears.
5. Add index configurability options: similarity function and HNSW parameters M and ef
6. Expose the similarity functions to CQL so you can SELECT x, cosine_similarity(x, query_vector) FROM ...

Special thanks to

Mike Adamson and Zhao Yang made substantial contributions to the branch you see here and provided indispensable help understanding SAI.

<Vector search for Cassandra.pdf>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced