This work sounds interesting. I would recommend decoupling the types from the
ANN support: the types require client changes and can go in now (which would
give a lot of breathing room to get this ready for 5.0), whereas ANN depends
on SAI, which is still being worked on.
> On Apr 22, 2023, at 1:02 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> If you're interested in playing with HNSW outside of a super alpha
> Cassandra branch, I put up a repo with some sample code here:
> https://github.com/jbellis/hnswdemo
>
> On Fri, Apr 21, 2023 at 4:19 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Happy Friday, everyone!
>>
>> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>>
>> I propose adding approximate nearest neighbor (ANN) vector search
>> capability to Apache Cassandra via storage-attached indexes (SAI). This
>> is a medium-sized effort that will significantly enhance Cassandra's
>> functionality, particularly for AI use cases. This addition will not only
>> provide a new and important feature for existing Cassandra users, but
>> also attract new users to the community from the AI space, further
>> expanding Cassandra's reach and relevance.
>>
>> Introduction
>>
>> Vector search is a powerful document search technique that enables
>> developers to quickly find relevant content within an extensive
>> collection of documents. It is useful as a standalone technique, but it
>> is particularly hot now because it significantly enhances the
>> performance of LLMs.
>>
>> Vector search uses ML models to match the semantics of a question rather
>> than just the words it contains, avoiding the classic false positives
>> and false negatives associated with term-based search. Alessandro
>> Benedetti gives some good examples in his excellent talk
>> <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>>
>> You can search across any set of vectors, which are just ordered sets of
>> numbers. In the context of natural language queries and document search,
>> we are specifically concerned with a type of vector called an embedding.
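[Editor's note: a minimal sketch of what "similarity search over vectors"
means in practice. The toy 3-dimensional vectors and document names below are
invented for illustration; a real embedding model would produce hundreds or
thousands of dimensions. Cosine similarity is one of the commonly used
similarity functions.]

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means "same direction"
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; real ones come from a model (OpenAI's API, BERT, ...).
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

# The most relevant document is the one whose embedding is most similar.
best = max(docs, key=lambda d: cosine_similarity(docs[d], query))
print(best)  # doc_a: its vector points almost the same way as the query
```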
>>
>> An embedding is a high-dimensional vector that captures the underlying
>> semantic relationships and contextual information of words or phrases.
>> Embeddings are generated by ML models trained for this purpose; OpenAI
>> provides an API to do this, but open-source and self-hostable models
>> like BERT are also popular. Creating smaller and more accurate
>> embeddings is an active research area in ML.
>>
>> Large language models (LLMs) can be described as a mile wide and an inch
>> deep. They are not experts on any narrow domain (although they will
>> hallucinate that they are, sometimes convincingly). You can remedy this
>> by giving the LLM additional context for your query, but the context
>> window is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you
>> want to be very selective about giving the LLM the most relevant
>> possible information.
>>
>> Vector search is red-hot now because it allows us to easily answer the
>> question "what are the most relevant documents to provide as context?"
>> by performing a similarity search between the embedding vector of the
>> query and those of your document universe. Doing exact search is
>> prohibitively expensive, since you necessarily have to compare with each
>> and every document; this is intractable when you have billions or
>> trillions of docs. However, there are well-understood algorithms for
>> turning this into a logarithmic problem if you are willing to accept
>> approximately the most similar documents. This is the "approximate
>> nearest neighbor" problem. (You will see these referred to as kNN – k
>> nearest neighbors – or ANN.)
>>
>> Pinecone DB has a good example of what this looks like in Python code
>> <https://docs.pinecone.io/docs/gen-qa-openai>.
>>
>> Vector search is the foundation underlying effectively all of the AI
>> applications that are launching now.
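[Editor's note: the "exact search is prohibitively expensive" point above can
be made concrete with a brute-force sketch: each query does one distance
computation per stored vector, so cost grows linearly with the corpus. That
linear scan is exactly what ANN indexes avoid. All names here are
illustrative, not from the proposal's branch.]

```python
import heapq
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(corpus, query, k):
    # Brute force: compares the query against every stored vector.
    # Fine for thousands of docs, intractable for billions.
    return heapq.nsmallest(k, corpus, key=lambda v: euclidean(v, query))

random.seed(42)
corpus = [[random.random() for _ in range(8)] for _ in range(1000)]

# Querying with a vector that is already stored: it must rank first.
query = corpus[17]
top3 = exact_knn(corpus, query, 3)
assert top3[0] == query
```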
>> This is particularly relevant to Apache Cassandra users, who tend to
>> manage the types of large datasets that benefit the most from fast
>> similarity search. Adding vector search to Cassandra's unique strengths
>> of scale, reliability, and low latency will further enhance its appeal
>> and effectiveness for these users while also making it more attractive
>> to newcomers looking to harness AI's potential. The faster we deliver
>> vector search, the more valuable it will be for this expanding user base.
>>
>> Requirements
>>
>> - Perform vector search as outlined in the Pinecone example above
>> - Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>>   - This is also useful for "classic" ML applications that derive and
>>     serve their own feature vectors
>> - Add ANN (approximate nearest neighbor) search
>> - Work with normal Cassandra data flow
>>   - Inserting one row at a time is fine; cannot require batch ingest
>>   - Updating and deleting rows is also fine
>> - Must compose with other SAI predicates as well as partition keys
>>
>> Not requirements
>>
>> - Other datatypes besides Float32
>>   - Pinecone supports only Float32 and it's hugely ahead in mindshare,
>>     so let's make things easy on ourselves and follow their precedent.
>> - I don't want to scope creep beyond ANN. In particular, I don't want to
>>   wait for ORDER BY to get exact search in as well.
>>
>> How we can do this
>>
>> There is exactly one production-quality, actively maintained,
>> state-of-the-art ANN library for Java: Lucene's HNSW. Elastic has been
>> GA with ANN search backed by this library for just over a year; Solr
>> added it not long after. As usual with Lucene code, it is not super
>> concerned with making itself usable by third parties like Cassandra, but
>> HNSW offers public APIs that are low-level enough that we don't have to
>> bring in Lucene baggage that we don't want.
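[Editor's note: Lucene's HNSW implementation is Java and far more
sophisticated, but the core query-time idea of graph-based ANN can be
sketched in a few lines: navigate a proximity graph greedily toward the
query instead of scanning every vector. This toy builds a single-layer graph
by brute force (real HNSW builds a multi-layer hierarchy incrementally); all
names are invented for illustration.]

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, m=4):
    # Toy navigable graph: link each vector to its m nearest others.
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: dist(v, vectors[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(vectors, graph, query, entry=0):
    # Core graph-ANN idea: hop to whichever neighbor is closer to the
    # query, stopping at a local minimum. Each hop strictly improves,
    # so this terminates without visiting the whole dataset.
    current = entry
    while True:
        nearest = min(graph[current], key=lambda j: dist(vectors[j], query))
        if dist(vectors[nearest], query) >= dist(vectors[current], query):
            return current
        current = nearest

random.seed(0)
vectors = [[random.random() for _ in range(4)] for _ in range(200)]
graph = build_graph(vectors)
query = [0.5, 0.5, 0.5, 0.5]
approx = greedy_search(vectors, graph, query)
exact = min(range(len(vectors)), key=lambda i: dist(vectors[i], query))
# approx is usually exact or close, but greedy descent can stop at a
# local minimum -- that is the "approximate" in ANN.
```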
>>
>> The SAI framework is flexible enough to allow plugging in an HNSW vector
>> index while enjoying the baseline benefits of SAI and playing nice with
>> the other SAI index types.
>>
>> What I have today
>>
>> I have a pre-alpha branch that demonstrates:
>>
>> - A new data type, DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
>> - A new SAI index, VectorMemtableIndex
>> - A new CQL operator, ANN
>>
>> So this works:
>>
>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': 1};
>> cqlsh> use test;
>>
>> cqlsh:test> create table foo(i int primary key, j dense float32);
>> cqlsh:test> create custom index ann_index on foo(j) using
>> 'StorageAttachedIndex';
>>
>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>
>>  i | j
>> ---+---------------------------------------------------------
>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>
>> For anything else you can imagine asking "does it do X?", the answer is
>> no, it doesn't. It doesn't save to disk, it doesn't compose with other
>> predicates, and it doesn't have driver support. That last is why the
>> answer comes back as raw hex bytes. The CQL parser handling the inserts
>> is server-side and doesn't need driver support, which is why we can do
>> inserts and queries at all.
>>
>> (If you're tempted to try this out, remember that "it doesn't save to
>> disk" means that if you do inserts first and then create the index,
>> instead of the other way around, it will break, because we flush
>> existing data as part of adding a new index to a table. So it's quite
>> fragile at the moment!)
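[Editor's note: the raw bytes in the result above can be decoded by hand.
Assuming the layout is a big-endian int32 element count followed by the raw
big-endian float32 values (an assumption, but it matches the 16 bytes shown),
the stored vector [23, 18, 3.9] round-trips like this:]

```python
import struct

raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

# Assumed layout: 4-byte big-endian count, then `count` big-endian float32s.
(count,) = struct.unpack_from('>i', raw)
values = struct.unpack_from('>%df' % count, raw, 4)
print(count, values)  # count=3; values approximately (23.0, 18.0, 3.9)
```

(3.9 comes back as 3.9000000953674316, the usual float32 rounding of 3.9.)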
>>
>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch
>>
>> I based it on the DataStax converged cassandra repo because that's the
>> only place to get fully working SAI for now. We plan to upstream vector
>> search with the rest of SAI as soon as it's ready.
>>
>> What's next
>>
>> I think we can get to alpha pretty quickly:
>>
>> - Add support for flushing, and serving queries from disk without
>>   reading the whole graph into memory. (Lucene does this already; we
>>   will probably have to rewrite a bunch of the code for use in
>>   Cassandra, but the design is vetted and works.)
>> - Make sure that scatter/gather works across nodes. This should Just
>>   Work, but I've already needed to tweak a couple of things that I
>>   thought would just work, so I'm calling this out.
>>
>> And then to be feature complete we would want to add:
>>
>> - CQL support for specifying the dimensionality of a column up front,
>>   e.g. DENSE FLOAT32(1500). Pinecone accepts the first vector as the
>>   correct dimensions and throws an error if you give it one that doesn't
>>   match, but this doesn't work in a distributed system where the second
>>   vector might go to a different node than the first. Elastic requires
>>   specifying it up front like I propose here.
>> - Driver support
>> - Support for updating the vector in a row that hasn't been flushed yet,
>>   either by updating HNSW to support deletes, or by adding some kind of
>>   invalidation marker to the overwritten vector.
>> - More performant inserts. Currently I have a big synchronized lock
>>   around the HNSW graph. So we either need to shard it, like we do for
>>   the other SAI indexes, or add fine-grained locking like
>>   ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I
>>   prefer the second option, since it allows us to avoid building the
>>   graph once in memory and then a second time on flush, but it might be
>>   more work than it appears.
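[Editor's note: of the two concurrency options, sharding is the simpler to
picture. A minimal sketch of the idea, with invented class names and plain
lists standing in for per-shard graphs: writers contend only within one
shard, while queries scatter across all shards and merge the results.]

```python
import threading

class ShardedVectorIndex:
    """Toy stand-in for a sharded in-memory vector index."""

    def __init__(self, num_shards=4):
        self.shards = [[] for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def insert(self, key, vector):
        shard = hash(key) % len(self.shards)
        with self.locks[shard]:  # blocks one shard's writers, not all of them
            self.shards[shard].append((key, vector))

    def search(self, query, k, dist):
        # Scatter: collect candidates from every shard; gather: merge top-k.
        candidates = []
        for shard, lock in zip(self.shards, self.locks):
            with lock:
                candidates.extend(shard)
        return sorted(candidates, key=lambda kv: dist(kv[1], query))[:k]

idx = ShardedVectorIndex(num_shards=4)
for key, vec in [(1, [0.0, 0.0]), (2, [1.0, 1.0]), (3, [0.2, 0.9])]:
    idx.insert(key, vec)
sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
nearest = idx.search([0.1, 0.1], k=1, dist=sq)
print(nearest)  # key 1 is closest to the query
```

The fine-grained-locking alternative the author prefers has no equally short
sketch: it means making the graph builder itself safe for concurrent inserts,
as ConcurrentSkipListMap does for skip lists.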
>> - Add index configurability options: the similarity function and the
>>   HNSW parameters M and ef.
>> - Expose the similarity functions to CQL, so you can SELECT x,
>>   cosine_similarity(x, query_vector) FROM ...
>>
>> Special thanks to
>>
>> Mike Adamson and Zhao Yang made substantial contributions to the branch
>> you see here and provided indispensable help understanding SAI.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced