To be fair, Dinesh kind of primed that:

> Do you intend to make this part of CEP-7 or as an incremental update to SAI once it is committed?

;)
I think this body of work more than stands on its own. Great work Jonathan, Mike, and Zhao; having native support for more ML-oriented workloads in C* would be a big win for a bunch of our users, and it plays into our architectural strengths in a lot of ways too.

On Tue, Apr 25, 2023, at 7:35 PM, Henrik Ingo wrote:
> Jonathan, what a great proposal/code. An enjoyable read, and at least for me educational! (Which is notable, as you're on my turf; I'm a Data Science major.)
>
> Sorry for splitting hairs, but CEP-7 (as a spec and wiki page) is approved and voted on, and I assume there's no proposal to change that. That said, work of course continues beyond CEP-7, and this is not the only SAI feature that adds on top of the CEP-7 foundation.
>
> I just wanted to clarify so there's no confusion later.
>
> henrik
>
> On Sat, Apr 22, 2023 at 10:41 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>> My guess is that I will be able to get this ready to upstream before the rest of CEP-7 goes in, so it would make sense to me to roll it into that.
>>
>> On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:
>>> Interesting proposal, Jonathan. Will grok it over the weekend and play around with the branch.
>>>
>>> Do you intend to make this part of CEP-7 or as an incremental update to SAI once it is committed?
>>>
>>>> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>
>>>> Happy Friday, everyone!
>>>>
>>>> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>>>>
>>>> I propose adding approximate nearest neighbor (ANN) vector search capability to Apache Cassandra via storage-attached indexes (SAI). This is a medium-sized effort that will significantly enhance Cassandra's functionality, particularly for AI use cases.
>>>> This addition will not only provide a new and important feature for existing Cassandra users, but also attract new users to the community from the AI space, further expanding Cassandra's reach and relevance.
>>>>
>>>> Introduction
>>>>
>>>> Vector search is a powerful document search technique that enables developers to quickly find relevant content within an extensive collection of documents. It is useful as a standalone technique, but it is particularly hot now because it significantly enhances the performance of LLMs.
>>>>
>>>> Vector search uses ML models to match the semantics of a question rather than just the words it contains, avoiding the classic false positives and false negatives associated with term-based search. Alessandro Benedetti gives some good examples in his excellent talk <https://www.youtube.com/watch?v=z-i8mOHAhlU>:
>>>>
>>>> <image.png>
>>>>
>>>> <image.png>
>>>>
>>>> You can search across any set of vectors, which are just ordered sets of numbers. In the context of natural language queries and document search, we are specifically concerned with a type of vector called an *embedding*.
>>>>
>>>> An embedding is a high-dimensional vector that captures the underlying semantic relationships and contextual information of words or phrases. Embeddings are generated by ML models trained for this purpose; OpenAI provides an API to do this, but open-source and self-hostable models like BERT are also popular. Creating more accurate and smaller embeddings is an active research area in ML.
>>>>
>>>> Large language models (LLMs) can be described as a mile wide and an inch deep. They are not experts on any narrow domain (although they will hallucinate that they are, sometimes convincingly).
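As an aside, the similarity search over embeddings described above can be sketched in a few lines of Python. Everything here is illustrative, not from the proposal: the toy 3-dimensional vectors stand in for real embeddings (which have hundreds or thousands of dimensions), and `cosine_similarity` is a hand-rolled helper, not a library call.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" of a hypothetical document universe.
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [0.6, 0.8, 0.0]  # embedding of the user's question

# Brute-force (exact) nearest neighbor: compare against every document.
# This linear scan is exactly what ANN indexes like HNSW avoid.
best = max(documents, key=lambda name: cosine_similarity(documents[name], query))
print(best)  # doc_c
```

Exact search like this is fine for a handful of vectors but, as the proposal notes, intractable at billions of documents; that is where approximate indexes earn their keep.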
You can remedy this by giving the LLM additional context for your query, but the context window is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very selective about giving the LLM the most relevant possible information.
>>>>
>>>> Vector search is red-hot now because it allows us to easily answer the question "what are the most relevant documents to provide as context?" by performing a similarity search between the embedding vector of the query and those of your document universe. Doing exact search is prohibitively expensive, since you necessarily have to compare with each and every document; this is intractable when you have billions or trillions of docs. However, there are well-understood algorithms for turning this into a logarithmic problem if you are willing to accept *approximately* the most similar documents. This is the "approximate nearest neighbor" problem. (You will see these referred to as kNN, for k nearest neighbors, or ANN.)
>>>>
>>>> Pinecone DB has a good example of what this looks like in Python code <https://docs.pinecone.io/docs/gen-qa-openai>.
>>>>
>>>> Vector search is the foundation underlying effectively all of the AI applications that are launching now. This is particularly relevant to Apache Cassandra users, who tend to manage the types of large datasets that benefit the most from fast similarity search. Adding vector search to Cassandra's unique strengths of scale, reliability, and low latency will further enhance its appeal and effectiveness for these users, while also making it more attractive to newcomers looking to harness AI's potential. The faster we deliver vector search, the more valuable it will be for this expanding user base.
>>>>
>>>> Requirements
>>>>
>>>> 1.
Perform vector search as outlined in the Pinecone example above
>>>>    1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>>>>       1. This is also useful for "classic" ML applications that derive and serve their own feature vectors
>>>>    2. Add ANN (approximate nearest neighbor) search
>>>> 2. Work with normal Cassandra data flow
>>>>    1. Inserting one row at a time is fine; cannot require batch ingest
>>>>    2. Updating and deleting rows is also fine
>>>> 3. Must compose with other SAI predicates as well as partition keys
>>>>
>>>> Not requirements
>>>>
>>>> 1. Other datatypes besides Float32
>>>>    1. Pinecone supports only Float32, and it's hugely ahead in mindshare, so let's make things easy on ourselves and follow their precedent.
>>>> 2. I don't want to scope creep beyond ANN. In particular, I don't want to wait for ORDER BY to get exact search in as well.
>>>>
>>>> How we can do this
>>>>
>>>> There is exactly one production-quality, actively maintained, state-of-the-art ANN library for Java: Lucene's HNSW. Elastic has been GA with ANN search backed by this library for just over a year; Solr added it not long after. As usual with Lucene code, it is not super concerned with being usable by third parties like Cassandra, but HNSW offers public APIs that are low-level enough that we don't have to bring in Lucene baggage that we don't want.
>>>>
>>>> The SAI framework is flexible enough to allow plugging in an HNSW vector index while enjoying the baseline benefits of SAI and playing nice with the other SAI index types.
>>>>
>>>> What I have today
>>>>
>>>> I have a pre-alpha branch that demonstrates:
>>>> 1. A new data type: DENSE FLOAT32 (in CQL), DenseFloat32Type (internally)
>>>> 2. A new SAI index, VectorMemtableIndex
>>>> 3.
A new CQL operator, ANN
>>>>
>>>> So this works:
>>>>
>>>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
>>>> cqlsh> use test;
>>>>
>>>> cqlsh:test> create table foo(i int primary key, j dense float32);
>>>> cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
>>>>
>>>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>>>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>>>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>>>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>>>
>>>>  i | j
>>>> ---+---------------------------------------------------------
>>>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>>>
>>>> For anything else you can imagine asking "does it do X?" about, the answer is no, it doesn't. It doesn't save to disk, it doesn't compose with other predicates, and it doesn't have driver support. That last is why the answer comes back as raw hex bytes. The CQL parser handling the inserts is server-side and doesn't need driver support, which is why we can do inserts and queries at all.
>>>>
>>>> (If you're tempted to try this out, remember that "it doesn't save to disk" means that if you do inserts first and then create the index, instead of the other way around, it will break, because we flush existing data as part of adding a new index to a table. So it's quite fragile at the moment!)
>>>>
>>>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch <https://github.com/jbellis/cassandra/tree/vsearch>
>>>>
>>>> I based it on the DataStax converged Cassandra repo because that's the only place to get fully working SAI for now. We plan to upstream vector search with the rest of SAI as soon as it's ready.
>>>>
>>>> What's next
>>>>
>>>> I think we can get to alpha pretty quickly:
>>>> 1.
Add support for flushing and serving queries from disk without reading the whole graph into memory. (Lucene does this already; we will probably have to rewrite a bunch of the code for use in Cassandra, but the design is vetted and works.)
>>>> 2. Make sure that scatter/gather works across nodes. This should Just Work, but I've already needed to tweak a couple of things that I thought would just work, so I'm calling this out.
>>>>
>>>> And then to be feature complete we would want to add:
>>>> 1. CQL support for specifying the dimensionality of a column up front, e.g. DENSE FLOAT32(1500).
>>>>    1. Pinecone accepts the first vector as having the Correct dimensions and throws an error if you give it one that doesn't match, but this doesn't work in a distributed system, where the second vector might go to a different node than the first. Elastic requires specifying it up front like I propose here.
>>>> 2. Driver support
>>>> 3. Support for updating the vector in a row that hasn't been flushed yet, either by updating HNSW to support deletes, or by adding some kind of invalidation marker to the overwritten vector.
>>>> 4. More performant inserts. Currently I have a big synchronized lock around the HNSW graph, so we either need to shard it, like we do for the other SAI indexes, or add fine-grained locking like ConcurrentSkipListMap's to make HnswGraphBuilder concurrent-ish. I prefer the second option, since it allows us to avoid building the graph once in memory and then a second time on flush, but it might be more work than it appears.
>>>> 5. Add index configurability options: similarity function and HNSW parameters M and ef.
>>>> 6.
Expose the similarity functions to CQL so you can SELECT x, cosine_similarity(x, query_vector) FROM …
>>>>
>>>> Special thanks to
>>>>
>>>> Mike Adamson and Zhao Yang, who made substantial contributions to the branch you see here and provided indispensable help understanding SAI.
>>>>
>>>> <Vector search for Cassandra.pdf>
>>>
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>
> --
> Henrik Ingo
> c. +358 40 569 7354
> w. www.datastax.com