To be fair Dinesh kind of primed that:

> Do you intend to make this part of CEP-7 or as an incremental update to SAI 
> once it is committed?
;)

I think this body of work more than stands on its own. Great work Jonathan, 
Mike, and Zhao; having native support for more ML-oriented workloads in C* 
would be a big win for a bunch of our users and plays into our architectural 
strengths in a lot of ways too.

On Tue, Apr 25, 2023, at 7:35 PM, Henrik Ingo wrote:
> Jonathan what a great proposal/code. An enjoyable read. And at least for me 
> educational! (Which is notable, as you're on my turf, I'm a Data Science 
> major.)
> 
> Sorry for splitting hairs but CEP-7 (as a spec, and wiki page) is approved 
> and voted on and I assume there's no proposal to change that. That said, work 
> of course continues beyond CEP-7 and this is not the only SAI feature that 
> adds on top of the CEP-7 foundation.
> 
> I just wanted to clarify so there's no confusion later.
> 
> henrik
> 
> On Sat, Apr 22, 2023 at 10:41 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>> My guess is that I will be able to get this ready to upstream before the 
>> rest of CEP-7 goes in, so it would make sense to me to roll it into that.
>> 
>> On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:
>>> Interesting proposal Jonathan. Will grok it over the weekend and play 
>>> around with the branch.
>>> 
>>> Do you intend to make this part of CEP-7 or as an incremental update to SAI 
>>> once it is committed?
>>> 
>>>> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>> 
>>>> Happy Friday, everyone!
>>>> 
>>>> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>>>> 
>>>> I propose adding approximate nearest neighbor (ANN) vector search 
>>>> capability to Apache Cassandra via storage-attached indexes (SAI). This is 
>>>> a medium-sized effort that will significantly enhance Cassandra’s 
>>>> functionality, particularly for AI use cases. This addition will not only 
>>>> provide a new and important feature for existing Cassandra users, but also 
>>>> attract new users to the community from the AI space, further expanding 
>>>> Cassandra’s reach and relevance.
>>>> Introduction
>>>> 
>>>> Vector search is a powerful document search technique that enables 
>>>> developers to quickly find relevant content within an extensive collection 
>>>> of documents. It is useful on its own, but it is particularly hot now 
>>>> because it significantly enhances the performance of LLMs.
>>>> 
>>>> Vector search uses ML models to match the semantics of a question rather 
>>>> than just the words it contains, avoiding the classic false positives and 
>>>> false negatives associated with term-based search.  Alessandro Benedetti 
>>>> gives some good examples in his _excellent talk_ 
>>>> <https://www.youtube.com/watch?v=z-i8mOHAhlU>:
>>>> 
>>>> You can search across any set of vectors, which are just ordered sets of 
>>>> numbers.  In the context of natural language queries and document search, 
>>>> we are specifically concerned with a type of vector called an *embedding*. 
>>>>  
>>>> 
>>>> An embedding is a high-dimensional vector that captures the underlying 
>>>> semantic relationships and contextual information of words or phrases. 
>>>> Embeddings are generated by ML models trained for this purpose; OpenAI 
>>>> provides an API to do this, but open-source and self-hostable models like 
>>>> BERT are also popular. Creating more accurate and smaller embeddings is an 
>>>> active research area in ML.
>>>> 
>>>> Large language models (LLMs) can be described as a mile wide and an inch 
>>>> deep. They are not experts on any narrow domain (although they will 
>>>> hallucinate that they are, sometimes convincingly).  You can remedy this 
>>>> by giving the LLM additional context for your query, but the context 
>>>> window is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want 
>>>> to be very selective about giving the LLM the most relevant possible 
>>>> information.
>>>> 
>>>> Vector search is red-hot now because it allows us to easily answer the 
>>>> question “what are the most relevant documents to provide as context” by 
>>>> performing a similarity search between the embeddings vector of the query, 
>>>> and those of your document universe.  Doing exact search is prohibitively 
>>>> expensive, since you necessarily have to compare with each and every 
>>>> document; this is intractable when you have billions or trillions of docs. 
>>>>  However, there are well-understood algorithms for turning this into a 
>>>> logarithmic problem if you are willing to accept *approximately* the most 
>>>> similar documents.  This is the “approximate nearest neighbor” problem.  
>>>> (You will see these referred to as kNN – k nearest neighbors – or ANN.)
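>>>> To make the cost of exact search concrete, here is a minimal brute-force 
>>>> kNN sketch in Python (the toy vectors and helper names are mine, purely 
>>>> for illustration):

```python
import math

def cosine_similarity(a, b):
    # Similarity of two vectors: 1.0 means pointing in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, docs, k):
    # Exact k-nearest-neighbor search: scores *every* document.
    # This is the O(n) scan that becomes intractable at billions of
    # documents; ANN indexes like HNSW trade a little accuracy to
    # answer the same question in roughly logarithmic time.
    scored = sorted(docs.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
}
print(exact_knn([1.0, 0.1], docs, k=2))  # → ['a', 'c']
```

>>>> ANN search answers the same question without touching every document, at 
>>>> the cost of occasionally missing a true nearest neighbor.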
>>>> 
>>>> _Pinecone DB has a good example of what this looks like in Python code_ 
>>>> <https://docs.pinecone.io/docs/gen-qa-openai>.
>>>> 
>>>> Vector search is the foundation underlying effectively all of the AI 
>>>> applications that are launching now.  This is particularly relevant to 
>>>> Apache Cassandra users, who tend to manage the types of large datasets 
>>>> that benefit the most from fast similarity search. Adding vector search to 
>>>> Cassandra’s unique strengths of scale, reliability, and low latency will 
>>>> further enhance its appeal and effectiveness for these users while also 
>>>> making it more attractive to newcomers looking to harness AI’s potential.  
>>>> The faster we deliver vector search, the more valuable it will be for this 
>>>> expanding user base.
>>>> Requirements
>>>> 
>>>>  1. Perform vector search as outlined in the Pinecone example above
>>>>    1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL 
>>>> type
>>>>      1. This is also useful for “classic” ML applications that derive and 
>>>> serve their own feature vectors
>>>>    2. Add ANN (approximate nearest neighbor) search.
>>>>  2. Work with normal Cassandra data flow
>>>>    1. Inserting one row at a time is fine; cannot require batch ingest
>>>>    2. Updating, deleting rows is also fine
>>>>    3. Must compose with other SAI predicates as well as partition keys
>>>> Not requirements
>>>> 
>>>>  1. Other datatypes besides Float32
>>>>    1. *Pinecone supports only Float32 and it’s hugely ahead in mindshare 
>>>> so let’s make things easy on ourselves and follow their precedent.*
>>>>  2. I don’t want to scope creep beyond ANN. In particular, I don’t want to 
>>>> wait for ORDER BY to get exact search in as well.
>>>> How we can do this
>>>> 
>>>> There is exactly one production quality, actively maintained, 
>>>> state-of-the-art ANN library for Java: Lucene’s HNSW.  Elastic has been GA 
>>>> with ANN search backed by this library for just over a year; Solr added it 
>>>> not long after.  As usual with Lucene code, it is not super concerned with 
>>>> making it usable by third parties like Cassandra, but HNSW offers public 
>>>> APIs that are low level enough that we don’t have to bring in Lucene 
>>>> baggage that we don’t want.
>>>> 
>>>> The SAI framework is flexible enough to allow plugging in an HNSW vector 
>>>> index while enjoying the baseline benefits of SAI and playing nice with 
>>>> the other SAI index types.
>>>> What I have today
>>>> 
>>>> I have a pre-alpha branch that demonstrates
>>>>  1. A new data type: DENSE FLOAT32 in CQL, DenseFloat32Type internally.
>>>>  2. A new SAI index, VectorMemtableIndex.
>>>>  3. A new CQL operator, ANN.
>>>> 
>>>> So this works:
>>>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
>>>> 'replication_factor': 1};
>>>> cqlsh> use test;
>>>> 
>>>> cqlsh:test> create table foo(i int primary key, j dense float32);
>>>> cqlsh:test> create custom index ann_index on foo(j) using 
>>>> 'StorageAttachedIndex';
>>>> 
>>>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>>>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>>>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>>>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>>> 
>>>>  i | j
>>>> ---+---------------------------------------------------------
>>>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>>> 
>>>> For anything else you can imagine asking "does it do X," the answer is no, 
>>>> it doesn't. It doesn't save to disk, it doesn't compose with other 
>>>> predicates, and it doesn't have driver support. That last is why the answer 
>>>> comes back as raw hex bytes. (The CQL parser handling the inserts is 
>>>> server-side and doesn't need driver support, which is why we can do inserts 
>>>> and queries at all.)
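>>>> For the curious, the raw bytes above decode cleanly if you assume the 
>>>> serialization is a 4-byte big-endian element count followed by packed 
>>>> big-endian float32s. That is just my reading of the bytes shown, not a 
>>>> documented format:

```python
import struct

# Raw value returned for row 5 in the cqlsh session above.
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

# Assumed layout: int32 element count, then that many packed float32s,
# all big-endian.
count = struct.unpack_from('>i', raw)[0]
values = struct.unpack_from('>%df' % count, raw, 4)
print(count, [round(v, 1) for v in values])  # → 3 [23.0, 18.0, 3.9]
```

>>>> which matches the vector inserted for row 5, (23, 18, 3.9).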
>>>> 
>>>> (If you’re tempted to try this out, remember that “it doesn’t save to 
>>>> disk” means that if you do inserts first and then create the index, 
>>>> instead of the other way around, it will break, because we flush existing 
>>>> data as part of adding a new index to a table.  So it’s quite fragile at 
>>>> the moment!)
>>>> 
>>>> The branch is public at 
>>>> _https://github.com/datastax/cassandra/tree/vsearch_
>>>> 
>>>> I based it on the DataStax converged cassandra repo because that’s the 
>>>> only place to get fully working SAI for now.  We plan to upstream vector 
>>>> search with the rest of SAI as soon as it’s ready.
>>>> What’s next
>>>> 
>>>> I think we can get to alpha pretty quickly:
>>>>  1. Add support for flushing, and serving queries from disk without 
>>>> reading the whole graph into memory (Lucene does this already, we will 
>>>> probably have to rewrite a bunch of the code for use in Cassandra, but the 
>>>> design is vetted and works).
>>>>  2. Make sure that scatter/gather works across nodes; this should Just 
>>>> Work, but I’ve already needed to tweak a couple things that I thought would 
>>>> just work, so I’m calling this out.
>>>> And then to be feature complete we would want to add:
>>>>  1. CQL support for specifying the dimensionality of a column up front, 
>>>> e.g. DENSE FLOAT32(1500).
>>>>    1. *Pinecone treats the first vector’s dimensionality as correct and 
>>>> throws an error if you give it one that doesn’t match, but this doesn’t 
>>>> work in a distributed system where the second vector might go to a 
>>>> different node than the first.  Elastic requires specifying it up front 
>>>> like I propose here.*
>>>>  2. Driver support
>>>>  3. Support for updating the vector in a row that hasn’t been flushed yet, 
>>>> either by updating HNSW to support deletes, or by adding some kind of 
>>>> invalidation marker to the overwritten vector.
>>>>  4. More performant inserts.  Currently I have a big synchronized lock 
>>>> around the HNSW graph.  So we either need to shard it, like we do for the 
>>>> other SAI indexes, or add fine-grained locking like ConcurrentSkipListMap 
>>>> to make HnswGraphBuilder concurrent-ish.  I prefer the second option, 
>>>> since it allows us to avoid building the graph once in memory and then a 
>>>> second time on flush, but it might be more work than it appears.
>>>>  5. Add index configurability options: similarity function and HNSW 
>>>> parameters M and ef.
>>>>  6. Expose the similarity functions to CQL so you can SELECT x, 
>>>> cosine_similarity(x, query_vector) FROM …
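>>>> To pull the feature-complete items together, a hypothetical future session 
>>>> might look like the following. None of this syntax is committed; the option 
>>>> names (similarity_function, M, ef) and the fixed-dimension declaration are 
>>>> illustrations of the items above, not a spec:

```
cqlsh> -- item 1: dimensionality declared up front
cqlsh> create table test.docs (id int primary key, embedding dense float32(1500));

cqlsh> -- item 5: hypothetical index options for similarity function and HNSW M/ef
cqlsh> create custom index docs_ann on test.docs(embedding)
   ...     using 'StorageAttachedIndex'
   ...     with options = {'similarity_function': 'cosine', 'M': '16', 'ef': '100'};

cqlsh> -- item 6: similarity exposed as a CQL function
cqlsh> select id, cosine_similarity(embedding, [0.1, 0.2, 0.3])
   ...     from test.docs where embedding ann [0.1, 0.2, 0.3] limit 3;
```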
>>>> Special thanks to
>>>> 
>>>> Mike Adamson and Zhao Yang made substantial contributions to the branch 
>>>> you see here and provided indispensable help understanding SAI.
>>>> 
>>>> <Vector search for Cassandra.pdf>
>>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
> 
> 
> --
> 
> 
> 
> 
> *Henrik Ingo*
> 
> *c*. +358 40 569 7354 
> 
> *w*. _www.datastax.com_
> 
> 
