This work sounds interesting. I would recommend decoupling the types from the 
ANN support: the types require client changes and can go in now (which would 
give a lot of breathing room to get this ready for 5.0), whereas ANN depends 
on SAI, which is still being worked on.

> On Apr 22, 2023, at 1:02 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> If you're interested in playing with HNSW outside of a super alpha Cassandra 
> branch, I put up a repo with some sample code here: 
> https://github.com/jbellis/hnswdemo
> 
> On Fri, Apr 21, 2023 at 4:19 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>> Happy Friday, everyone!
>> 
>> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>> 
>> I propose adding approximate nearest neighbor (ANN) vector search capability 
>> to Apache Cassandra via storage-attached indexes (SAI). This is a 
>> medium-sized effort that will significantly enhance Cassandra’s 
>> functionality, particularly for AI use cases. This addition will not only 
>> provide a new and important feature for existing Cassandra users, but also 
>> attract new users to the community from the AI space, further expanding 
>> Cassandra’s reach and relevance.
>> 
>> Introduction
>> Vector search is a powerful document search technique that enables 
>> developers to quickly find relevant content within an extensive collection 
>> of documents.  It is useful as a standalone technique, but it is 
>> particularly hot now because it significantly enhances the usefulness of 
>> LLMs.
>> 
>> Vector search uses ML models to match the semantics of a question rather 
>> than just the words it contains, avoiding the classic false positives and 
>> false negatives associated with term-based search.  Alessandro Benedetti 
>> gives some good examples in his excellent talk: 
>> https://www.youtube.com/watch?v=z-i8mOHAhlU
>> 
>> You can search across any set of vectors, which are just ordered lists of 
>> numbers.  In the context of natural language queries and document search, we 
>> are specifically concerned with a type of vector called an embedding.  
>> 
>> An embedding is a high-dimensional vector that captures the underlying 
>> semantic relationships and contextual information of words or phrases. 
>> Embeddings are generated by ML models trained for this purpose; OpenAI 
>> provides an API to do this, but open-source and self-hostable models like 
>> BERT are also popular. Creating more accurate and smaller embeddings is an 
>> active area of ML research.
>> 
>> Large language models (LLMs) can be described as a mile wide and an inch 
>> deep. They are not experts on any narrow domain (although they will 
>> hallucinate that they are, sometimes convincingly).  You can remedy this by 
>> giving the LLM additional context for your query, but the context window is 
>> small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very 
>> selective about giving the LLM the most relevant possible information.
>> 
>> Vector search is red-hot now because it allows us to easily answer the 
>> question “what are the most relevant documents to provide as context” by 
>> performing a similarity search between the embedding vector of the query 
>> and those of your document universe.  Doing an exact search is prohibitively 
>> expensive, since you necessarily have to compare with each and every 
>> document; this is intractable when you have billions or trillions of docs.  
>> However, there are well-understood algorithms for turning this into a 
>> logarithmic problem if you are willing to accept approximately the most 
>> similar documents.  This is the “approximate nearest neighbor” problem.  
>> (You will see these referred to as kNN – k nearest neighbors – or ANN.)
>> 
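>> For concreteness: a common similarity function (and the one this proposal 
>> later exposes to CQL) is cosine similarity between the query embedding q 
>> and a document embedding d:
>> 
>>   sim(q, d) = (q · d) / (‖q‖ ‖d‖)
>> 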
>> Pinecone DB has a good example of what this looks like in Python code 
>> <https://docs.pinecone.io/docs/gen-qa-openai>.
>> 
>> Vector search is the foundation underlying effectively all of the AI 
>> applications that are launching now.  This is particularly relevant to 
>> Apache Cassandra users, who tend to manage the types of large datasets that 
>> benefit the most from fast similarity search. Adding vector search to 
>> Cassandra’s unique strengths of scale, reliability, and low latency will 
>> further enhance its appeal and effectiveness for these users while also 
>> making it more attractive to newcomers looking to harness AI’s potential.  
>> The faster we deliver vector search, the more valuable it will be for this 
>> expanding user base.
>> 
>> Requirements
>> - Perform vector search as outlined in the Pinecone example above
>> - Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>>   - This is also useful for “classic” ML applications that derive and serve 
>>     their own feature vectors
>> - Add ANN (approximate nearest neighbor) search
>> - Work with the normal Cassandra data flow
>>   - Inserting one row at a time is fine; cannot require batch ingest
>>   - Updating and deleting rows is also fine
>> - Must compose with other SAI predicates as well as partition keys (see the 
>>   sketch after this list)
>> 
>> Not requirements
>> - Other datatypes besides Float32
>>   - Pinecone supports only Float32, and it’s hugely ahead in mindshare, so 
>>     let’s make things easy on ourselves and follow their precedent.
>> - Scope creep beyond ANN.  In particular, I don’t want to wait for ORDER BY 
>>   to get exact search in as well.
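>> As a sketch of that last requirement, this is the target shape of a query 
>> composing ANN (the operator demonstrated below) with a partition key and 
>> another SAI predicate.  The schema is hypothetical, and predicate 
>> composition does not work in the branch yet:
>> 
>> cqlsh:test> create table items(pk int, ck int, category text,
>>         ... embedding dense float32, primary key (pk, ck));
>> cqlsh:test> create custom index cat_index on items(category) using
>>         ... 'StorageAttachedIndex';
>> cqlsh:test> create custom index emb_index on items(embedding) using
>>         ... 'StorageAttachedIndex';
>> cqlsh:test> select * from items where pk = 1 and category = 'shoes'
>>         ... and embedding ann [1.5, 2.5, 3.5] limit 5;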
>> 
>> How we can do this
>> There is exactly one production-quality, actively maintained, 
>> state-of-the-art ANN library for Java: Lucene’s HNSW.  Elastic has been GA 
>> with ANN search backed by this library for just over a year; Solr added it 
>> not long after.  As usual with Lucene code, it is not super concerned with 
>> being usable by third parties like Cassandra, but HNSW offers public APIs 
>> that are low-level enough that we don’t have to bring in Lucene baggage 
>> that we don’t want.
>> 
>> The SAI framework is flexible enough to allow plugging in an HNSW vector 
>> index while enjoying the baseline benefits of SAI and playing nice with the 
>> other SAI index types.
>> 
>> What I have today
>> I have a pre-alpha branch that demonstrates:
>> - A new data type: DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
>> - A new SAI index: VectorMemtableIndex
>> - A new CQL operator: ANN
>> 
>> So this works:
>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
>> 'replication_factor': 1};
>> cqlsh> use test;
>> 
>> cqlsh:test> create table foo(i int primary key, j dense float32);
>> cqlsh:test> create custom index ann_index on foo(j) using 
>> 'StorageAttachedIndex';
>> 
>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>> 
>>  i | j
>> ---+---------------------------------------------------------
>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>> 
>> For anything else you can imagine asking "does it do X?", the answer is no, 
>> it doesn't.  It doesn't save to disk, it doesn't compose with other 
>> predicates, and it doesn't have driver support.  That last is why the answer 
>> comes back as raw hex bytes.  (The CQL parser handling the inserts is 
>> server-side and doesn't need driver support, which is why we can do inserts 
>> and queries at all.)
>> 
>> (If you’re tempted to try this out, remember that “it doesn’t save to disk” 
>> means that if you do inserts first and then create the index, instead of the 
>> other way around, it will break, because we flush existing data as part of 
>> adding a new index to a table.  So it’s quite fragile at the moment!)
>> 
>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch
>> I based it on the DataStax converged Cassandra repo because that’s the only 
>> place to get fully working SAI for now.  We plan to upstream vector search 
>> with the rest of SAI as soon as it’s ready.
>> 
>> What’s next
>> I think we can get to alpha pretty quickly:
>> - Add support for flushing, and for serving queries from disk without 
>>   reading the whole graph into memory.  (Lucene does this already; we will 
>>   probably have to rewrite a bunch of the code for use in Cassandra, but 
>>   the design is vetted and works.)
>> - Make sure that scatter/gather works across nodes.  This should Just Work, 
>>   but I’ve already needed to tweak a couple of things that I thought would 
>>   just work, so I’m calling it out.
>> 
>> And then to be feature complete we would want to add:
>> - CQL support for specifying the dimensionality of a column up front, e.g. 
>>   DENSE FLOAT32(1500), as sketched after this list.
>>   - Pinecone accepts the dimensionality of the first vector it sees as 
>>     correct and throws an error if you give it one that doesn’t match, but 
>>     this doesn’t work in a distributed system, where the second vector 
>>     might go to a different node than the first.  Elastic requires 
>>     specifying it up front, as I propose here.
>> - Driver support
>> - Support for updating the vector in a row that hasn’t been flushed yet, 
>>   either by updating HNSW to support deletes or by adding some kind of 
>>   invalidation marker to the overwritten vector
>> - More performant inserts.  Currently I have a big synchronized lock around 
>>   the HNSW graph, so we either need to shard it, like we do for the other 
>>   SAI indexes, or add fine-grained locking like ConcurrentSkipListMap’s to 
>>   make HnswGraphBuilder concurrent-ish.  I prefer the second option, since 
>>   it allows us to avoid building the graph once in memory and then a second 
>>   time on flush, but it might be more work than it appears.
>> - Add index configurability options: similarity function and the HNSW 
>>   parameters M and ef
>> - Expose the similarity functions to CQL so you can SELECT x, 
>>   cosine_similarity(x, query_vector) FROM … (also sketched after this list)
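>> 
>> To make the dimensionality and similarity-function items concrete, here is 
>> a rough sketch of the proposed (not yet implemented) CQL against a 
>> hypothetical table:
>> 
>> cqlsh:test> create table docs(id int primary key, embedding dense float32(3));
>> cqlsh:test> create custom index docs_ann on docs(embedding) using
>>         ... 'StorageAttachedIndex';
>> cqlsh:test> insert into docs (id, embedding) values (1, [0.1, 0.2, 0.3]);
>> cqlsh:test> select id, cosine_similarity(embedding, [0.1, 0.2, 0.3]) from docs
>>         ... where embedding ann [0.1, 0.2, 0.3] limit 3;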
>> 
>> Special thanks to
>> Mike Adamson and Zhao Yang made substantial contributions to the branch you 
>> see here and provided indispensable help understanding SAI.
>> 
> 
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
