Jonathan, what a great proposal/code. An enjoyable read, and at least for me
educational! (Which is notable, as you're on my turf; I'm a Data Science
major.)

Sorry for splitting hairs, but CEP-7 (as a spec and wiki page) is approved
and voted on, and I assume there's no proposal to change that. That said,
work of course continues beyond CEP-7, and this is not the only SAI feature
that builds on top of the CEP-7 foundation.

I just wanted to clarify so there's no confusion later.

henrik

On Sat, Apr 22, 2023 at 10:41 PM Jonathan Ellis <jbel...@gmail.com> wrote:

> My guess is that I will be able to get this ready to upstream before the
> rest of CEP-7 goes in, so it would make sense to me to roll it into that.
>
> On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:
>
>> Interesting proposal Jonathan. Will grok it over the weekend and play
>> around with the branch.
>>
>> Do you intend to make this part of CEP-7 or as an incremental update to
>> SAI once it is committed?
>>
>> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>>
>>
>> *Happy Friday, everyone! Rich text formatting ahead; I've attached a PDF
>> for those who prefer that.*
>>
>> *I propose adding approximate nearest neighbor (ANN) vector search
>> capability to Apache Cassandra via storage-attached indexes (SAI). This is
>> a medium-sized effort that will significantly enhance Cassandra’s
>> functionality, particularly for AI use cases. This addition will not only
>> provide a new and important feature for existing Cassandra users, but also
>> attract new users to the community from the AI space, further expanding
>> Cassandra’s reach and relevance.*
>>
>> Introduction
>>
>> Vector search is a powerful document search technique that enables
>> developers to quickly find relevant content within an extensive collection
>> of documents. It is useful as a standalone technique, but it is
>> particularly hot now because it significantly enhances the performance of
>> LLMs.
>>
>> Vector search uses ML models to match the semantics of a question rather
>> than just the words it contains, avoiding the classic false positives and
>> false negatives associated with term-based search. Alessandro Benedetti
>> gives some good examples in his excellent talk:
>> https://www.youtube.com/watch?v=z-i8mOHAhlU
>>
>> You can search across any set of vectors, which are just ordered sets of
>> numbers. In the context of natural language queries and document search,
>> we are specifically concerned with a type of vector called an embedding.
>> An embedding is a high-dimensional vector that captures the underlying
>> semantic relationships and contextual information of words or phrases.
>> Embeddings are generated by ML models trained for this purpose; OpenAI
>> provides an API to do this, but open-source and self-hostable models like
>> BERT are also popular. Creating more accurate and smaller embeddings is an
>> active research area in ML.
>>
>> Large language models (LLMs) can be described as a mile wide and an inch
>> deep. They are not experts on any narrow domain (although they will
>> hallucinate that they are, sometimes convincingly). You can remedy this by
>> giving the LLM additional context for your query, but the context window
>> is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be
>> very selective about giving the LLM the most relevant possible
>> information.
>>
>> Vector search is red-hot now because it allows us to easily answer the
>> question "what are the most relevant documents to provide as context?" by
>> performing a similarity search between the embedding vector of the query
>> and those of your document universe.
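The similarity search described above can be sketched in a few lines of Python. This is an illustrative brute-force (exact) version; the toy embeddings are made up for the example rather than produced by a real model:

```python
import numpy as np

def cosine_similarity(query, docs):
    """Cosine similarity between a query vector and each row of a matrix."""
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Toy "document universe": 4 documents with 3-dimensional embeddings.
# Real embeddings would come from a model and have hundreds of dimensions.
doc_embeddings = np.array([
    [8.0, 2.3, 58.0],
    [1.2, 3.4, 5.6],
    [23.0, 18.0, 3.9],
    [0.5, 0.5, 0.5],
])

query = np.array([3.4, 7.8, 9.1])

# Exact search: score every document, then take the top k.
scores = cosine_similarity(query, doc_embeddings)
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 best matches
```

This is exactly the comparison-with-every-document approach that the next paragraph explains becomes intractable at scale.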
>> Doing exact search is prohibitively expensive, since you necessarily have
>> to compare with each and every document; this is intractable when you have
>> billions or trillions of docs. However, there are well-understood
>> algorithms for turning this into a logarithmic problem if you are willing
>> to accept approximately the most similar documents. This is the
>> "approximate nearest neighbor" problem. (You will see these referred to
>> as kNN, for k nearest neighbors, or ANN.) Pinecone DB has a good example
>> of what this looks like in Python code:
>> https://docs.pinecone.io/docs/gen-qa-openai
>>
>> Vector search is the foundation underlying effectively all of the AI
>> applications that are launching now. This is particularly relevant to
>> Apache Cassandra users, who tend to manage the types of large datasets
>> that benefit the most from fast similarity search. Adding vector search
>> to Cassandra's unique strengths of scale, reliability, and low latency
>> will further enhance its appeal and effectiveness for these users while
>> also making it more attractive to newcomers looking to harness AI's
>> potential. The faster we deliver vector search, the more valuable it will
>> be for this expanding user base.
>>
>> Requirements
>>
>> 1. Perform vector search as outlined in the Pinecone example above
>>    1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL
>>       type
>>       1. This is also useful for "classic" ML applications that derive
>>          and serve their own feature vectors
>>    2. Add ANN (approximate nearest neighbor) search
>> 2. Work with normal Cassandra data flow
>>    1. Inserting one row at a time is fine; cannot require batch ingest
>>    2. Updating and deleting rows is also fine
>> 3. Must compose with other SAI predicates as well as partition keys
>>
>> Not requirements
>>
>> 1. Other datatypes besides Float32
>>    1. Pinecone supports only Float32, and it's hugely ahead in mindshare,
>>       so let's make things easy on ourselves and follow their precedent.
>> 2. I don't want to scope creep beyond ANN. In particular, I don't want to
>>    wait for ORDER BY to get exact search in as well.
>>
>> How we can do this
>>
>> There is exactly one production-quality, actively maintained,
>> state-of-the-art ANN library for Java: Lucene's HNSW. Elastic has been GA
>> with ANN search backed by this library for just over a year; Solr added
>> it not long after. As usual with Lucene code, it is not super concerned
>> with making it usable by third parties like Cassandra, but HNSW offers
>> public APIs that are low-level enough that we don't have to bring in
>> Lucene baggage that we don't want.
>>
>> The SAI framework is flexible enough to allow plugging in an HNSW vector
>> index while enjoying the baseline benefits of SAI and playing nice with
>> the other SAI index types.
>>
>> What I have today
>>
>> I have a pre-alpha branch that demonstrates
>>
>> 1. A new data type, DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
>> 2. A new SAI index, VectorMemtableIndex
>> 3. A new CQL operator, ANN
>>
>> So this works:
>>
>>   cqlsh> create KEYSPACE test WITH replication = {'class':
>>          'SimpleStrategy', 'replication_factor': 1};
>>   cqlsh> use test;
>>   cqlsh:test> create table foo(i int primary key, j dense float32);
>>   cqlsh:test> create custom index ann_index on foo(j) using
>>          'StorageAttachedIndex';
>>   cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>>   cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>>   cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>>   cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>
>>    i | j
>>   ---+---------------------------------------------------------
>>    5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>
>> For anything else you can imagine asking "does it do X?", the answer is
>> no, it doesn't. It doesn't save to disk, it doesn't compose with other
>> predicates, it doesn't have driver support. That last is why the answer
>> comes back as raw hex bytes.
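Until driver support exists, those raw bytes can be decoded client-side. A sketch, assuming (from the output shown above, not from any documented format) that the serialized form is a four-byte big-endian dimension count followed by that many big-endian float32 values:

```python
import struct

def decode_dense_float32(raw: bytes) -> list[float]:
    """Decode a serialized vector. Assumed layout: 4-byte big-endian
    dimension count, then that many big-endian float32 values."""
    (dim,) = struct.unpack_from(">i", raw, 0)
    return list(struct.unpack_from(f">{dim}f", raw, 4))

# The bytes returned for row 5 in the cqlsh session above.
raw = b"\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a"
vector = decode_dense_float32(raw)  # approximately [23.0, 18.0, 3.9]
```

Decoding the bytes from the example yields the vector inserted for row 5, which is consistent with that layout.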
>> The CQL parser handling the inserts is server-side and doesn't need
>> driver support, which is why we can do inserts and queries at all.
>>
>> (If you're tempted to try this out, remember that "it doesn't save to
>> disk" means that if you do inserts first and then create the index,
>> instead of the other way around, it will break, because we flush existing
>> data as part of adding a new index to a table. So it's quite fragile at
>> the moment!)
>>
>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch.
>> I based it on the DataStax converged Cassandra repo because that's the
>> only place to get fully working SAI for now. We plan to upstream vector
>> search with the rest of SAI as soon as it's ready.
>>
>> What's next
>>
>> I think we can get to alpha pretty quickly:
>>
>> 1. Add support for flushing, and for serving queries from disk without
>>    reading the whole graph into memory. (Lucene does this already; we
>>    will probably have to rewrite a bunch of the code for use in
>>    Cassandra, but the design is vetted and works.)
>> 2. Make sure that scatter/gather works across nodes. This should Just
>>    Work, but I've already needed to tweak a couple of things that I
>>    thought would just work, so I'm calling this out.
>>
>> And then to be feature complete we would want to add:
>>
>> 1. CQL support for specifying the dimensionality of a column up front,
>>    e.g. DENSE FLOAT32(1500).
>>    1. Pinecone accepts the first vector as the Correct dimensions and
>>       throws an error if you give it one that doesn't match, but this
>>       doesn't work in a distributed system where the second vector might
>>       go to a different node than the first. Elastic requires specifying
>>       it up front like I propose here.
>> 2. Driver support
>> 3. Support for updating the vector in a row that hasn't been flushed yet,
>>    either by updating HNSW to support deletes, or by adding some kind of
>>    invalidation marker to the overwritten vector.
>> 4. More performant inserts. Currently I have a big synchronized lock
>>    around the HNSW graph. So we either need to shard it, like we do for
>>    the other SAI indexes, or add fine-grained locking like
>>    ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I
>>    prefer the second option, since it allows us to avoid building the
>>    graph once in memory and then a second time on flush, but it might be
>>    more work than it appears.
>> 5. Add index configurability options: similarity function and HNSW
>>    parameters M and ef.
>> 6. Expose the similarity functions to CQL so you can SELECT x,
>>    cosine_similarity(x, query_vector) FROM ...
>>
>> Special thanks to
>>
>> Mike Adamson and Zhao Yang, who made substantial contributions to the
>> branch you see here and provided indispensable help understanding SAI.
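The invalidation-marker option in item 3 could work roughly like this. A minimal sketch with toy names and an exact scan standing in for the HNSW search (none of this is from the branch): record overwritten ordinals in a dead set and filter them out of query results, since the graph itself cannot delete nodes.

```python
class InvalidatingVectorIndex:
    """Toy model of the "invalidation marker" idea: overwritten vectors
    stay in the index but are marked dead and filtered at query time."""

    def __init__(self):
        self.vectors = []         # ordinal -> vector (stand-in for the HNSW graph)
        self.row_to_ordinal = {}  # row key -> live ordinal
        self.dead = set()         # ordinals whose row was overwritten

    def upsert(self, row_key, vector):
        if row_key in self.row_to_ordinal:
            # Mark the old vector dead instead of removing it from the graph.
            self.dead.add(self.row_to_ordinal[row_key])
        self.vectors.append(vector)
        self.row_to_ordinal[row_key] = len(self.vectors) - 1

    def search(self, query, k):
        # Stand-in for an ANN query: rank every ordinal by squared distance.
        # A real implementation would over-fetch from HNSW so that filtering
        # the dead ordinals still yields k live results.
        def dist(o):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[o], query))
        candidates = sorted(range(len(self.vectors)), key=dist)
        return [o for o in candidates if o not in self.dead][:k]
```

The alternative in item 3, teaching HNSW itself to delete, avoids the wasted graph nodes but requires changing the library rather than wrapping it.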
>> <Vector search for Cassandra.pdf>
>>
>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 

Henrik Ingo

c. +358 40 569 7354

w. www.datastax.com

