My guess is that I will be able to get this ready to upstream before the
rest of CEP-7 goes in, so it would make sense to me to roll it into that.

On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:

> Interesting proposal Jonathan. Will grok it over the weekend and play
> around with the branch.
>
> Do you intend to make this part of CEP-7 or as an incremental update to
> SAI once it is committed?
>
> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>
>
> Happy Friday, everyone! Rich text formatting ahead; I've attached a PDF
> for those who prefer that.
>
> I propose adding approximate nearest neighbor (ANN) vector search
> capability to Apache Cassandra via storage-attached indexes (SAI). This is
> a medium-sized effort that will significantly enhance Cassandra’s
> functionality, particularly for AI use cases. This addition will not only
> provide a new and important feature for existing Cassandra users, but also
> attract new users to the community from the AI space, further expanding
> Cassandra’s reach and relevance.
>
> Introduction
>
> Vector search is a powerful document search technique that enables
> developers to quickly find relevant content within an extensive collection
> of documents. It is useful as a standalone technique, but it is
> particularly hot now because it significantly enhances the performance of
> LLMs.
>
> Vector search uses ML models to match the semantics of a question rather
> than just the words it contains, avoiding the classic false positives and
> false negatives associated with term-based search. Alessandro Benedetti
> gives some good examples in his excellent talk
> <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>
> You can search across any set of vectors, which are just ordered sets of
> numbers. In the context of natural language queries and document search,
> we are specifically concerned with a type of vector called an embedding.
> An embedding is a high-dimensional vector that captures the underlying
> semantic relationships and contextual information of words or phrases.
> Embeddings are generated by ML models trained for this purpose; OpenAI
> provides an API to do this, but open-source and self-hostable models like
> BERT are also popular. Creating more accurate and smaller embeddings is an
> active research area in ML.
>
> Large language models (LLMs) can be described as a mile wide and an inch
> deep. They are not experts on any narrow domain (although they will
> hallucinate that they are, sometimes convincingly). You can remedy this by
> giving the LLM additional context for your query, but the context window
> is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be
> very selective about giving the LLM the most relevant possible
> information.
>
> Vector search is red-hot now because it allows us to easily answer the
> question “what are the most relevant documents to provide as context?” by
> performing a similarity search between the embedding vector of the query
> and those of your document universe.
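That similarity search is easy to sketch in plain Python. This toy version (not part of the proposal; the document names, scores, and three-dimensional "embeddings" are made up for illustration) uses cosine similarity, one common choice of similarity function:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); higher means more similar
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_nearest(query, documents, k=1):
    # Exact k-nearest-neighbor search: scores every document, O(n) per query
    scored = sorted(documents,
                    key=lambda d: cosine_similarity(query, d["embedding"]),
                    reverse=True)
    return scored[:k]

docs = [
    {"id": "a", "embedding": [0.9, 0.1, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0, 0.2]},
    {"id": "c", "embedding": [0.8, 0.2, 0.1]},
]
print(exact_nearest([1.0, 0.0, 0.0], docs, k=2))  # "a" first, then "c"
```

The linear scan in `exact_nearest` is exactly what does not scale, which is where the next paragraph picks up.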
> Doing exact search is prohibitively expensive, since you necessarily have
> to compare with each and every document; this is intractable when you
> have billions or trillions of docs. However, there are well-understood
> algorithms for turning this into a logarithmic problem if you are willing
> to accept approximately the most similar documents. This is the
> “approximate nearest neighbor” problem. (You will see these referred to
> as kNN – k nearest neighbors – or ANN.) Pinecone DB has a good example of
> what this looks like in Python code
> <https://docs.pinecone.io/docs/gen-qa-openai>.
>
> Vector search is the foundation underlying effectively all of the AI
> applications that are launching now. This is particularly relevant to
> Apache Cassandra users, who tend to manage the types of large datasets
> that benefit the most from fast similarity search. Adding vector search
> to Cassandra’s unique strengths of scale, reliability, and low latency
> will further enhance its appeal and effectiveness for these users while
> also making it more attractive to newcomers looking to harness AI’s
> potential. The faster we deliver vector search, the more valuable it will
> be for this expanding user base.
>
> Requirements
>
> 1. Perform vector search as outlined in the Pinecone example above
>    1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL
>       type. This is also useful for “classic” ML applications that derive
>       and serve their own feature vectors
>    2. Add ANN (approximate nearest neighbor) search
> 2. Work with normal Cassandra data flow
>    1. Inserting one row at a time is fine; cannot require batch ingest
>    2. Updating and deleting rows is also fine
> 3. Must compose with other SAI predicates as well as partition keys
>
> Not requirements
>
> 1. Other datatypes besides Float32. Pinecone supports only Float32 and
>    it’s hugely ahead in mindshare, so let’s make things easy on ourselves
>    and follow their precedent
> 2. I don’t want to scope creep beyond ANN. In particular, I don’t want to
>    wait for ORDER BY to get exact search in as well
>
> How we can do this
>
> There is exactly one production-quality, actively maintained,
> state-of-the-art ANN library for Java: Lucene’s HNSW. Elastic has been GA
> with ANN search backed by this library for just over a year; Solr added
> it not long after. As usual with Lucene code, it is not super concerned
> with making it usable by third parties like Cassandra, but HNSW offers
> public APIs that are low-level enough that we don’t have to bring in
> Lucene baggage that we don’t want.
>
> The SAI framework is flexible enough to allow plugging in an HNSW vector
> index while enjoying the baseline benefits of SAI and playing nice with
> the other SAI index types.
>
> What I have today
>
> I have a pre-alpha branch that demonstrates:
>
> 1. A new data type, DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
> 2. A new SAI index, VectorMemtableIndex
> 3. A new CQL operator, ANN
>
> So this works:
>
> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> cqlsh> use test;
> cqlsh:test> create table foo(i int primary key, j dense float32);
> cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>
>  i | j
> ---+---------------------------------------------------------
>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>
> For anything else you can imagine asking “does it do X,” the answer is
> no, it doesn’t. It doesn’t save to disk, it doesn’t compose with other
> predicates, and it doesn’t have driver support. That last is why the
> answer comes back as raw hex bytes.
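For the curious, those raw bytes decode by hand. The sketch below assumes the on-wire form is a big-endian int32 element count followed by that many big-endian IEEE-754 float32 values; that assumption is mine, not stated in the proposal, but it reproduces the vector inserted for row 5:

```python
import struct

# Raw bytes returned for row 5 in the cqlsh session above
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

# Assumed layout: big-endian int32 count, then `count` big-endian float32s
(count,) = struct.unpack('>i', raw[:4])
values = struct.unpack(f'>{count}f', raw[4:])
print(count, values)  # 3 (23.0, 18.0, 3.9000000953674316)
```

The 3.9000000953674316 is just 3.9 rounded to float32 precision; a driver with proper support would hand you the float list directly instead of hex bytes.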
> The CQL parser handling the inserts is server-side and doesn’t need
> driver support, which is why we can do inserts and queries at all.
>
> (If you’re tempted to try this out, remember that “it doesn’t save to
> disk” means that if you do inserts first and then create the index,
> instead of the other way around, it will break, because we flush existing
> data as part of adding a new index to a table. So it’s quite fragile at
> the moment!)
>
> The branch is public at
> https://github.com/datastax/cassandra/tree/vsearch
> <https://github.com/jbellis/cassandra/tree/vsearch>. I based it on the
> DataStax converged Cassandra repo because that’s the only place to get
> fully working SAI for now. We plan to upstream vector search with the
> rest of SAI as soon as it’s ready.
>
> What’s next
>
> I think we can get to alpha pretty quickly:
>
> 1. Add support for flushing, and serving queries from disk without
>    reading the whole graph into memory. (Lucene does this already; we
>    will probably have to rewrite a bunch of the code for use in
>    Cassandra, but the design is vetted and works.)
> 2. Make sure that scatter/gather works across nodes. This should Just
>    Work, but I’ve already needed to tweak a couple of things that I
>    thought would just work, so I’m calling this out.
>
> And then to be feature complete we would want to add:
>
> 1. CQL support for specifying the dimensionality of a column up front,
>    e.g. DENSE FLOAT32(1500). Pinecone accepts the first vector as the
>    Correct dimensions and throws an error if you give it one that doesn’t
>    match, but this doesn’t work in a distributed system where the second
>    vector might go to a different node than the first. Elastic requires
>    specifying it up front like I propose here.
> 2. Driver support.
> 3. Support for updating the vector in a row that hasn’t been flushed yet,
>    either by updating HNSW to support deletes, or by adding some kind of
>    invalidation marker to the overwritten vector.
> 4. More performant inserts. Currently I have a big synchronized lock
>    around the HNSW graph, so we either need to shard it, like we do for
>    the other SAI indexes, or add fine-grained locking like
>    ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I
>    prefer the second option, since it allows us to avoid building the
>    graph once in memory and then a second time on flush, but it might be
>    more work than it appears.
> 5. Add index configurability options: similarity function and HNSW
>    parameters M and ef.
> 6. Expose the similarity functions to CQL so you can SELECT x,
>    cosine_similarity(x, query_vector) FROM …
>
> Special thanks to
>
> Mike Adamson and Zhao Yang, who made substantial contributions to the
> branch you see here and provided indispensable help understanding SAI.
> <Vector search for Cassandra.pdf>
>
>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
