Very exciting. Love it. "Retrieval-augmented" or "data-augmented" LLMs are the easiest way to "fine-tune" the output of an LLM without actually fine-tuning. Currently Pinecone/Weaviate/Milvus are eating up the scene, with new players like Chroma coming out soon. We've been working with LangChain / LlamaIndex for data ingestion to power this method, primarily with Pinecone, while using Cassandra as the intermediary store, especially when trying out different chunking lengths for the text retrieval. Being first to have a solid distributed option will be useful not only for people who are currently using vector databases, but will also help others start leveraging LLMs with existing infrastructure on Cassandra/Astra etc.
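For readers unfamiliar with the flow described above (chunk documents, embed them, store the vectors, retrieve the closest chunks as LLM context), here is a minimal self-contained sketch. The `embed` function is a toy stand-in (a hypothetical keyword counter) for a real embedding model, and an in-memory list stands in for Pinecone or Cassandra; none of this is LangChain or LlamaIndex code.

```python
import math

VOCAB = ["cassandra", "vector", "pinecone", "llm", "coffee"]

def embed(text):
    # Toy stand-in for a real embedding model (e.g. OpenAI's API or BERT):
    # counts occurrences of a fixed vocabulary. Real embeddings are dense,
    # high-dimensional, and learned.
    words = text.lower().split()
    return [sum(1 for w in words if v in w) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Ingest": chunk the source material and store (embedding, chunk) pairs.
chunks = [
    "Cassandra is a distributed database",
    "Pinecone is a vector database for LLM retrieval",
    "Coffee is best brewed fresh",
]
store = [(embed(c), c) for c in chunks]

def retrieve(query, k=1):
    # Return the k chunks most similar to the query embedding; these
    # would be prepended to the LLM prompt as context.
    q = embed(query)
    return [c for _, c in sorted(store, key=lambda e: -cosine(e[0], q))[:k]]

print(retrieve("which vector database works with an llm?"))
```

Trying different chunking lengths, as mentioned above, amounts to re-running the "ingest" step with differently sized `chunks` and comparing retrieval quality.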
This message was not written by ChatGPT

Rahul Singh
Chief Executive Officer | Business Platform Architect
m: 202.905.2818  e: rahul.si...@anant.us
li: http://linkedin.com/in/xingh  ca: http://calendly.com/xingh

We create, support, and manage real-time global data & analytics platforms for the modern enterprise.

Anant | https://anant.us
3 Washington Circle, Suite 301, Washington, D.C. 20037

http://Cassandra.Link : The best resources for Apache Cassandra

On Mon, Apr 24, 2023 at 5:17 PM David Capwell <dcapw...@apple.com> wrote:

> This work sounds interesting. I would recommend decoupling the types from
> the ANN support, as the types require client changes and can go in now
> (which would give a lot of breathing room to get this ready for 5.0),
> whereas ANN depends on SAI, which is still being worked on.
>
> On Apr 22, 2023, at 1:02 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> If you're interested in playing with HNSW outside of a super-alpha
> Cassandra branch, I put up a repo with some sample code here:
> https://github.com/jbellis/hnswdemo
>
> On Fri, Apr 21, 2023 at 4:19 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Happy Friday, everyone! Rich-text formatting ahead; I've attached a PDF
>> for those who prefer that.
>>
>> I propose adding approximate nearest neighbor (ANN) vector search
>> capability to Apache Cassandra via storage-attached indexes (SAI). This
>> is a medium-sized effort that will significantly enhance Cassandra’s
>> functionality, particularly for AI use cases.
>> This addition will not only provide a new and important feature for
>> existing Cassandra users, but will also attract new users to the
>> community from the AI space, further expanding Cassandra’s reach and
>> relevance.
>>
>> Introduction
>>
>> Vector search is a powerful document search technique that enables
>> developers to quickly find relevant content within an extensive
>> collection of documents. It is useful as a standalone technique, but it
>> is particularly hot now because it significantly enhances the
>> performance of LLMs.
>>
>> Vector search uses ML models to match the semantics of a question rather
>> than just the words it contains, avoiding the classic false positives
>> and false negatives associated with term-based search. Alessandro
>> Benedetti gives some good examples in his excellent talk
>> <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>>
>> You can search across any set of vectors, which are just ordered sets of
>> numbers. In the context of natural language queries and document search,
>> we are specifically concerned with a type of vector called an embedding.
>> An embedding is a high-dimensional vector that captures the underlying
>> semantic relationships and contextual information of words or phrases.
>> Embeddings are generated by ML models trained for this purpose; OpenAI
>> provides an API to do this, but open-source and self-hostable models
>> like BERT are also popular. Creating more accurate and smaller
>> embeddings is an active research area in ML.
>>
>> Large language models (LLMs) can be described as a mile wide and an inch
>> deep. They are not experts on any narrow domain (although they will
>> hallucinate that they are, sometimes convincingly).
>> You can remedy this by giving the LLM additional context for your
>> query, but the context window is small (4k tokens for GPT-3.5, up to 32k
>> for GPT-4), so you want to be very selective about giving the LLM the
>> most relevant possible information.
>>
>> Vector search is red-hot now because it allows us to easily answer the
>> question “what are the most relevant documents to provide as context” by
>> performing a similarity search between the embedding vector of the query
>> and those of your document universe. Doing exact search is prohibitively
>> expensive, since you necessarily have to compare with each and every
>> document; this is intractable when you have billions or trillions of
>> docs. However, there are well-understood algorithms for turning this
>> into a logarithmic problem if you are willing to accept approximately
>> the most similar documents. This is the “approximate nearest neighbor”
>> problem. (You will see these referred to as kNN – k nearest neighbors –
>> or ANN.) Pinecone DB has a good example of what this looks like in
>> Python code <https://docs.pinecone.io/docs/gen-qa-openai>.
>>
>> Vector search is the foundation underlying effectively all of the AI
>> applications that are launching now. This is particularly relevant to
>> Apache Cassandra users, who tend to manage the types of large datasets
>> that benefit the most from fast similarity search. Adding vector search
>> to Cassandra’s unique strengths of scale, reliability, and low latency
>> will further enhance its appeal and effectiveness for these users while
>> also making it more attractive to newcomers looking to harness AI’s
>> potential. The faster we deliver vector search, the more valuable it
>> will be for this expanding user base.
>>
>> Requirements
>>
>> 1. Perform vector search as outlined in the Pinecone example above
>>    a. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>>       (This is also useful for “classic” ML applications that derive
>>       and serve their own feature vectors.)
>>    b. Add ANN (approximate nearest neighbor) search
>> 2. Work with normal Cassandra data flow
>>    a. Inserting one row at a time is fine; cannot require batch ingest
>>    b. Updating and deleting rows is also fine
>> 3. Must compose with other SAI predicates as well as partition keys
>>
>> Not requirements
>>
>> 1. Other datatypes besides Float32
>>    Pinecone supports only Float32, and it’s hugely ahead in mindshare,
>>    so let’s make things easy on ourselves and follow their precedent.
>> 2. I don’t want to scope-creep beyond ANN. In particular, I don’t want
>>    to wait for ORDER BY to get exact search in as well.
>>
>> How we can do this
>>
>> There is exactly one production-quality, actively maintained,
>> state-of-the-art ANN library for Java: Lucene’s HNSW. Elastic has been
>> GA with ANN search backed by this library for just over a year; Solr
>> added it not long after. As usual with Lucene code, it is not super
>> concerned with making it usable by third parties like Cassandra, but
>> HNSW offers public APIs that are low-level enough that we don’t have to
>> bring in Lucene baggage that we don’t want.
>>
>> The SAI framework is flexible enough to allow plugging in an HNSW vector
>> index while enjoying the baseline benefits of SAI and playing nice with
>> the other SAI index types.
>>
>> What I have today
>>
>> I have a pre-alpha branch that demonstrates:
>> 1. A new data type: DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
>> 2. A new SAI index: VectorMemtableIndex
>> 3. A new CQL operator: ANN
>>
>> So this works:
>>
>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
>> cqlsh> use test;
>> cqlsh:test> create table foo(i int primary key, j dense float32);
>> cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>
>>  i | j
>> ---+---------------------------------------------------------
>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>
>> For anything else you can imagine asking "does it do X", the answer is
>> no, it doesn't. It doesn't save to disk, it doesn't compose with other
>> predicates, and it doesn't have driver support. That last is why the
>> answer comes back as raw hex bytes. The CQL parser handling the inserts
>> is server-side and doesn't need driver support, which is why we can do
>> inserts and queries at all.
>>
>> (If you’re tempted to try this out, remember that “it doesn’t save to
>> disk” means that if you do inserts first and then create the index,
>> instead of the other way around, it will break, because we flush
>> existing data as part of adding a new index to a table. So it’s quite
>> fragile at the moment!)
>>
>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch.
>> I based it on the DataStax converged Cassandra repo because that’s the
>> only place to get fully working SAI for now. We plan to upstream vector
>> search with the rest of SAI as soon as it’s ready.
>>
>> What’s next
>>
>> I think we can get to alpha pretty quickly:
>> 1. Add support for flushing, and serving queries from disk without
>>    reading the whole graph into memory. (Lucene does this already; we
>>    will probably have to rewrite a bunch of the code for use in
>>    Cassandra, but the design is vetted and works.)
>> 2. Make sure that scatter/gather works across nodes. This should Just
>>    Work, but I’ve already needed to tweak a couple of things that I
>>    thought would just work, so I’m calling this out.
>>
>> And then to be feature-complete we would want to add:
>> 1. CQL support for specifying the dimensionality of a column up front,
>>    e.g. DENSE FLOAT32(1500).
>>    Pinecone accepts the first vector it receives as defining the correct
>>    dimensionality, and throws an error if you give it one that doesn’t
>>    match, but this doesn’t work in a distributed system where the second
>>    vector might go to a different node than the first. Elastic requires
>>    specifying it up front, like I propose here.
>> 2. Driver support.
>> 3. Support for updating the vector in a row that hasn’t been flushed
>>    yet, either by updating HNSW to support deletes, or by adding some
>>    kind of invalidation marker to the overwritten vector.
>> 4. More performant inserts. Currently I have a big synchronized lock
>>    around the HNSW graph. So we either need to shard it, like we do for
>>    the other SAI indexes, or add fine-grained locking like
>>    ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I
>>    prefer the second option, since it allows us to avoid building the
>>    graph once in memory and then a second time on flush, but it might be
>>    more work than it appears.
>> 5. Add index configurability options: similarity function and HNSW
>>    parameters M and ef.
>> 6. Expose the similarity functions to CQL so you can
>>    SELECT x, cosine_similarity(x, query_vector) FROM …
>>
>> Special thanks to
>>
>> Mike Adamson and Zhao Yang made substantial contributions to the branch
>> you see here and provided indispensable help understanding SAI.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
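[Editor's note] For readers new to HNSW, mentioned in the "How we can do this" section above: its core routine is a greedy search over a proximity graph, repeated across layers. Below is a single-layer sketch with toy 2-D points. Real HNSW caps each node at M neighbors and keeps a beam of ef candidates to avoid bad local minima; this shows only the greedy descent, with all data invented for illustration.

```python
import math

# A tiny proximity graph: node id -> coordinates, and node id -> neighbors.
# HNSW maintains a graph like this per layer; search greedily walks toward
# the query instead of scanning every point.
points = {0: (0, 0), 1: (1, 0), 2: (2, 1), 3: (3, 3), 4: (5, 5), 5: (4, 1)}
edges = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [2, 4]}

def greedy_search(query, entry=0):
    # Move to whichever neighbor is closest to the query; stop when no
    # neighbor improves on the current node (a local minimum).
    current = entry
    while True:
        best = min(edges[current] + [current],
                   key=lambda n: math.dist(points[n], query))
        if best == current:
            return current
        current = best

print(greedy_search((4.5, 4.5)))
```

Each query touches only a handful of nodes along the walk rather than all N points, which is the source of the "logarithmic problem" framing in the proposal.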
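[Editor's note] The raw bytes returned by the cqlsh demo in "What I have today" can be decoded by hand: they appear to be a 4-byte big-endian element count followed by big-endian IEEE-754 float32 values, matching the vector [23, 18, 3.9] inserted for row 5. (The layout here is inferred from the printed output, not read from the branch's serialization code.)

```python
import struct

# Raw value returned for row 5 above; without driver support the vector
# comes back as undecoded bytes.
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

count = struct.unpack('>i', raw[:4])[0]        # 4-byte big-endian count
values = struct.unpack(f'>{count}f', raw[4:])  # big-endian float32 values
print(count, values)  # count=3, values ≈ (23.0, 18.0, 3.9)
```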
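[Editor's note] Item 3 of the feature-complete list above mentions an "invalidation marker" as an alternative to teaching HNSW to delete. A toy sketch of that idea follows, with entirely hypothetical names and a brute-force scan standing in for the graph traversal: overwritten vectors stay in the append-only structure but are filtered out of search results.

```python
class VectorIndexSketch:
    """Hypothetical illustration of the invalidation-marker idea;
    not code from the vsearch branch."""

    def __init__(self):
        self.vectors = []        # append-only "graph nodes"
        self.row_for_node = []   # node id -> row key
        self.node_for_row = {}   # latest node per row
        self.invalid = set()     # invalidated (overwritten) node ids

    def insert(self, row, vector):
        if row in self.node_for_row:
            # Overwrite: mark the old node invalid instead of deleting it,
            # since the underlying graph structure can't remove nodes.
            self.invalid.add(self.node_for_row[row])
        node = len(self.vectors)
        self.vectors.append(vector)
        self.row_for_node.append(row)
        self.node_for_row[row] = node

    def search(self, query, limit):
        # Brute-force stand-in for the ANN traversal; the point is the
        # filter that skips invalidated nodes.
        def d2(v):
            return sum((a - b) ** 2 for a, b in zip(v, query))
        live = [n for n in range(len(self.vectors)) if n not in self.invalid]
        live.sort(key=lambda n: d2(self.vectors[n]))
        return [self.row_for_node[n] for n in live[:limit]]

idx = VectorIndexSketch()
idx.insert(1, [8, 2.3, 58])
idx.insert(2, [1.2, 3.4, 5.6])
idx.insert(2, [100, 100, 100])   # update row 2 before any flush
print(idx.search([1.0, 3.0, 5.0], limit=1))
```

Without the marker, the stale [1.2, 3.4, 5.6] vector would still win this query for row 2; with it, the search correctly returns row 1.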