Very exciting. Love it. "Retrieval-augmented" or "data-augmented" LLMs are the easiest way to "fine-tune" the output of an LLM without actually fine-tuning. Currently Pinecone/Weaviate/Milvus are eating up the scene, with new players like Chroma coming out soon. We've been working with LangChain / LlamaIndex for data ingestion to power this method, primarily with Pinecone, while using Cassandra as the intermediary store, especially when trying out different chunking lengths for the text retrieval. Being first to have a solid distributed option will be useful not only for people who are currently using vector databases, but will also help others start leveraging LLMs with existing infrastructure on Cassandra/Astra etc.
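For readers unfamiliar with the flow described above (chunk documents, embed them, store the vectors, retrieve the closest chunks as LLM context), here is a minimal self-contained sketch. The `embed` function is a toy stand-in (a hypothetical keyword counter) for a real embedding model, and an in-memory list stands in for Pinecone or Cassandra; none of this is LangChain or LlamaIndex code.

```python
import math

VOCAB = ["cassandra", "vector", "pinecone", "llm", "coffee"]

def embed(text):
    # Toy stand-in for a real embedding model (e.g. OpenAI's API or BERT):
    # counts occurrences of a fixed vocabulary. Real embeddings are dense,
    # high-dimensional, and learned.
    words = text.lower().split()
    return [sum(1 for w in words if v in w) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Ingest": chunk the source material and store (embedding, chunk) pairs.
chunks = [
    "Cassandra is a distributed database",
    "Pinecone is a vector database for LLM retrieval",
    "Coffee is best brewed fresh",
]
store = [(embed(c), c) for c in chunks]

def retrieve(query, k=1):
    # Return the k chunks most similar to the query embedding; these
    # would be prepended to the LLM prompt as context.
    q = embed(query)
    return [c for _, c in sorted(store, key=lambda e: -cosine(e[0], q))[:k]]

print(retrieve("which vector database works with an llm?"))
```

Trying different chunking lengths, as mentioned above, amounts to re-running the "ingest" step with differently sized `chunks` and comparing retrieval quality.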
This message was not written by ChatGPT

Rahul Singh
Chief Executive Officer | Business Platform Architect
m: 202.905.2818  e: rahul.si...@anant.us
li: http://linkedin.com/in/xingh  ca: http://calendly.com/xingh

We create, support, and manage real-time global data & analytics platforms for the modern enterprise.

Anant | https://anant.us
3 Washington Circle, Suite 301, Washington, D.C. 20037

http://Cassandra.Link : The best resources for Apache Cassandra

On Mon, Apr 24, 2023 at 5:17 PM David Capwell <dcapw...@apple.com> wrote:

> This work sounds interesting. I would recommend decoupling the types from
> the ANN support, as the types require client changes and can go in now
> (which would give a lot of breathing room to get this ready for 5.0),
> whereas ANN depends on SAI, which is still being worked on.
>
> On Apr 22, 2023, at 1:02 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> If you're interested in playing with HNSW outside of a super-alpha
> Cassandra branch, I put up a repo with some sample code here:
> https://github.com/jbellis/hnswdemo
>
> On Fri, Apr 21, 2023 at 4:19 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Happy Friday, everyone! Rich-text formatting ahead; I've attached a PDF
>> for those who prefer that.
>>
>> I propose adding approximate nearest neighbor (ANN) vector search
>> capability to Apache Cassandra via storage-attached indexes (SAI). This
>> is a medium-sized effort that will significantly enhance Cassandra’s
>> functionality, particularly for AI use cases.
>> This addition will not only provide a new and important feature for
>> existing Cassandra users, but will also attract new users to the
>> community from the AI space, further expanding Cassandra’s reach and
>> relevance.
>>
>> Introduction
>>
>> Vector search is a powerful document search technique that enables
>> developers to quickly find relevant content within an extensive
>> collection of documents. It is useful as a standalone technique, but it
>> is particularly hot now because it significantly enhances the
>> performance of LLMs.
>>
>> Vector search uses ML models to match the semantics of a question rather
>> than just the words it contains, avoiding the classic false positives
>> and false negatives associated with term-based search. Alessandro
>> Benedetti gives some good examples in his excellent talk
>> <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>>
>> You can search across any set of vectors, which are just ordered sets of
>> numbers. In the context of natural language queries and document search,
>> we are specifically concerned with a type of vector called an embedding.
>> An embedding is a high-dimensional vector that captures the underlying
>> semantic relationships and contextual information of words or phrases.
>> Embeddings are generated by ML models trained for this purpose; OpenAI
>> provides an API to do this, but open-source and self-hostable models
>> like BERT are also popular. Creating more accurate and smaller
>> embeddings is an active research area in ML.
>>
>> Large language models (LLMs) can be described as a mile wide and an inch
>> deep. They are not experts on any narrow domain (although they will
>> hallucinate that they are, sometimes convincingly).
>> You can remedy this by giving the LLM additional context for your
>> query, but the context window is small (4k tokens for GPT-3.5, up to 32k
>> for GPT-4), so you want to be very selective about giving the LLM the
>> most relevant possible information.
>>
>> Vector search is red-hot now because it allows us to easily answer the
>> question “what are the most relevant documents to provide as context” by
>> performing a similarity search between the embedding vector of the query
>> and those of your document universe. Doing exact search is prohibitively
>> expensive, since you necessarily have to compare with each and every
>> document; this is intractable when you have billions or trillions of
>> docs. However, there are well-understood algorithms for turning this
>> into a logarithmic problem if you are willing to accept approximately
>> the most similar documents. This is the “approximate nearest neighbor”
>> problem. (You will see these referred to as kNN – k nearest neighbors –
>> or ANN.) Pinecone DB has a good example of what this looks like in
>> Python code <https://docs.pinecone.io/docs/gen-qa-openai>.
>>
>> Vector search is the foundation underlying effectively all of the AI
>> applications that are launching now. This is particularly relevant to
>> Apache Cassandra users, who tend to manage the types of large datasets
>> that benefit the most from fast similarity search. Adding vector search
>> to Cassandra’s unique strengths of scale, reliability, and low latency
>> will further enhance its appeal and effectiveness for these users while
>> also making it more attractive to newcomers looking to harness AI’s
>> potential. The faster we deliver vector search, the more valuable it
>> will be for this expanding user base.
>>
>> Requirements
>>
>> 1. Perform vector search as outlined in the Pinecone example above
>>    a. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>>       (This is also useful for “classic” ML applications that derive
>>       and serve their own feature vectors.)
>>    b. Add ANN (approximate nearest neighbor) search
>> 2. Work with normal Cassandra data flow
>>    a. Inserting one row at a time is fine; cannot require batch ingest
>>    b. Updating and deleting rows is also fine
>> 3. Must compose with other SAI predicates as well as partition keys
>>
>> Not requirements
>>
>> 1. Other datatypes besides Float32
>>    Pinecone supports only Float32, and it’s hugely ahead in mindshare,
>>    so let’s make things easy on ourselves and follow their precedent.
>> 2. I don’t want to scope-creep beyond ANN. In particular, I don’t want
>>    to wait for ORDER BY to get exact search in as well.
>>
>> How we can do this
>>
>> There is exactly one production-quality, actively maintained,
>> state-of-the-art ANN library for Java: Lucene’s HNSW. Elastic has been
>> GA with ANN search backed by this library for just over a year; Solr
>> added it not long after. As usual with Lucene code, it is not super
>> concerned with making it usable by third parties like Cassandra, but
>> HNSW offers public APIs that are low-level enough that we don’t have to
>> bring in Lucene baggage that we don’t want.
>>
>> The SAI framework is flexible enough to allow plugging in an HNSW vector
>> index while enjoying the baseline benefits of SAI and playing nice with
>> the other SAI index types.
>>
>> What I have today
>>
>> I have a pre-alpha branch that demonstrates:
>> 1. A new data type: DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
>> 2. A new SAI index: VectorMemtableIndex
>> 3. A new CQL operator: ANN
>>
>> So this works:
>>
>> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
>> cqlsh> use test;
>> cqlsh:test> create table foo(i int primary key, j dense float32);
>> cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
>> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
>> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
>> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
>> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>>
>>  i | j
>> ---+---------------------------------------------------------
>>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>>
>> For anything else you can imagine asking "does it do X", the answer is
>> no, it doesn't. It doesn't save to disk, it doesn't compose with other
>> predicates, and it doesn't have driver support. That last is why the
>> answer comes back as raw hex bytes. The CQL parser handling the inserts
>> is server-side and doesn't need driver support, which is why we can do
>> inserts and queries at all.
>>
>> (If you’re tempted to try this out, remember that “it doesn’t save to
>> disk” means that if you do inserts first and then create the index,
>> instead of the other way around, it will break, because we flush
>> existing data as part of adding a new index to a table. So it’s quite
>> fragile at the moment!)
>>
>> The branch is public at https://github.com/datastax/cassandra/tree/vsearch.
>> I based it on the DataStax converged Cassandra repo because that’s the
>> only place to get fully working SAI for now. We plan to upstream vector
>> search with the rest of SAI as soon as it’s ready.
>>
>> What’s next
>>
>> I think we can get to alpha pretty quickly:
>> 1. Add support for flushing, and serving queries from disk without
>>    reading the whole graph into memory. (Lucene does this already; we
>>    will probably have to rewrite a bunch of the code for use in
>>    Cassandra, but the design is vetted and works.)
>> 2. Make sure that scatter/gather works across nodes. This should Just
>>    Work, but I’ve already needed to tweak a couple of things that I
>>    thought would just work, so I’m calling this out.
>>
>> And then to be feature-complete we would want to add:
>> 1. CQL support for specifying the dimensionality of a column up front,
>>    e.g. DENSE FLOAT32(1500).
>>    Pinecone accepts the first vector it receives as defining the correct
>>    dimensionality, and throws an error if you give it one that doesn’t
>>    match, but this doesn’t work in a distributed system where the second
>>    vector might go to a different node than the first. Elastic requires
>>    specifying it up front, like I propose here.
>> 2. Driver support.
>> 3. Support for updating the vector in a row that hasn’t been flushed
>>    yet, either by updating HNSW to support deletes, or by adding some
>>    kind of invalidation marker to the overwritten vector.
>> 4. More performant inserts. Currently I have a big synchronized lock
>>    around the HNSW graph. So we either need to shard it, like we do for
>>    the other SAI indexes, or add fine-grained locking like
>>    ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish. I
>>    prefer the second option, since it allows us to avoid building the
>>    graph once in memory and then a second time on flush, but it might be
>>    more work than it appears.
>> 5. Add index configurability options: similarity function and HNSW
>>    parameters M and ef.
>> 6. Expose the similarity functions to CQL so you can
>>    SELECT x, cosine_similarity(x, query_vector) FROM …
>>
>> Special thanks to
>>
>> Mike Adamson and Zhao Yang made substantial contributions to the branch
>> you see here and provided indispensable help understanding SAI.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
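[Editor's note] For readers new to HNSW, mentioned in the "How we can do this" section above: its core routine is a greedy search over a proximity graph, repeated across layers. Below is a single-layer sketch with toy 2-D points. Real HNSW caps each node at M neighbors and keeps a beam of ef candidates to avoid bad local minima; this shows only the greedy descent, with all data invented for illustration.

```python
import math

# A tiny proximity graph: node id -> coordinates, and node id -> neighbors.
# HNSW maintains a graph like this per layer; search greedily walks toward
# the query instead of scanning every point.
points = {0: (0, 0), 1: (1, 0), 2: (2, 1), 3: (3, 3), 4: (5, 5), 5: (4, 1)}
edges = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [2, 4]}

def greedy_search(query, entry=0):
    # Move to whichever neighbor is closest to the query; stop when no
    # neighbor improves on the current node (a local minimum).
    current = entry
    while True:
        best = min(edges[current] + [current],
                   key=lambda n: math.dist(points[n], query))
        if best == current:
            return current
        current = best

print(greedy_search((4.5, 4.5)))
```

Each query touches only a handful of nodes along the walk rather than all N points, which is the source of the "logarithmic problem" framing in the proposal.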
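[Editor's note] The raw bytes returned by the cqlsh demo in "What I have today" can be decoded by hand: they appear to be a 4-byte big-endian element count followed by big-endian IEEE-754 float32 values, matching the vector [23, 18, 3.9] inserted for row 5. (The layout here is inferred from the printed output, not read from the branch's serialization code.)

```python
import struct

# Raw value returned for row 5 above; without driver support the vector
# comes back as undecoded bytes.
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

count = struct.unpack('>i', raw[:4])[0]        # 4-byte big-endian count
values = struct.unpack(f'>{count}f', raw[4:])  # big-endian float32 values
print(count, values)  # count=3, values ≈ (23.0, 18.0, 3.9)
```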
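[Editor's note] Item 3 of the feature-complete list above mentions an "invalidation marker" as an alternative to teaching HNSW to delete. A toy sketch of that idea follows, with entirely hypothetical names and a brute-force scan standing in for the graph traversal: overwritten vectors stay in the append-only structure but are filtered out of search results.

```python
class VectorIndexSketch:
    """Hypothetical illustration of the invalidation-marker idea;
    not code from the vsearch branch."""

    def __init__(self):
        self.vectors = []        # append-only "graph nodes"
        self.row_for_node = []   # node id -> row key
        self.node_for_row = {}   # latest node per row
        self.invalid = set()     # invalidated (overwritten) node ids

    def insert(self, row, vector):
        if row in self.node_for_row:
            # Overwrite: mark the old node invalid instead of deleting it,
            # since the underlying graph structure can't remove nodes.
            self.invalid.add(self.node_for_row[row])
        node = len(self.vectors)
        self.vectors.append(vector)
        self.row_for_node.append(row)
        self.node_for_row[row] = node

    def search(self, query, limit):
        # Brute-force stand-in for the ANN traversal; the point is the
        # filter that skips invalidated nodes.
        def d2(v):
            return sum((a - b) ** 2 for a, b in zip(v, query))
        live = [n for n in range(len(self.vectors)) if n not in self.invalid]
        live.sort(key=lambda n: d2(self.vectors[n]))
        return [self.row_for_node[n] for n in live[:limit]]

idx = VectorIndexSketch()
idx.insert(1, [8, 2.3, 58])
idx.insert(2, [1.2, 3.4, 5.6])
idx.insert(2, [100, 100, 100])   # update row 2 before any flush
print(idx.search([1.0, 3.0, 5.0], limit=1))
```

Without the marker, the stale [1.2, 3.4, 5.6] vector would still win this query for row 2; with it, the search correctly returns row 1.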