To be clear, I support the general agreement David and Jonathan seem to have reached.
On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:

> Did we agree on a CQL syntax?
>
> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> I like this approach. Thank you to those working on this vector search
>> initiative.
>>
>> Here's the feedback from my "user" hat for someone who is looking at
>> databases / indexes for my next LLM app.
>>
>> Can I take some Python code and go from using an in-memory vector store
>> like numpy or FAISS to something else? How easy is it for me to take my
>> Python code and get it to work with this new external service which is no
>> longer just a library?
>> There are also tons of services that I can run on docker, e.g. milvus,
>> redissearch, typesense, elasticsearch, opensearch, and I may hit a hurdle
>> when trying to do a lot more data, so I look at Cassandra Vector Search.
>> Because I am familiar with SQL, Cassandra looks appealing since I can
>> potentially use a "cql_agent" lib (to be created for langchain and we're
>> looking into that now) or an existing CassandraVectorStore class?
>>
>> In most of these scenarios, if people are using langchain, llamaindex,
>> the underlying implementation is not as important since we shield the user
>> from CQL data types except at schema creation and most of these libs can be
>> opinionated and just suggest a generic schema.
>>
>> The ideal world is if I can just dump text into a field and do a natural
>> language query on it and have my DB do the embeddings for the document, and
>> then for the query for me. For now the libs can manage all that and they do
>> that well. We just need the interface to stay consistent and be relatively
>> easy to query in CQL. The most popular index in LLM retrieval-augmented
>> patterns is pinecone. You make an index, you upsert, and then you query.
>> It's not assumed that you are also giving it content, though you can send
>> it metadata to have the document there.
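
The index / upsert / query workflow described above can be sketched as follows. This is a toy, in-memory illustration only: `TinyVectorIndex`, `upsert`, and `query` are hypothetical names chosen for this sketch, not Pinecone's or Cassandra's API, and the brute-force cosine scan stands in for a real ANN index.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Hypothetical in-memory index: fixed dimension, brute-force search."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # item id -> (vector, metadata)

    def _check(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} values, got {len(vector)}"
            )

    def upsert(self, item_id, vector, metadata=None):
        # Insert a new row or overwrite an existing one with the same id.
        self._check(vector)
        self._rows[item_id] = ([float(x) for x in vector], metadata or {})

    def query(self, vector, top_k=3):
        # Score every stored vector against the query, best matches first.
        self._check(vector)
        scored = [
            (_cosine(vector, stored), item_id, meta)
            for item_id, (stored, meta) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(item_id, score, meta) for score, item_id, meta in scored[:top_k]]


# Make an index, upsert (metadata is optional), then query.
index = TinyVectorIndex(dimension=3)
index.upsert("doc-1", [1.0, 0.0, 0.0], {"text": "about cats"})
index.upsert("doc-2", [0.0, 1.0, 0.0])
index.upsert("doc-1", [0.9, 0.1, 0.0], {"text": "about cats, revised"})
hits = index.query([1.0, 0.0, 0.0], top_k=1)
```

The embeddings themselves are assumed to be computed by the caller, which matches the point above that the client libraries handle that step for now.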
>>
>> If we can have a similar workflow, e.g. create a table with a vector type
>> OR create a table with an existing type and then add an index to it, no one
>> is going to lose sleep over it as long as it works. Having the ability to
>> take a table that has data, and then add a vector index doesn't make it any
>> different than adding a new field since I've got to calculate the
>> embeddings anyways.
>>
>> Would love to see what the CQL ends up looking like.
>> Rahul Singh
>>
>> Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
>> rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
>> http://calendly.com/xingh
>>
>> *We create, support, and manage real-time global data & analytics
>> platforms for the modern enterprise.*
>>
>> *Anant | https://anant.us*
>>
>> 3 Washington Circle, Suite 301
>> Washington, D.C. 20037
>>
>> *http://Cassandra.Link* : The best resources for Apache Cassandra
>>
>>
>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com>
>> wrote:
>>
>>> \o/
>>>
>>> Bring it in team. Group hug.
>>>
>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>>> is the only distributed database you can do vector search in an ACID
>>> transaction.
>>>
>>> Patrick
>>>
>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>>> I had a call with David. We agreed that we want a "vector" data type
>>>> with these properties
>>>>
>>>> - Fixed length
>>>> - No nulls
>>>> - Random access not supported
>>>>
>>>> Where we disagreed was on my proposal to restrict vectors to only
>>>> numeric data.
>>>> David's points were that
>>>>
>>>> (1) He has a use case today for a data type with the other vector
>>>> properties,
>>>> (2) It doesn't seem reasonable to create two data types with the same
>>>> properties, one of which is restricted to numerics, and
>>>> (3) The restrictions that I want for numeric vectors make more sense at
>>>> the index and function level than at the type level.
>>>>
>>>> I'm ready to concede that David has the better case here and move
>>>> forward with a vector implementation without that restriction.
>>>>
>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com>
>>>> wrote:
>>>>
>>>>> How about it, David? Did you already make this?
>>>>>
>>>>>
>>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the
>>>>> impact on the POC patch was the following:
>>>>>
>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>>> type.composeForFloat(bb); both return float[]
>>>>> 2) change the index validate logic to move away from checking
>>>>> VectorType and instead check for that plus the element type == FloatType.
>>>>> I didn’t bother to do this as it’s trivial
>>>>>
>>>>> David. End this argument. SHOW THE CODE!
>>>>>
>>>>>
>>>>> If this argument ends and people are cool with vector supporting
>>>>> abstract type, more than glad to help get this in.
>>>>>
>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>>> original idea has a very very specific use case. Do we have use cases for
>>>>> a general purpose Vector/Array data structure? If so, awesome. I just
>>>>> wondered if generalizing provides value, beyond being straightforward to
>>>>> implement.
>>>>> I'm just trying to be sensitive to the database code maintenance and
>>>>> driver support for general types versus a single type for a specific,
>>>>> well defined purpose.
>>>>>
>>>>> If it could easily be a plugin, that's great - but the full picture
>>>>> involves drivers that need to support it or you end up getting binary
>>>>> blobs you have to decode client side and then do stuff with. So ideally
>>>>> if you have a well defined use case that you can build into the database,
>>>>> having it just be part of the database and associated drivers - that
>>>>> makes the experience much much better.
>>>>>
>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't
>>>>> be feasible. I'm just trying to enlarge the picture a bit to see what
>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>
>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>>>>
>>>>> But it’s so trivial it was already implemented by David in the span of
>>>>> ten minutes? If anything, we’re slowing progress down by refusing to do
>>>>> the extra types, as we’re busy arguing about it rather than delivering a
>>>>> feature?
>>>>>
>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT
>>>>> (ever) support types beyond float. Not that we should start with float.
>>>>>
>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>
>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> I'll speak up on that one. If you look at my ranked voting, that is
>>>>> where my head is.
>>>>> I get accused of scope creep (a lot) and looking at the initial
>>>>> proposal Jonathan put on the ML it was mostly "Developers are adopting
>>>>> vector search at a furious pace and I think I have a simple way of
>>>>> adding support to keep Cassandra relevant for these use cases." Instead
>>>>> of just focusing on this use case, I feel the arguments have bikeshedded
>>>>> into scope creep, which means it will take forever to get into the
>>>>> project.
>>>>>
>>>>> My preference is to see one thing validated with an MVP and get it
>>>>> into the hands of developers sooner so we can continue to iterate based on
>>>>> actual usage.
>>>>>
>>>>> It doesn't say your points are wrong or your opinions are broken, I'm
>>>>> voting for what I think will be awesome for users sooner.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>>>
>>>>>> Could folk voting against a general purpose type (that could well be
>>>>>> called a vector) briefly explain their reasoning?
>>>>>>
>>>>>> We established in the other thread that it’s technically trivial,
>>>>>> meaning folk must think it is strictly superior to only support float
>>>>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>>>>
>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem
>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>
>>>>>>
>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> A > B > C on both polls.
>>>>>>
>>>>>> Having talked to several users in the community that are highly
>>>>>> excited about this change, this gets to what developers want to do at
>>>>>> Cassandra scale: store embeddings and retrieve them.
>>>>>>
>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <
>>>>>> adelap...@apache.org> wrote:
>>>>>>
>>>>>>> A > B > C
>>>>>>>
>>>>>>> I don't think that ML is such a niche application that it can't have
>>>>>>> its own CQL data type. Also, vectors are mathematical elements that have
>>>>>>> more applications than ML.
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Should we add a vector type to Cassandra designed to meet the
>>>>>>>>> needs of machine learning use cases, specifically feature and
>>>>>>>>> embedding vectors for training, inference, and vector search?
>>>>>>>>>
>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>>>> data type.
>>>>>>>>>
>>>>>>>>> This poll is to gauge consensus subsequent to the recent
>>>>>>>>> discussion thread at
>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>
>>>>>>>>> Please rank the discussed options from most preferred option to
>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B, followed
>>>>>>>>> by C) or C > B = A (C is my preference, followed by B or A
>>>>>>>>> approximately equally.)
>>>>>>>>>
>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>>>>> believe we need to tie it to any particular implementation details.
>>>>>>>>>
>>>>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>>>>> array types that compose with all Cassandra types first, and make
>>>>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>>>>
>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>
>>>>>>>>
>>>>>>>> A > B > C
>>>>>>>>
>>>>>>>> B is stated as "must add array types…". I think this is a bit
>>>>>>>> loaded. If B was the (A + the implementation needs to be a non-null
>>>>>>>> frozen float32 array, serialisation forward compatible with other
>>>>>>>> frozen arrays later implemented) I would put this before (A).
>>>>>>>> Especially because it's been shown already this is easy to implement.
>>>>>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> co-founder, http://www.datastax.com
>>>> @spyced
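
For readers following the thread, the properties agreed on for the vector type (fixed length, no nulls, no random access) can be sketched for the float case as below. This is an illustration under stated assumptions, not the patch discussed above: the function names are hypothetical, and the layout (big-endian float32 values packed back to back, with no length prefix because the dimension is fixed by the type) is assumed here and is not taken from Cassandra's actual serialization format.

```python
import struct


def serialize_float_vector(values, dimension):
    """Pack exactly `dimension` non-null floats as big-endian float32 bytes."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("vector elements must not be null")
    return struct.pack(f">{dimension}f", *values)


def deserialize_float_vector(buf, dimension):
    """Decode the whole vector at once; there is no per-element access."""
    expected_size = 4 * dimension  # a float32 is 4 bytes
    if len(buf) != expected_size:
        raise ValueError(f"expected {expected_size} bytes, got {len(buf)}")
    return list(struct.unpack(f">{dimension}f", buf))


# Round trip with values that float32 represents exactly.
encoded = serialize_float_vector([0.5, -1.25, 2.0], dimension=3)
decoded = deserialize_float_vector(encoded, dimension=3)
```

Because the length is fixed, the byte size alone is enough to validate a value on read. Restricting an index to float elements, as discussed in the thread, would then be a separate check at the index level rather than part of the type.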