+1 as well.

> On May 3, 2023, at 3:07 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>
> To be clear, I support the general agreement David and Jonathan seem to have
> reached.
>
> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>> Did we agree on a CQL syntax?
>>
>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh
>> <rahul.xavier.si...@gmail.com> wrote:
>>> I like this approach. Thank you to those working on this vector search
>>> initiative.
>>>
>>> Here's the feedback from my "user" hat for someone who is looking at
>>> databases / indexes for my next LLM app.
>>>
>>> Can I take some python code and go from using an in-memory vector store
>>> like numpy or FAISS to something else? How easy is it for me to take my
>>> python code and get it to work with this new external service which is no
>>> longer just a library?
>>> There's also tons of services that I can run on docker e.g. milvus,
>>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle
>>> when trying to do a lot more data, so I look at Cassandra Vector Search.
>>> Because I am familiar with SQL, Cassandra looks appealing since I can
>>> potentially use "cql_agent" lib (to be created for langchain and we're
>>> looking into that now) or an existing CassandraVectorStore class?
>>>
>>> In most of these scenarios, if people are using langchain, llamaindex, the
>>> underlying implementation is not as important since we shield the user from
>>> CQL data types except at schema creation and most of these libs can be
>>> opinionated and just suggest a generic schema.
>>>
>>> The ideal world is if I can just dump text into a field and do a natural
>>> language query on it and have my DB do the embeddings for the document, and
>>> then for the query for me. For now the libs can manage all that and they do
>>> that well.
>>> We just need the interface to stay consistent and be relatively
>>> easy to query in CQL. The most popular index in LLM retrieval augmented
>>> patterns is pinecone. You make an index, you upsert, and then you query.
>>> It's not assumed that you are also giving it content, though you can send
>>> it metadata to have the document there.
>>>
>>> If we can have a similar workflow, e.g. create a table with a vector type OR
>>> create a table with an existing type and then add an index to it, no one is
>>> going to lose sleep over it as long as it works. Having the ability to take a
>>> table that has data, and then add a vector index doesn't make it any
>>> different than adding a new field since I've got to calculate the
>>> embeddings anyways.
>>>
>>> Would love to see what the CQL ends up looking like.
>>> Rahul Singh
>>> Chief Executive Officer | Business Platform Architect
>>> m: 202.905.2818 e: rahul.si...@anant.us li: http://linkedin.com/in/xingh
>>> ca: http://calendly.com/xingh
>>> We create, support, and manage real-time global data & analytics platforms
>>> for the modern enterprise.
>>>
>>> Anant | https://anant.us
>>> 3 Washington Circle, Suite 301
>>> Washington, D.C. 20037
>>>
>>> http://Cassandra.Link : The best resources for Apache Cassandra
>>>
>>>
>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> \o/
>>>>
>>>> Bring it in team. Group hug.
>>>>
>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>>>> is the only distributed database you can do vector search in an ACID
>>>> transaction.
>>>>
>>>> Patrick
>>>>
>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>> I had a call with David. We agreed that we want a "vector" data type
>>>>> with these properties
>>>>>
>>>>> - Fixed length
>>>>> - No nulls
>>>>> - Random access not supported
>>>>>
>>>>> Where we disagreed was on my proposal to restrict vectors to only numeric
>>>>> data. David's points were that
>>>>>
>>>>> (1) He has a use case today for a data type with the other vector
>>>>> properties,
>>>>> (2) It doesn't seem reasonable to create two data types with the same
>>>>> properties, one of which is restricted to numerics, and
>>>>> (3) The restrictions that I want for numeric vectors make more sense at
>>>>> the index and function level, than at the type level.
>>>>>
>>>>> I'm ready to concede that David has the better case here and move forward
>>>>> with a vector implementation without that restriction.
>>>>>
>>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com> wrote:
>>>>>>> How about it, David? Did you already make this?
>>>>>>
>>>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the
>>>>>> impact to the POC patch was the following
>>>>>>
>>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>>>> type.composeForFloat(bb), both return float[]
>>>>>> 2) change the index validate logic to move away from checking VectorType
>>>>>> and instead check for that plus the element type == FloatType. I didn’t
>>>>>> bother to do this as it's trivial
>>>>>>
>>>>>>> David. End this argument. SHOW THE CODE!
>>>>>>
>>>>>> If this argument ends and people are cool with vector supporting
>>>>>> abstract type, more than glad to help get this in.
>>>>>>
>>>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>>>>>
>>>>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>>>>> original idea has a very very specific use case. Do we have use cases
>>>>>>> for a general purpose Vector/Array data structure? If so, awesome. I
>>>>>>> just wondered if generalizing provides value, beyond being
>>>>>>> straightforward to implement. I'm just trying to be sensitive to the
>>>>>>> database code maintenance and driver support for general types versus a
>>>>>>> single type for a specific, well defined purpose.
>>>>>>>
>>>>>>> If it could easily be a plugin, that's great - but the full picture
>>>>>>> involves drivers that need to support it or you end up getting binary
>>>>>>> blobs you have to decode client side and then do stuff with. So
>>>>>>> ideally if you have a well defined use case that you can build into the
>>>>>>> database, having it just be part of the database and associated drivers
>>>>>>> - that makes the experience much much better.
>>>>>>>
>>>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't
>>>>>>> be feasible.
>>>>>>> I'm just trying to enlarge the picture a bit to see what
>>>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>>>
>>>>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>>>>>>>
>>>>>>>> But it’s so trivial it was already implemented by David in the span of
>>>>>>>> ten minutes? If anything, we’re slowing progress down by refusing to
>>>>>>>> do the extra types, as we’re busy arguing about it rather than
>>>>>>>> delivering a feature?
>>>>>>>>
>>>>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT
>>>>>>>> (ever) support types beyond float. Not that we should start with float.
>>>>>>>>
>>>>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>>>>
>>>>>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'll speak up on that one. If you look at my ranked voting, that is
>>>>>>>>> where my head is. I get accused of scope creep (a lot) and looking at
>>>>>>>>> the initial proposal Jonathan put on the ML it was mostly "Developers
>>>>>>>>> are adopting vector search at a furious pace and I think I have a
>>>>>>>>> simple way of adding support to keep Cassandra relevant for these use
>>>>>>>>> cases" Instead of just focusing on this use case, I feel the
>>>>>>>>> arguments have bike shedded into scope creep which means it will take
>>>>>>>>> forever to get into the project.
>>>>>>>>>
>>>>>>>>> My preference is to see one thing validated with an MVP and get it
>>>>>>>>> into the hands of developers sooner so we can continue to iterate
>>>>>>>>> based on actual usage.
>>>>>>>>>
>>>>>>>>> It doesn't say your points are wrong or your opinions are broken, I'm
>>>>>>>>> voting for what I think will be awesome for users sooner.
>>>>>>>>>
>>>>>>>>> Patrick
>>>>>>>>>
>>>>>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>>>>>>>> Could folk voting against a general purpose type (that could well be
>>>>>>>>>> called a vector) briefly explain their reasoning?
>>>>>>>>>>
>>>>>>>>>> We established in the other thread that it’s technically trivial,
>>>>>>>>>> meaning folk must think it is strictly superior to only support
>>>>>>>>>> float rather than e.g. all numeric types (note: for the type, not the
>>>>>>>>>> ANN).
>>>>>>>>>>
>>>>>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem
>>>>>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>>>>>
>>>>>>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> A > B > C on both polls.
>>>>>>>>>>>
>>>>>>>>>>> Having talked to several users in the community that are highly
>>>>>>>>>>> excited about this change, this gets to what developers want to do
>>>>>>>>>>> at Cassandra scale: store embeddings and retrieve them.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña
>>>>>>>>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>> A > B > C
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think that ML is such a niche application that it can't
>>>>>>>>>>>> have its own CQL data type. Also, vectors are mathematical
>>>>>>>>>>>> elements that have more applications than ML.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>>>>>>>>> Should we add a vector type to Cassandra designed to meet the
>>>>>>>>>>>>>> needs of machine learning use cases, specifically feature and
>>>>>>>>>>>>>> embedding vectors for training, inference, and vector search?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of
>>>>>>>>>>>>>> numeric types, with no nulls allowed, and with no need for
>>>>>>>>>>>>>> random access. The ML industry overwhelmingly uses float32
>>>>>>>>>>>>>> vectors, to the point that the industry-leading special-purpose
>>>>>>>>>>>>>> vector database ONLY supports that data type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This poll is to gauge consensus subsequent to the recent
>>>>>>>>>>>>>> discussion thread at
>>>>>>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please rank the discussed options from most preferred option to
>>>>>>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B,
>>>>>>>>>>>>>> followed by C) or C > B = A (C is my preference, followed by B
>>>>>>>>>>>>>> or A approximately equally.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>>>>>>>>>> believe we need to tie it to any particular implementation
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (B) I am okay with adding a vector type but I believe we must
>>>>>>>>>>>>>> add array types that compose with all Cassandra types first, and
>>>>>>>>>>>>>> make vectors a special case of arrays-without-null-elements.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A > B > C
>>>>>>>>>>>>>
>>>>>>>>>>>>> B is stated as "must add array types…". I think this is a bit
>>>>>>>>>>>>> loaded. If B was the (A + the implementation needs to be a
>>>>>>>>>>>>> non-null frozen float32 array, serialisation forward compatible
>>>>>>>>>>>>> with other frozen arrays later implemented) I would put this
>>>>>>>>>>>>> before (A). Especially because it's been shown already this is
>>>>>>>>>>>>> easy to implement.
>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> co-founder, http://www.datastax.com
>>>>> @spyced
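
For readers following the thread, the properties Jonathan and David agreed on (fixed length, no nulls, no random access) and the composeForFloat(ByteBuffer) accessor David mentions can be illustrated with a rough sketch. This is not the POC patch and not Cassandra's VectorType: the class name FixedFloatVector, its methods other than the composeForFloat name, and the big-endian float32 layout are all assumptions made for illustration only.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch of a fixed-length float32 vector codec.
 * Illustration only; this is NOT the Cassandra VectorType from the patch.
 */
final class FixedFloatVector {
    private final int dimension;

    FixedFloatVector(int dimension) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        this.dimension = dimension;
    }

    /**
     * Encodes exactly `dimension` floats. A primitive float[] cannot hold
     * null elements, so only the array itself and its length are checked.
     */
    ByteBuffer serialize(float[] values) {
        if (values == null || values.length != dimension) {
            throw new IllegalArgumentException(
                "expected exactly " + dimension + " elements");
        }
        // Assumed layout: consecutive 4-byte floats in ByteBuffer's default
        // (big-endian) byte order, with no per-element length prefix.
        ByteBuffer out = ByteBuffer.allocate(dimension * Float.BYTES);
        for (float v : values) {
            out.putFloat(v);
        }
        out.flip();
        return out;
    }

    /**
     * Decodes the whole vector at once and returns float[]. There is
     * deliberately no per-element accessor (no random access).
     */
    float[] composeForFloat(ByteBuffer bytes) {
        if (bytes.remaining() != dimension * Float.BYTES) {
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes");
        }
        // Read through a duplicate so the caller's position is not moved.
        ByteBuffer view = bytes.duplicate();
        float[] values = new float[dimension];
        for (int i = 0; i < dimension; i++) {
            values[i] = view.getFloat();
        }
        return values;
    }
}
```

With this sketch, `new FixedFloatVector(3).serialize(new float[] {1f, 2f, 3f})` yields a 12-byte buffer that composeForFloat turns back into the same three floats, while an array of any other length is rejected. Restricting the element type to float here mirrors only the float[] accessor; per the agreement above, the real type is not limited to numerics.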