Re: [POLL] Vector type for ML

Jeremiah D Jordan Wed, 03 May 2023 13:24:29 -0700

> To be clear, I support the general agreement David and Jonathan seem to have 
> reached.


+1 as well.


> On May 3, 2023, at 3:07 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
> 
> To be clear, I support the general agreement David and Jonathan seem to have 
> reached.
> 
> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com 
> <mailto:calebrackli...@gmail.com>> wrote:
>> Did we agree on a CQL syntax?
>> 
>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh 
>> <rahul.xavier.si...@gmail.com <mailto:rahul.xavier.si...@gmail.com>> wrote:
>>> I like this approach. Thank you for those working on this vector search 
>>> initiative. 
>>> 
>>> Here's the feedback from my "user" hat for someone who is looking at 
>>> databases / indexes for my next LLM app. 
>>> 
>>> Can I take some python code and go from using an in memory vector store 
>>> like numpy or FAISS to something else? How easy is it for me to take my 
>>> python code and get it to work with this new external service which is no 
>>> longer just a library?
>>> There's also tons of services that I can run on docker e.g. milvus, 
>>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle 
>>> when trying to do a lot more data, so I look at Cassandra Vector Search. 
>>> Because I am familiar with SQL, Cassandra looks appealing since I can 
>>> potentially use "cql_agent" lib ( to be created for langchain and we're 
>>> looking into that now) or an existing CassandraVectorStore class?
>>> 
>>> In most of these scenarios, if people are using langchain, llamaindex, the 
>>> underlying implementation is not as important since we shield the user from 
>>> CQL data types except at schema creation and most of these libs can be 
>>> opinionated and just suggest a generic schema. 
>>> 
>>> The ideal world is if I can just dump text into a field and do a natural 
>>> language query on it and have my DB do the embeddings for the document, and 
>>> then for the query for me. For now the libs can manage all that and they do 
>>> that well. We just need the interface to stay consistent and be relatively 
>>> easy to query in CQL. The most popular index in LLM retrieval augmented 
>>> patterns is pinecone. You make an index, you upsert, and then you query. 
>>> It's not assumed that you are also giving it content, though you can send 
>>> it metadata to have the document there. 
>>> 
>>> If we can have a similar workflow e.g. create a table with a vector type OR 
>>> create a table with an existing type and then add an index to it, no one is 
>>> going to lose sleep over it as long as it works. Having the ability to take a 
>>> table that has data, and then add a vector index doesn't make it any 
>>> different than adding a new field since I've got to calculate the 
>>> embeddings anyways. 
>>> 
>>> Would love to see what the CQL ends up looking like. 
>>> Rahul Singh
>>> Chief Executive Officer | Business Platform Architect
>>> m: 202.905.2818 e: rahul.si...@anant.us li: http://linkedin.com/in/xingh
>>> ca: http://calendly.com/xingh
>>> We create, support, and manage real-time global data & analytics platforms 
>>> for the modern enterprise.
>>> 
>>> Anant | https://anant.us
>>> 3 Washington Circle, Suite 301
>>> Washington, D.C. 20037
>>> 
>>> http://Cassandra.Link : The best resources for Apache Cassandra
>>> 
>>> 
>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com 
>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>> \o/
>>>> 
>>>> Bring it in team. Group hug. 
>>>> 
>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra 
>>>> is the only distributed database you can do vector search in an ACID 
>>>> transaction. 
>>>> 
>>>> Patrick
>>>> 
>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com 
>>>> <mailto:jbel...@gmail.com>> wrote:
>>>>> I had a call with David.  We agreed that we want a "vector" data type 
>>>>> with these properties
>>>>> 
>>>>> - Fixed length
>>>>> - No nulls
>>>>> - Random access not supported
>>>>> 
>>>>> Where we disagreed was on my proposal to restrict vectors to only numeric 
>>>>> data.  David's points were that
>>>>> 
>>>>> (1) He has a use case today for a data type with the other vector 
>>>>> properties,
>>>>> (2) It doesn't seem reasonable to create two data types with the same 
>>>>> properties, one of which is restricted to numerics, and
>>>>> (3) The restrictions that I want for numeric vectors make more sense at 
>>>>> the index and function level, than at the type level.
>>>>> 
>>>>> I'm ready to concede that David has the better case here and move forward 
>>>>> with a vector implementation without that restriction.
>>>>> 
>>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com 
>>>>> <mailto:dcapw...@apple.com>> wrote:
>>>>>>>  How about it, David? Did you already make this?
>>>>>> 
>>>>>> I checked out the patch, fixed serialize/deserialize, added the 
>>>>>> constraints, then added a composeForFloat(ByteBuffer), with this the 
>>>>>> impact to the POC patch was the following
>>>>>> 
>>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to 
>>>>>> type.composeForFloat(bb), both return float[]
>>>>>> 2) change the index validate logic to move away from checking VectorType 
>>>>>> and instead check for that plus the element type == FloatType.  I didn’t 
>>>>>> bother to do this as it's trivial
>>>>>> 
>>>>>>> David. End this argument. SHOW THE CODE! 
>>>>>> 
>>>>>> If this argument ends and people are cool with vector supporting 
>>>>>> abstract type, more than glad to help get this in.
>>>>>> 
>>>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com 
>>>>>>> <mailto:jeremy.hanna1...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> I'm all for bringing more functionality to the masses sooner, but the 
>>>>>>> original idea has a very very specific use case.  Do we have use cases 
>>>>>>> for a general purpose Vector/Array data structure?  If so, awesome.  I 
>>>>>>> just wondered if generalizing provides value, beyond being 
>>>>>>> straightforward to implement.  I'm just trying to be sensitive to the 
>>>>>>> database code maintenance and driver support for general types versus a 
>>>>>>> single type for a specific, well defined purpose.
>>>>>>> 
>>>>>>> If it could easily be a plugin, that's great - but the full picture 
>>>>>>> involves drivers that need to support it or you end up getting binary 
>>>>>>> blobs you have to decode client side and then do stuff with.  So 
>>>>>>> ideally if you have a well defined use case that you can build into the 
>>>>>>> database, having it just be part of the database and associated drivers 
>>>>>>> - that makes the experience much much better.
>>>>>>> 
>>>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't 
>>>>>>> be feasible.  I'm just trying to enlarge the picture a bit to see what 
>>>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>>> 
>>>>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org 
>>>>>>>> <mailto:bened...@apache.org>> wrote:
>>>>>>>> 
>>>>>>>> But it’s so trivial it was already implemented by David in the span of 
>>>>>>>> ten minutes? If anything, we’re slowing progress down by refusing to 
>>>>>>>> do the extra types, as we’re busy arguing about it rather than 
>>>>>>>> delivering a feature?
>>>>>>>> 
>>>>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT 
>>>>>>>> (ever) support types beyond float. Not that we should start with float.
>>>>>>>> 
>>>>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>>>> 
>>>>>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com 
>>>>>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I'll speak up on that one. If you look at my ranked voting, that is 
>>>>>>>>> where my head is. I get accused of scope creep (a lot) and looking at 
>>>>>>>>> the initial proposal Jonathan put on the ML it was mostly "Developers 
>>>>>>>>> are adopting vector search at a furious pace and I think I have a 
>>>>>>>>> simple way of adding support to keep Cassandra relevant for these use 
>>>>>>>>> cases" Instead of just focusing on this use case, I feel the 
>>>>>>>>> arguments have bike shedded into scope creep which means it will take 
>>>>>>>>> forever to get into the project.
>>>>>>>>> 
>>>>>>>>> My preference is to see one thing validated with an MVP and get it 
>>>>>>>>> into the hands of developers sooner so we can continue to iterate 
>>>>>>>>> based on actual usage. 
>>>>>>>>> 
>>>>>>>>> It doesn't say your points are wrong or your opinions are broken, I'm 
>>>>>>>>> voting for what I think will be awesome for users sooner. 
>>>>>>>>> 
>>>>>>>>> Patrick
>>>>>>>>> 
>>>>>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org 
>>>>>>>>> <mailto:bened...@apache.org>> wrote:
>>>>>>>>>> Could folk voting against a general purpose type (that could well be 
>>>>>>>>>> called a vector) briefly explain their reasoning?
>>>>>>>>>> 
>>>>>>>>>> We established in the other thread that it’s technically trivial, 
>>>>>>>>>> meaning folk must think it is strictly superior to only support 
>>>>>>>>>> float rather than eg all numeric types (note: for the type, not the 
>>>>>>>>>> ANN). 
>>>>>>>>>> 
>>>>>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem 
>>>>>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com 
>>>>>>>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> A > B > C on both polls. 
>>>>>>>>>>> 
>>>>>>>>>>> Having talked to several users in the community that are highly 
>>>>>>>>>>> excited about this change, this gets to what developers want to do 
>>>>>>>>>>> at Cassandra scale: store embeddings and retrieve them. 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>>>>>>>>>>> <adelap...@apache.org <mailto:adelap...@apache.org>> wrote:
>>>>>>>>>>>> A > B > C
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't think that ML is such a niche application that it can't 
>>>>>>>>>>>> have its own CQL data type. Also, vectors are mathematical 
>>>>>>>>>>>> elements that have more applications than ML.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org 
>>>>>>>>>>>> <mailto:m...@apache.org>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com 
>>>>>>>>>>>>> <mailto:jbel...@gmail.com>> wrote:
>>>>>>>>>>>>>> Should we add a vector type to Cassandra designed to meet the 
>>>>>>>>>>>>>> needs of machine learning use cases, specifically feature and 
>>>>>>>>>>>>>> embedding vectors for training, inference, and vector search?  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of 
>>>>>>>>>>>>>> numeric types, with no nulls allowed, and with no need for 
>>>>>>>>>>>>>> random access. The ML industry overwhelmingly uses float32 
>>>>>>>>>>>>>> vectors, to the point that the industry-leading special-purpose 
>>>>>>>>>>>>>> vector database ONLY supports that data type.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This poll is to gauge consensus subsequent to the recent 
>>>>>>>>>>>>>> discussion thread at 
>>>>>>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Please rank the discussed options from most preferred option to 
>>>>>>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B, 
>>>>>>>>>>>>>> followed by C) or C > B = A (C is my preference, followed by B 
>>>>>>>>>>>>>> or A approximately equally.)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not 
>>>>>>>>>>>>>> believe we need to tie it to any particular implementation 
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (B) I am okay with adding a vector type but I believe we must 
>>>>>>>>>>>>>> add array types that compose with all Cassandra types first, and 
>>>>>>>>>>>>>> make vectors a special case of arrays-without-null-elements.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A  > B > C
>>>>>>>>>>>>> 
>>>>>>>>>>>>> B is stated as "must add array types…".  I think this is a bit 
>>>>>>>>>>>>> loaded.  If B was the (A + the implementation needs to be a 
>>>>>>>>>>>>> non-null frozen float32 array, serialisation forward compatible 
>>>>>>>>>>>>> with other frozen arrays later implemented) I would put this 
>>>>>>>>>>>>> before (A).  Especially because it's been shown already this is 
>>>>>>>>>>>>> easy to implement.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jonathan Ellis
>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>> @spyced
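
For readers skimming the thread, the three properties Jonathan and David agreed on (fixed length, no nulls, random access not supported) can be sketched in a few lines. This is only an illustrative Python sketch, not Cassandra's VectorType or its serialization format: the class name FixedFloatVector, its methods, and the big-endian float32 layout are all assumptions made up for the example. It also covers only the float case, whereas the agreement in the thread was to not restrict the type to numerics.

```python
import struct
from typing import List, Sequence


class FixedFloatVector:
    """Hypothetical fixed-length, null-free float32 vector (illustration only)."""

    def __init__(self, dimension: int, values: Sequence[float]):
        if dimension <= 0:
            raise ValueError("dimension must be positive")
        # Fixed length: the element count must match the declared dimension.
        if len(values) != dimension:
            raise ValueError(f"expected {dimension} elements, got {len(values)}")
        # No nulls: every element must be present.
        if any(v is None for v in values):
            raise ValueError("null elements are not allowed")
        self.dimension = dimension
        # Round-trip through float32 so the stored values match the serialized form.
        packed = struct.pack(f">{dimension}f", *values)
        self._values = struct.unpack(f">{dimension}f", packed)

    # No __getitem__ on purpose: the vector is only read or written as a whole,
    # which mirrors "random access not supported".

    def serialize(self) -> bytes:
        """Pack all elements as contiguous big-endian float32 (assumed layout)."""
        return struct.pack(f">{self.dimension}f", *self._values)

    @classmethod
    def deserialize(cls, dimension: int, data: bytes) -> "FixedFloatVector":
        """Rebuild a vector from bytes; the size must be exactly 4 * dimension."""
        if len(data) != 4 * dimension:
            raise ValueError("byte length does not match the declared dimension")
        return cls(dimension, struct.unpack(f">{dimension}f", data))

    def to_list(self) -> List[float]:
        """Return a copy of the whole vector."""
        return list(self._values)
```

With this sketch, FixedFloatVector(3, [1.0, 2.5, -0.5]) serializes to 12 bytes and round-trips through deserialize, while a wrong element count or a None element is rejected at construction time. Because the length is fixed by the declared dimension, no per-element length prefix is needed in this assumed layout.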
