Re: [POLL] Vector type for ML

Caleb Rackliffe Wed, 03 May 2023 13:08:03 -0700

To be clear, I support the general agreement David and Jonathan seem to
have reached.


On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com>
wrote:

> Did we agree on a CQL syntax?
>
> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> I like this approach. Thank you for those working on this vector search
>> initiative.
>>
>> Here's the feedback from my "user" hat for someone who is looking at
>> databases / indexes for my next LLM app.
>>
>> Can I take some python code and go from using an in memory vector store
>> like numpy or FAISS to something else? How easy is it for me to take my
>> python code and get it to work with this new external service which is no
>> longer just a library?
>> There are also tons of services that I can run on docker, e.g. milvus,
>> redissearch, typesense, elasticsearch, opensearch, and I may hit a hurdle
>> when trying to handle a lot more data, so I look at Cassandra Vector Search.
>> Because I am familiar with SQL, Cassandra looks appealing since I can
>> potentially use a "cql_agent" lib (to be created for langchain; we're
>> looking into that now) or an existing CassandraVectorStore class?
>>
>> In most of these scenarios, if people are using langchain or llamaindex,
>> the underlying implementation is not as important since we shield the user
>> from CQL data types except at schema creation, and most of these libs can
>> be opinionated and just suggest a generic schema.
>>
>> The ideal world is if I can just dump text into a field and do a natural
>> language query on it and have my DB do the embeddings for the document, and
>> then for the query for me. For now the libs can manage all that and they do
>> that well. We just need the interface to stay consistent and be relatively
>> easy to query in CQL. The most popular index in LLM retrieval augmented
>> patterns is pinecone. You make an index, you upsert, and then you query.
>> It's not assumed that you are also giving it content, though you can send
>> it metadata to have the document there.
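
The create / upsert / query workflow described above can be illustrated with a minimal in-memory sketch. This is not Pinecone's API or any Cassandra driver API: the class name `InMemoryVectorIndex`, its method names, and the brute-force cosine scoring are all hypothetical, chosen only to show the shape of the workflow.

```python
import math


class InMemoryVectorIndex:
    """Hypothetical brute-force index illustrating create / upsert / query."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # item id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        # Insert a new row or overwrite the existing one with the same id.
        if len(vector) != self.dimension:
            raise ValueError("wrong dimension")
        self._rows[item_id] = (list(vector), metadata or {})

    def query(self, vector, top_k=3):
        # Score every stored vector by cosine similarity, best first.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(cosine(vector, v), item_id, meta)
                  for item_id, (v, meta) in self._rows.items()]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]
```

Metadata is optional here, mirroring the point that the index does not assume it also holds the document content.
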
>>
>> If we can have a similar workflow, e.g. create a table with a vector type
>> OR create a table with an existing type and then add an index to it, no one
>> is going to lose sleep over it as long as it works. Having the ability to
>> take a table that has data and then add a vector index doesn't make it any
>> different than adding a new field, since I've got to calculate the
>> embeddings anyway.
>>
>> Would love to see what the CQL ends up looking like.
>> Rahul Singh
>>
>> Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
>> rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
>> http://calendly.com/xingh
>>
>> *We create, support, and manage real-time global data & analytics
>> platforms for the modern enterprise.*
>>
>> *Anant | https://anant.us <https://anant.us/>*
>>
>> 3 Washington Circle, Suite 301
>>
>> Washington, D.C. 20037
>>
>> *http://Cassandra.Link <http://cassandra.link/>* : The best resources
>> for Apache Cassandra
>>
>>
>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com>
>> wrote:
>>
>>> \o/
>>>
>>> Bring it in team. Group hug.
>>>
>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>>> is the only distributed database where you can do vector search in an ACID
>>> transaction.
>>>
>>> Patrick
>>>
>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>>> I had a call with David.  We agreed that we want a "vector" data type
>>>> with these properties
>>>>
>>>> - Fixed length
>>>> - No nulls
>>>> - Random access not supported
>>>>
>>>> Where we disagreed was on my proposal to restrict vectors to only
>>>> numeric data.  David's points were that
>>>>
>>>> (1) He has a use case today for a data type with the other vector
>>>> properties,
>>>> (2) It doesn't seem reasonable to create two data types with the same
>>>> properties, one of which is restricted to numerics, and
>>>> (3) The restrictions that I want for numeric vectors make more sense at
>>>> the index and function level, than at the type level.
>>>>
>>>> I'm ready to concede that David has the better case here and move
>>>> forward with a vector implementation without that restriction.
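
A minimal sketch of the agreed shape, assuming nothing about the actual patch: a vector type that enforces fixed length and no nulls for any element type, with the float-only restriction applied where an index is created rather than in the type itself. All names here (`ElementType`, `VectorType`, `validate_ann_index`) are hypothetical and are not Cassandra classes.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementType:
    """Hypothetical element type: a name plus a struct format for one value."""
    name: str
    fmt: str  # big-endian struct format for a single element


FLOAT32 = ElementType("float", ">f")
INT32 = ElementType("int", ">i")


@dataclass(frozen=True)
class VectorType:
    """Fixed-length, no-nulls vector over any element type (hypothetical)."""
    element: ElementType
    dimension: int

    def serialize(self, values):
        # Fixed length and no nulls are enforced at the type level.
        if len(values) != self.dimension:
            raise ValueError(f"expected {self.dimension} elements, got {len(values)}")
        if any(v is None for v in values):
            raise ValueError("vector elements must not be null")
        return b"".join(struct.pack(self.element.fmt, v) for v in values)

    def deserialize(self, data):
        size = struct.calcsize(self.element.fmt)
        if len(data) != size * self.dimension:
            raise ValueError("unexpected serialized length")
        # The whole value is decoded at once; no per-element access is offered.
        return [struct.unpack_from(self.element.fmt, data, i * size)[0]
                for i in range(self.dimension)]


def validate_ann_index(column_type):
    # The numeric (float) restriction lives at the index level, not in the type.
    if not isinstance(column_type, VectorType) or column_type.element != FLOAT32:
        raise TypeError("ANN index requires a vector of float elements")
```

With this split, `VectorType(INT32, 3)` is a valid column type, but `validate_ann_index` rejects it, so only the index carries the float-only rule.
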
>>>>
>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com>
>>>> wrote:
>>>>
>>>>>  How about it, David? Did you already make this?
>>>>>
>>>>>
>>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the
>>>>> impact to the POC patch was the following:
>>>>>
>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>>> type.composeForFloat(bb); both return float[]
>>>>> 2) change the index validate logic to move away from checking
>>>>> VectorType and instead check for that plus the element type == FloatType.
>>>>> I didn't bother to do this as it's trivial
>>>>>
>>>>> David. End this argument. SHOW THE CODE!
>>>>>
>>>>>
>>>>> If this argument ends and people are cool with vector supporting
>>>>> abstract type, more than glad to help get this in.
>>>>>
>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>>> original idea has a very very specific use case.  Do we have use cases for
>>>>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>>>>> wondered if generalizing provides value, beyond being straightforward to
>>>>> implement.  I'm just trying to be sensitive to the database code
>>>>> maintenance and driver support for general types versus a single type for
>>>>> a specific, well defined purpose.
>>>>>
>>>>> If it could easily be a plugin, that's great - but the full picture
>>>>> involves drivers that need to support it or you end up getting binary
>>>>> blobs you have to decode client side and then do stuff with.  So ideally if you
>>>>> have a well defined use case that you can build into the database, having
>>>>> it just be part of the database and associated drivers - that makes the
>>>>> experience much much better.
>>>>>
>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't
>>>>> be feasible.  I'm just trying to enlarge the picture a bit to see what
>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>
>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>>>>
>>>>> But it’s so trivial it was already implemented by David in the span of
>>>>> ten minutes? If anything, we’re slowing progress down by refusing to do
>>>>> the extra types, as we’re busy arguing about it rather than delivering a
>>>>> feature?
>>>>>
>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT
>>>>> (ever) support types beyond float. Not that we should start with float.
>>>>>
>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>
>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>
>>>>> 
>>>>> I'll speak up on that one. If you look at my ranked voting, that is
>>>>> where my head is. I get accused of scope creep (a lot) and looking at the
>>>>> initial proposal Jonathan put on the ML it was mostly "Developers are
>>>>> adopting vector search at a furious pace and I think I have a simple way
>>>>> of adding support to keep Cassandra relevant for these use cases." Instead
>>>>> of just focusing on this use case, I feel the arguments have bikeshedded
>>>>> into scope creep, which means it will take forever to get into the project.
>>>>>
>>>>> My preference is to see one thing validated with an MVP and get it
>>>>> into the hands of developers sooner so we can continue to iterate based on
>>>>> actual usage.
>>>>>
>>>>> It doesn't say your points are wrong or your opinions are broken, I'm
>>>>> voting for what I think will be awesome for users sooner.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>>>
>>>>>> Could folk voting against a general purpose type (that could well be
>>>>>> called a vector) briefly explain their reasoning?
>>>>>>
>>>>>> We established in the other thread that it’s technically trivial,
>>>>>> meaning folk must think it is strictly superior to only support float
>>>>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>>>>
>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem
>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>
>>>>>>
>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>
>>>>>> 
>>>>>> A > B > C on both polls.
>>>>>>
>>>>>> Having talked to several users in the community that are highly
>>>>>> excited about this change, this gets to what developers want to do at
>>>>>> Cassandra scale: store embeddings and retrieve them.
>>>>>>
>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <
>>>>>> adelap...@apache.org> wrote:
>>>>>>
>>>>>>> A > B > C
>>>>>>>
>>>>>>> I don't think that ML is such a niche application that it can't have
>>>>>>> its own CQL data type. Also, vectors are mathematical elements that have
>>>>>>> more applications than ML.
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Should we add a vector type to Cassandra designed to meet the
>>>>>>>>> needs of machine learning use cases, specifically feature and
>>>>>>>>> embedding vectors for training, inference, and vector search?
>>>>>>>>>
>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>>>> data type.
>>>>>>>>>
>>>>>>>>> This poll is to gauge consensus subsequent to the recent
>>>>>>>>> discussion thread at
>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>
>>>>>>>>> Please rank the discussed options from most preferred option to
>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B, followed
>>>>>>>>> by C) or C > B = A (C is my preference, followed by B or A
>>>>>>>>> approximately equally.)
>>>>>>>>>
>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>>>>> believe we need to tie it to any particular implementation details.
>>>>>>>>>
>>>>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>>>>> array types that compose with all Cassandra types first, and make
>>>>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>>>>
>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> A  > B > C
>>>>>>>>
>>>>>>>> B is stated as "must add array types…".  I think this is a bit
>>>>>>>> loaded.  If B was the (A + the implementation needs to be a non-null
>>>>>>>> frozen float32 array, serialisation forward compatible with other
>>>>>>>> frozen arrays later implemented) I would put this before (A).
>>>>>>>> Especially because it's been shown already this is easy to implement.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> co-founder, http://www.datastax.com
>>>> @spyced
>>>>
>>>
