Re: [POLL] Vector type for ML

Patrick McFadin Tue, 02 May 2023 13:33:22 -0700

Yeah, it's a bit of a mess but mailing list yo. People reading this would
have no idea we are friends. ;) (Which we are, for anyone reading this
later!)


I must have missed the point of this already being done. How about it,
David? Did you already make this?

"FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
support types beyond float. Not that we should start with float"
That is not my interpretation and I can definitely see how that may be
frustrating. If B is pretty much done then we are good. My concern, as
noted earlier, is the scope creep component that will delay this happening
for much longer.

David. End this argument. SHOW THE CODE!

Patrick


On Tue, May 2, 2023 at 1:04 PM Benedict <[email protected]> wrote:

> But it’s so trivial it was already implemented by David in the span of ten
> minutes? If anything, we’re slowing progress down by refusing to do the
> extra types, as we’re busy arguing about it rather than delivering a
> feature?
>
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> support types beyond float. Not that we should start with float.
>
> So, this whole debate is a mess, I think. But hey ho.
>
> On 2 May 2023, at 20:57, Patrick McFadin <[email protected]> wrote:
>
> 
> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases." Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict <[email protected]> wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>>
>> On 2 May 2023, at 20:20, Patrick McFadin <[email protected]> wrote:
>>
>> 
>> A > B > C on both polls.
>>
>> Having talked to several users in the community that are highly excited
>> about this change, this gets to what developers want to do at Cassandra
>> scale: store embeddings and retrieve them.
>>
>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <[email protected]>
>> wrote:
>>
>>> A > B > C
>>>
>>> I don't think that ML is such a niche application that it can't have its
>>> own CQL data type. Also, vectors are mathematical elements that have more
>>> applications than ML.
>>>
>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <[email protected]> wrote:
>>>>
>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>> machine learning use cases, specifically feature and embedding vectors for
>>>>> training, inference, and vector search?
>>>>>
>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>>> industry-leading special-purpose vector database ONLY supports that data
>>>>> type.
>>>>>
>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>> thread at
>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>
>>>>> Please rank the discussed options from most preferred option to least,
>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>
>>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>>> we need to tie it to any particular implementation details.
>>>>>
>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>> array types that compose with all Cassandra types first, and make vectors
>>>>> a special case of arrays-without-null-elements.
>>>>>
>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>
>>>>
>>>>
>>>>
>>>> A  > B > C
>>>>
>>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>>> If B were (A + the implementation needs to be a non-null frozen float32
>>>> array, with serialisation forward compatible with other frozen arrays
>>>> implemented later), I would put this before (A).  Especially because it's
>>>> already been shown that this is easy to implement.
>>>>
>>>>
>>>>
>>>
