Re: [POLL] Vector type for ML

Patrick McFadin Tue, 02 May 2023 15:39:37 -0700

\o/

Bring it in team. Group hug.


Now if you'll excuse me, I'm going to go build my preso on how Cassandra is
the only distributed database where you can do vector search in an ACID
transaction.

Patrick

On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <[email protected]> wrote:

> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
>
> - Fixed length
> - No nulls
> - Random access not supported
>
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
>
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at
> the index and function level, than at the type level.
>
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
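
To make the agreed properties concrete, here is a minimal Java sketch of a vector type that is fixed length, rejects nulls, and accepts any element type, with the float-only restriction applied at the index level rather than the type level. The class and method names (VectorTypeSketch, validate, supportsAnnIndex) are hypothetical and are not taken from the patch or from Cassandra's code; treat it as an illustration of the split, not the implementation.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; none of these names come from Cassandra.
final class VectorTypeSketch {
    final Class<?> elementType; // any element type is allowed at the type level
    final int dimension;        // fixed length, declared up front

    VectorTypeSketch(Class<?> elementType, int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.elementType = Objects.requireNonNull(elementType);
        this.dimension = dimension;
    }

    // Type-level constraints: exact length, no nulls, elements of the declared type.
    void validate(List<?> value) {
        if (value.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + value.size());
        for (Object element : value) {
            if (element == null)
                throw new IllegalArgumentException("null elements are not allowed");
            if (!elementType.isInstance(element))
                throw new IllegalArgumentException(
                    "element is not a " + elementType.getSimpleName());
        }
    }

    // Index-level restriction (assumed here): only float vectors can be ANN-indexed.
    boolean supportsAnnIndex() {
        return elementType == Float.class;
    }
}
```

With this split, a vector of strings is a valid value of the type, but only a vector of floats would pass the index check.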
>
> On Tue, May 2, 2023 at 4:03 PM David Capwell <[email protected]> wrote:
>
>>  How about it, David? Did you already make this?
>>
>>
>> I checked out the patch, fixed serialize/deserialize, added the
>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>> to the POC patch was the following
>>
>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>> type.composeForFloat(bb), both return float[]
>> 2) change the index validate logic to move away from checking VectorType
>> and instead check for that plus the element type == FloatType.  I didn’t
>> bother to do this as it's trivial
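
For readers following along, a rough Java sketch of what a composeForFloat(ByteBuffer) style decoder could look like. The wire layout assumed here (consecutive 4-byte floats in the buffer's byte order, no per-element length prefix), the explicit dimension parameter, and the class name FloatVectorCodecSketch are all assumptions for illustration; the actual patch may differ.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; the layout and names are assumptions, not the patch.
final class FloatVectorCodecSketch {

    // Writes each float as 4 bytes, back to back, with no length prefix.
    static ByteBuffer serialize(float[] vector) {
        ByteBuffer bb = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float f : vector)
            bb.putFloat(f);
        bb.flip(); // make the written bytes readable
        return bb;
    }

    // Decodes straight to float[], rejecting buffers of the wrong size.
    static float[] composeForFloat(ByteBuffer bb, int dimension) {
        if (bb.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes, got " + bb.remaining());
        float[] out = new float[dimension];
        // Absolute reads leave the caller's buffer position untouched.
        for (int i = 0; i < dimension; i++)
            out[i] = bb.getFloat(bb.position() + i * Float.BYTES);
        return out;
    }
}
```

Because the length is fixed and nulls are not allowed, the decoder needs only a size check before reading, which is part of why the change to the POC was small.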
>>
>> David. End this argument. SHOW THE CODE!
>>
>>
>> If this argument ends and people are cool with vector supporting abstract
>> type, more than glad to help get this in.
>>
>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <[email protected]>
>> wrote:
>>
>> I'm all for bringing more functionality to the masses sooner, but the
>> original idea has a very very specific use case.  Do we have use cases for
>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>> wondered if generalizing provides value, beyond being straightforward to
>> implement.  I'm just trying to be sensitive to the database code
>> maintenance and driver support for general types versus a single type for a
>> specific, well defined purpose.
>>
>> If it could easily be a plugin, that's great - but the full picture
>> involves drivers that need to support it or you end up getting binary blobs
>> you have to decode client side and then do stuff with.  So ideally if you
>> have a well defined use case that you can build into the database, having
>> it just be part of the database and associated drivers - that makes the
>> experience much much better.
>>
>> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
>> feasible.  I'm just trying to enlarge the picture a bit to see what that
>> means for this use case and for the supporting drivers/clients.
>>
>> On May 2, 2023, at 3:04 PM, Benedict <[email protected]> wrote:
>>
>> But it’s so trivial it was already implemented by David in the span of
>> ten minutes? If anything, we’re slowing progress down by refusing to do the
>> extra types, as we’re busy arguing about it rather than delivering a
>> feature?
>>
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
>> support types beyond float. Not that we should start with float.
>>
>> So, this whole debate is a mess, I think. But hey ho.
>>
>> On 2 May 2023, at 20:57, Patrick McFadin <[email protected]> wrote:
>>
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where
>> my head is. I get accused of scope creep (a lot) and looking at the initial
>> proposal Jonathan put on the ML it was mostly "Developers are adopting
>> vector search at a furious pace and I think I have a simple way of adding
>> support to keep Cassandra relevant for these use cases." Instead of just
>> focusing on this use case, I feel the arguments have bike shedded into
>> scope creep which means it will take forever to get into the project.
>>
>> My preference is to see one thing validated with an MVP and get it into
>> the hands of developers sooner so we can continue to iterate based on
>> actual usage.
>>
>> It doesn't say your points are wrong or your opinions are broken, I'm
>> voting for what I think will be awesome for users sooner.
>>
>> Patrick
>>
>> On Tue, May 2, 2023 at 12:29 PM Benedict <[email protected]> wrote:
>>
>>> Could folk voting against a general purpose type (that could well be
>>> called a vector) briefly explain their reasoning?
>>>
>>> We established in the other thread that it’s technically trivial,
>>> meaning folk must think it is strictly superior to only support float
>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>> touch on this, mostly just endorsing the idea of a vector.
>>>
>>>
>>> On 2 May 2023, at 20:20, Patrick McFadin <[email protected]> wrote:
>>>
>>> 
>>> A > B > C on both polls.
>>>
>>> Having talked to several users in the community that are highly excited
>>> about this change, this gets to what developers want to do at Cassandra
>>> scale: store embeddings and retrieve them.
>>>
>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <[email protected]>
>>> wrote:
>>>
>>>> A > B > C
>>>>
>>>> I don't think that ML is such a niche application that it can't have
>>>> its own CQL data type. Also, vectors are mathematical elements that have
>>>> more applications than ML.
>>>>
>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <[email protected]> wrote:
>>>>>
>>>>>> Should we add a vector type to Cassandra designed to meet the needs
>>>>>> of machine learning use cases, specifically feature and embedding vectors
>>>>>> for training, inference, and vector search?
>>>>>>
>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>>>> industry-leading special-purpose vector database ONLY supports that data
>>>>>> type.
>>>>>>
>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>> thread at
>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>
>>>>>> Please rank the discussed options from most preferred option to
>>>>>> least, e.g., A > B > C (A is my preference, followed by B, followed by C)
>>>>>> or C > B = A (C is my preference, followed by B or A approximately
>>>>>> equally.)
>>>>>>
>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>> believe we need to tie it to any particular implementation details.
>>>>>>
>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>> array types that compose with all Cassandra types first, and make
>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>
>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> A  > B > C
>>>>>
>>>>> B is stated as "must add array types…".  I think this is a bit
>>>>> loaded.  If B was the (A + the implementation needs to be a non-null
>>>>> frozen float32 array, serialisation forward compatible with other frozen
>>>>> arrays later implemented) I would put this before (A).  Especially
>>>>> because it's been shown already this is easy to implement.
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
