Re: [POLL] Vector type for ML

Caleb Rackliffe Wed, 03 May 2023 13:06:03 -0700

Did we agree on a CQL syntax?

On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
rahul.xavier.si...@gmail.com> wrote:


> I like this approach. Thank you for those working on this vector search
> initiative.
>
> Here's some feedback with my "user" hat on, as someone who is looking at
> databases / indexes for my next LLM app.
>
> Can I take some Python code and go from using an in-memory vector store
> like NumPy or FAISS to something else? How easy is it for me to take my
> Python code and get it to work with this new external service, which is no
> longer just a library?
> There are also tons of services that I can run on Docker, e.g. Milvus,
> RediSearch, Typesense, Elasticsearch, OpenSearch, and I may hit a hurdle
> when trying to handle a lot more data, so I look at Cassandra Vector Search.
> Because I am familiar with SQL, Cassandra looks appealing since I can
> potentially use a "cql_agent" lib (to be created for langchain; we're
> looking into that now) or an existing CassandraVectorStore class?
>
> In most of these scenarios, if people are using langchain or llamaindex, the
> underlying implementation is not as important, since we shield the user from
> CQL data types except at schema creation, and most of these libs can be
> opinionated and just suggest a generic schema.
>
> The ideal world is if I can just dump text into a field and do a natural
> language query on it and have my DB do the embeddings for the document, and
> then for the query for me. For now the libs can manage all that and they do
> that well. We just need the interface to stay consistent and be relatively
> easy to query in CQL. The most popular index in LLM retrieval augmented
> patterns is Pinecone. You make an index, you upsert, and then you query.
> It's not assumed that you are also giving it content, though you can send
> it metadata to have the document there.
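
The index / upsert / query workflow described above can be sketched in a few lines of plain Python. This is only an illustrative in-memory stand-in, not any real service's API: the class name InMemoryVectorStore and its methods are hypothetical, and cosine similarity is an assumed scoring function.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical toy index: upsert vectors by id, query by similarity."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # item_id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        # Insert or overwrite; every vector must match the index dimension.
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")
        self._rows[item_id] = (list(vector), metadata or {})

    def query(self, vector, top_k=3):
        # Brute-force scan: score every stored vector, return the best top_k.
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")
        scored = [
            (_cosine(vector, stored), item_id, meta)
            for item_id, (stored, meta) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore(dimension=3)
store.upsert("doc-a", [1.0, 0.0, 0.0], {"title": "first"})
store.upsert("doc-b", [0.0, 1.0, 0.0], {"title": "second"})
best = store.query([0.9, 0.1, 0.0], top_k=1)
```

Moving from a library like this to an external database would mean keeping the same three calls and swapping what sits behind them.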
>
> If we can have a similar workflow, e.g. create a table with a vector type
> OR create a table with an existing type and then add an index to it, no one
> is going to lose sleep over it as long as it works. Taking a table that
> already has data and adding a vector index to it is no different from
> adding a new field, since I've got to calculate the embeddings anyway.
>
> Would love to see what the CQL ends up looking like.
> Rahul Singh
>
>
> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>
>> \o/
>>
>> Bring it in team. Group hug.
>>
>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>> is the only distributed database where you can do vector search in an
>> ACID transaction.
>>
>> Patrick
>>
>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>>> I had a call with David.  We agreed that we want a "vector" data type
>>> with these properties
>>>
>>> - Fixed length
>>> - No nulls
>>> - Random access not supported
>>>
>>> Where we disagreed was on my proposal to restrict vectors to only
>>> numeric data.  David's points were that
>>>
>>> (1) He has a use case today for a data type with the other vector
>>> properties,
>>> (2) It doesn't seem reasonable to create two data types with the same
>>> properties, one of which is restricted to numerics, and
>>> (3) The restrictions that I want for numeric vectors make more sense at
>>> the index and function level, than at the type level.
>>>
>>> I'm ready to concede that David has the better case here and move
>>> forward with a vector implementation without that restriction.
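
A minimal sketch of the design agreed above: the type itself is general over its element type, enforces fixed length and no nulls, and the float-only restriction lives in the index check. This is Python for illustration, not the actual patch; VectorType, ELEMENT_FORMATS, validate_ann_index, and the big-endian encoding are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass

# Hypothetical element codecs: element type name -> big-endian struct format.
ELEMENT_FORMATS = {"float": ">f", "double": ">d", "int": ">i", "bigint": ">q"}


@dataclass(frozen=True)
class VectorType:
    """Fixed-length, no-nulls vector over any supported element type."""

    element_type: str
    dimension: int

    def serialize(self, values):
        if len(values) != self.dimension:
            raise ValueError(f"expected {self.dimension} elements, got {len(values)}")
        if any(v is None for v in values):
            raise ValueError("vectors may not contain nulls")
        fmt = ELEMENT_FORMATS[self.element_type]
        # Elements are packed back to back; there is no per-element header,
        # so the value is only ever read as a whole (no random access).
        return b"".join(struct.pack(fmt, v) for v in values)

    def deserialize(self, data):
        fmt = ELEMENT_FORMATS[self.element_type]
        size = struct.calcsize(fmt)
        if len(data) != size * self.dimension:
            raise ValueError("buffer length does not match the vector dimension")
        return [struct.unpack_from(fmt, data, i * size)[0] for i in range(self.dimension)]


def validate_ann_index(column_type):
    """Index-level restriction: the type is general, the ANN index is float-only."""
    if not isinstance(column_type, VectorType) or column_type.element_type != "float":
        raise TypeError("ANN index requires a vector of float")


floats = VectorType("float", 3)
round_trip = floats.deserialize(floats.serialize([1.0, 2.0, 0.5]))
validate_ann_index(floats)  # accepted; VectorType("int", 3) would raise TypeError
```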
>>>
>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com> wrote:
>>>
>>>>  How about it, David? Did you already make this?
>>>>
>>>>
>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the
>>>> impact to the POC patch was the following:
>>>>
>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>> type.composeForFloat(bb); both return float[]
>>>> 2) change the index validate logic to move away from checking only for
>>>> VectorType and instead check for that plus the element type == FloatType.
>>>> I didn’t bother to do this as it’s trivial
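
As a rough illustration of what a composeForFloat-style helper does, here is a Python sketch that decodes a whole buffer into floats in one call. It is not the code from the patch: the function name compose_for_float, the explicit dimension argument, and the big-endian float32 layout are assumptions for the example.

```python
import struct


def compose_for_float(buf, dimension):
    """Decode `dimension` big-endian float32 values from `buf` into a list."""
    expected = 4 * dimension  # float32 is 4 bytes per element
    if len(buf) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(buf)}")
    # One unpack call for the whole vector, mirroring "both return float[]".
    return list(struct.unpack(f">{dimension}f", buf))


encoded = struct.pack(">3f", 1.0, 2.0, 0.5)
decoded = compose_for_float(encoded, 3)
```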
>>>>
>>>> David. End this argument. SHOW THE CODE!
>>>>
>>>>
>>>> If this argument ends and people are cool with vector supporting
>>>> abstract type, more than glad to help get this in.
>>>>
>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
>>>> wrote:
>>>>
>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>> original idea has a very very specific use case.  Do we have use cases for
>>>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>>>> wondered if generalizing provides value, beyond being straightforward to
>>>> implement.  I'm just trying to be sensitive to the database code
>>>> maintenance and driver support for general types versus a single type for a
>>>> specific, well defined purpose.
>>>>
>>>> If it could easily be a plugin, that's great - but the full picture
>>>> involves drivers that need to support it or you end up getting binary blobs
>>>> you have to decode client side and then do stuff with.  So ideally if you
>>>> have a well defined use case that you can build into the database, having
>>>> it just be part of the database and associated drivers - that makes the
>>>> experience much much better.
>>>>
>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't
>>>> be feasible.  I'm just trying to enlarge the picture a bit to see what that
>>>> means for this use case and for the supporting drivers/clients.
>>>>
>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>>>
>>>> But it’s so trivial it was already implemented by David in the span of
>>>> ten minutes? If anything, we’re slowing progress down by refusing to do the
>>>> extra types, as we’re busy arguing about it rather than delivering a
>>>> feature?
>>>>
>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
>>>> support types beyond float. Not that we should start with float.
>>>>
>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>
>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>
>>>> 
>>>> I'll speak up on that one. If you look at my ranked voting, that is
>>>> where my head is. I get accused of scope creep (a lot), and looking at the
>>>> initial proposal Jonathan put on the ML, it was mostly "Developers are
>>>> adopting vector search at a furious pace and I think I have a simple way of
>>>> adding support to keep Cassandra relevant for these use cases." Instead of
>>>> just focusing on this use case, I feel the arguments have bikeshedded into
>>>> scope creep, which means it will take forever to get into the project.
>>>>
>>>> My preference is to see one thing validated with an MVP and get it into
>>>> the hands of developers sooner so we can continue to iterate based on
>>>> actual usage.
>>>>
>>>> That doesn't mean your points are wrong or your opinions are broken; I'm
>>>> voting for what I think will be awesome for users sooner.
>>>>
>>>> Patrick
>>>>
>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>>
>>>>> Could folk voting against a general purpose type (that could well be
>>>>> called a vector) briefly explain their reasoning?
>>>>>
>>>>> We established in the other thread that it’s technically trivial,
>>>>> meaning folk must think it is strictly superior to only support float
>>>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>>>
>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>>>> touch on this, mostly just endorsing the idea of a vector.
>>>>>
>>>>>
>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>
>>>>> 
>>>>> A > B > C on both polls.
>>>>>
>>>>> Having talked to several users in the community that are highly
>>>>> excited about this change, this gets to what developers want to do at
>>>>> Cassandra scale: store embeddings and retrieve them.
>>>>>
>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <
>>>>> adelap...@apache.org> wrote:
>>>>>
>>>>>> A > B > C
>>>>>>
>>>>>> I don't think that ML is such a niche application that it can't have
>>>>>> its own CQL data type. Also, vectors are mathematical elements that have
>>>>>> more applications than ML.
>>>>>>
>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Should we add a vector type to Cassandra designed to meet the needs
>>>>>>>> of machine learning use cases, specifically feature and embedding
>>>>>>>> vectors for training, inference, and vector search?
>>>>>>>>
>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>>> data type.
>>>>>>>>
>>>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>>>> thread at
>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>
>>>>>>>> Please rank the discussed options from most preferred option to
>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B, followed by
>>>>>>>> C) or C > B = A (C is my preference, followed by B or A approximately
>>>>>>>> equally.)
>>>>>>>>
>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>>>> believe we need to tie it to any particular implementation details.
>>>>>>>>
>>>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>>>> array types that compose with all Cassandra types first, and make
>>>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>>>
>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> A  > B > C
>>>>>>>
>>>>>>> B is stated as "must add array types…".  I think this is a bit
>>>>>>> loaded.  If B were (A + the implementation needs to be a non-null
>>>>>>> frozen float32 array, serialisation forward compatible with other
>>>>>>> frozen arrays later implemented), I would put this before (A).
>>>>>>> Especially because it's been shown already that this is easy to
>>>>>>> implement.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>
