If we agree we’re delivering some general purpose array type that supports all types as elements (i.e., is logically equivalent to a frozen list of fixed length, however it is actually implemented), I think we are in technical agreement and it’s just a matter of presentation. At which point I think we
To make sure I understand correctly -- are you saying that you're fine with
a vector type, but you want to see it implemented as a special case of
arrays, or that you are not fine with a vector type because you would
prefer to only add arrays and that should be "good enough" for ML?
On Mon, May 1,
I have no problem with `VECTOR` hanging around forever as an alias for
`NON-NULL FROZEN`. Even without ANN, it makes sense and will stick with
new C* users.
A plug-in system would be great, but it shouldn't hold back this work imho.
On Mon, 1 May 2023 at 22:17, Benedict wrote:
> I have expla
Yes. Plugging in a new type server side is very easy. Adding that type to every client is not. Cassandra already supports plugging in custom types through a jar. What a given client does when encountering a custom type it doesn’t know about depends on the client. I was recently looking at this for D
> A data type plug-in is actually really easy today, I think?
Sadly not, the client reads the class from our schema tables and has to have
duplicate logic to serialize/deserialize results… types are easy to add if you
are ok with client not understanding them (and will some clients fail due to
A data type plug-in is actually really easy today, I think? But, developing further hooks should probably be thought through as they’re necessary. I think in this case it would be simpler to deliver a general purpose type, which is why I’m trying to propose types that would be acceptable. I also thi
> If we want to make an ML-specific data type, it should be in an ML plug-in.
How can we encourage a healthier plug-in ecosystem? As far as I know it's been
pretty anemic historically:
cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
postgres: https://www.postgresql
> I think a simple and easy case can be made for fixed length array types that
> do not seem to create random bits of cruft in the language that dangle by
> themselves should this play not pan out.
If I am understanding you correctly, then a "VECTOR FLOAT[n]" is fine as it's an
array type but ha
I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how
Yes! What you (David) and Benedict write beautifully supports `VECTOR
FLOAT[n]` imho.
You are definitely bringing up valid implementation details, and that can
be dealt with during patch review. This thread is about the CQL API
addition.
No matter which way the technical review goes with the imp
> I think it is totally reasonable that the ANN patch (and Jonathan) is not
> asked to implement on top of, or towards, other array (or other) new data
> types.
This impacts serialization, if you do not think about this day 1 you then can’t
add later on without having to worry about migration
Has anybody yet claimed it would be hard? Several folk seem ready to jump to
the conclusion that this would be onerous, but as somebody with a good
understanding of the storage layer I can assert with reasonable confidence that
it would not be. As previously stated, the implementation largely al
>
>
> > But suggesting that Jonathan should work on implementing general purpose
> arrays seems to fall outside the scope of this discussion, since the result
> of such work wouldn't even fill the need Jonathan is targeting for here.
>
> Every comment I have made so far I have argued that the v1 wo
> In particular it makes no sense at all from an ML perspective to have vector
> types of anything other than numerics
Back to what Benedict was saying, if the proposal was an ML plug-in, then this
limitation makes sense, but that is not the proposal at hand. If you wish to
change the scope to
By my superficial reading I get the impression that the main distinction is
that vectors don't need to support random access into a single
element/float. I haven't looked at what Jonathan is doing, but I assume,
and it seems Jonathan assumes or knows that this makes implementation both
easier and a
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
+1 on this con
I and others have claimed that an array concept will work, since it is isomorphic with a vector. I have seen the following counterclaims:
1. Vectors don’t need to support index lookups
2. Vectors don’t need to support ordered indexes
3. Vectors don’t need to support other types besides float
None of th
Benedict, I don't quite see why that matters? The argument is merely that
this kind of vector, for this use case, a) is different from arrays, and b)
arrays apparently don't serve the use case well enough (or at all).
Now, if from the above it follows a discussion that a vector type cannot be
a fi
pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations.
On 28 Apr 2023, at 16:58, Jonathan Ellis wrote:
I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be.
While it's true that it would b
I'm proposing a vector data type for ML use cases. It's not the same thing
as an array or a list and it's not supposed to be.
While it's true that it would be possible to build a vector type on top of
an array type, it's not necessary to do it that way, and given the lack of
interest in an array
But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.
On 28 Apr
That's exactly right.
In particular it makes no sense at all from an ML perspective to have
vector types of anything other than numerics. And as I mentioned in the
POC thread (but I did not mention here), float is overwhelmingly the most
frequently used vector type, to the point that Pinecone (by
This feature may be targeting ML users but it isn’t part of some “ML plug-in” it’s a general purpose type available to all users that happens to permit the use of ANN. So it needs to make sense in a general context, not just to ML users. I also doubt users will struggle with understanding an array o
+1
On Thursday, April 27, 2023 at 07:36:19 PM PDT, Caleb Rackliffe wrote:
I don’t have a lot to add here, other than to say I’m broadly in agreement w/
David on syntax preference, element selectability, and making this a new type
that roughly corresponds to a primitive (non-null-allow
I don’t have a lot to add here, other than to say I’m broadly in agreement w/ David on syntax preference, element selectability, and making this a new type that roughly corresponds to a primitive (non-null-allowing) array.
On Apr 27, 2023, at 9:18 PM, Anthony Grasso wrote:
It would be strange for t
It would be strange for this declaration to look different from other
collection types. We may want to reconsider using the collection syntax. I
also like the idea of the vector dimensions being declared with the VECTOR
keyword. An alternative syntax option to explore is:
VECTOR[size]
On Fri, 28
> From a machine learning perspective, vectors are a well-known concept that are
> effectively immutable fixed-length n-dimensional values that are then later
> used either as part of a model or in conjunction with a model after the fact.
While we could have this be non-frozen and not call it a vec
> but as you point out it has the problem of allowing nulls.
If nulls are not allowed for the elements, then either we need a) a new type,
or b) add some way to say elements may not be null…. As much as I do like b, I
am leaning towards new type for this use case.
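
To make the null question concrete, here is a minimal Python sketch of why a non-null, fixed-length float type is convenient to serialize: every element is present, so a value is exactly 4 * N bytes with no per-element length or null marker. The function names and the big-endian float32 layout are assumptions for illustration only; this is not the format used by the patch or by any Cassandra driver.

```python
import struct


def encode_vector(values, dimensions):
    """Pack a fixed-length, non-null float vector into 4 * dimensions bytes.

    Hypothetical helper: the layout (big-endian float32, no length prefix)
    is an assumption, not Cassandra's wire format.
    """
    if len(values) != dimensions:
        raise ValueError(f"expected {dimensions} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("vector elements may not be null")
    return struct.pack(f">{dimensions}f", *values)


def decode_vector(data, dimensions):
    """Unpack bytes produced by encode_vector back into a list of floats."""
    if len(data) != 4 * dimensions:
        raise ValueError(f"expected {4 * dimensions} bytes, got {len(data)}")
    return list(struct.unpack(f">{dimensions}f", data))


encoded = encode_vector([1.0, 2.0, 3.0], 3)
decoded = decode_vector(encoded, 3)
```

A list or tuple that permits nulls cannot use this layout directly, because it needs some way to mark a missing element, which is one way to read the case for a separate type (or a non-null constraint) made above.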
So, to flesh out the type req
That’s a bounded ring buffer, not a fixed length array. This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations. Whether we draw any semantic link betw
On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:
> It's been a while, so I may be missing something, but do we already have
> fixed-size lists? If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>
We do not. The proposal got closed as wont-fix
https://issues.ap
It's been a while, so I may be missing something, but do we already have
fixed-size lists? If not, I don't see why we'd try to make this fit into a
List-shaped problem.
A tuple would be a better fit from that perspective, but as you point out
it has the problem of allowing nulls.
The key thing a
If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
tuples are more convenient than lists. So FLOAT[N] could be equivalent to
TUPLE.
Differently to collections, tuples have a fixed size, they are always
frozen and I think they don't support random access. These properties see
>
> My inclination then would be to say you declare an ARRAY (which
> is semantic sugar for FROZEN>). This is very consistent with
> our existing style. We then simply permit such columns to define ANN
> indexes.
>
So long as nulls aren't a problem as David questions, an alternative is:
FLOAT[N
Benedict's comments also make me question: can any of the values in the vector
be null? The patch sent works with float arrays, so null isn’t possible… is
null not valid for a vector type? If so, this would help justify why a
vector is not an array or a list (both allow null)
> On Apr 26, 2023,
Thanks for starting this thread!
> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
> loved that, so we considered a bunch of alternatives, including
>
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
> would make it familiar for many users. H
I think we need to briefly step back and think about what the syntax means and how it fits into existing syntax. It seems that the dimensionality verbiage assumes we’re logically introducing N vector fields, so that each row adopts a value for all of the vector fields or none. But in practice we ar
Hi all,
Splitting this out per the suggestion in the initial VS thread so we can
work on driver support in parallel with the server-side changes.
I propose adding a new data type for vector search indexes:
FLOAT VECTOR[N_DIMENSIONS]
In the initial commits and thread, this was DENSE FLOAT32. Nob