To make sure I understand correctly -- are you saying that you're fine with a vector type, but you want to see it implemented as a special case of arrays, or that you are not fine with a vector type because you would prefer to only add arrays and that should be "good enough" for ML?
On Mon, May 1, 2023 at 4:27 PM Benedict <bened...@apache.org> wrote: > A data type plug-in is actually really easy today, I think? But, > developing further hooks should probably be thought through as they’re > necessary. > > I think in this case it would be simpler to deliver a general purpose > type, which is why I’m trying to propose types that would be acceptable. > > I also think we’re pretty close to agreement, really? > > But if not, let’s flesh out potential plug-in requirements. > > > On 1 May 2023, at 21:58, Josh McKenzie <jmcken...@apache.org> wrote: > > > > If we want to make an ML-specific data type, it should be in an ML plug-in. > > How can we encourage a healthier plug-in ecosystem? As far as I know it's > been pretty anemic historically: > > cassandra: > https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html > postgres: https://www.postgresql.org/docs/current/contrib.html > > I'm really interested to hear if there's more in the ecosystem I'm not > aware of or if there's been strides made in this regard; users in the > ecosystem being able to write durable extensions to Cassandra that they can > then distribute and gain momentum could potentially be a great incubator > for new features or functionality in the ecosystem. > > If our support for extensions remains as bare as I believe it to be, I > wouldn't recommend anyone go that route. > > On Mon, May 1, 2023, at 4:17 PM, Benedict wrote: > > > I have explained repeatedly why I am opposed to ML-specific data types. If > we want to make an ML-specific data type, it should be in an ML plug-in. We > should not pollute the general purpose language with hastily-considered > features that target specific bandwagons - at best partially - no matter > how exciting the bandwagon. > > I think a simple and easy case can be made for fixed length array types > that do not seem to create random bits of cruft in the language that dangle > by themselves should this play not pan out. This is an easy way for this > effort to make progress without negatively impacting the language. > > That is, unless we want to start supporting totally random types for every > use case at the top level language layer. I don’t think this is a good > idea, personally, and I’m quite confident we would now be regretting this > approach had it been taken for earlier bandwagons. > > Nor do I think anyone’s priors about how successful this effort will be > should matter. As a matter of principle, we should simply never deliver a > specialist functionality as a high level CQL language feature without at > least baking it for several years as a plug-in. > > On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote: > > > > Yes! What you (David) and Benedict write beautifully supports `VECTOR > FLOAT[n]` imho. > > You are definitely bringing up valid implementation details, and that can > be dealt with during patch review. This thread is about the CQL API > addition. > > No matter which way the technical review goes with the implementation > details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML > idiomatic approach and the best long-term CQL API. It's a win-win > situation – no matter how you look at it imho it is the best solution api > wise. > > Unless the suggestion is that an ideal implementation can give us a better > CQL API – but I don't see what that could be. Maybe the suggestion is we > deny the possibility of using the VECTOR keyword and bring us back to > something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because > `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the > patch's audience and their idioms. I have no problems with introducing > such an alias to meet the ML crowd. > > Another way I think of this is > `VECTOR FLOAT[n]` is the porcelain ML cql api, > `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the > general-use plumbing cql apis. > > This would allow implementation details to be moved out of this thread and > to the review phase. > > > > > On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote: > > > I think it is totally reasonable that the ANN patch (and Jonathan) is > not asked to implement on top of, or towards, other array (or other) new > data types. > > > This impacts serialization, if you do not think about this day 1 you then > can’t add later on without having to worry about migration and versioning… > > Honestly I wanted to better understand the cost to be generic and the > impact to ANN, so I took > https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java > and made it handle every requirement I have listed so far (size, null, all > types)… the current patch has several bugs at the type level that would > need to be fixed, so had to fix those as well…. Total time to do this was > 10 minutes… and this includes adding a method "public float[] > composeAsFloats(ByteBuffer bytes)” which made the change to existing logic > small (change VectorType.Serializer.instance.deserialize(buffer) to > type.composeAsFloats(buffer))…. > > Did this have any impact to the final ByteBuffer? Nope, it had identical > layout for the FloatType case, but works for all types…. I didn’t change > the fact we store the size (felt this could be removed, but then we could > never support expanding the vector in the future…) > > So, given the fact it takes a few minutes to implement all these > requirements, I do find it very reasonable to push back and say we should > make sure the new type is not leaking details from a special ANN index…. We > have spent more time debating this than it takes to support… we also have > fuzz testing on trunk so just updating > org.apache.cassandra.utils.AbstractTypeGenerators to know about this new > type means we get type coverage as well… > > I have zero issues helping to review this patch and make sure the testing > is on-par with existing types (this is a strong requirement for me) > > > > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote: > > > > > > > But suggesting that Jonathan should work on implementing general > purpose arrays seems to fall outside the scope of this discussion, since > the result of such work wouldn't even fill the need Jonathan is targeting > for here. > > > > Every comment I have made so far I have argued that the v1 work doesn’t > need to do some things, but that the limitations proposed so far are not > real requirements; there is a big difference between what “could be > allowed” and what is implemented day one… I am pushing back on what “could > be allowed”, so far every justification has been that it slows down the ANN > work… > > > > Simple examples of this already exists in C* (every example could be > enhanced logically, we just have yet to put in the work) > > > > * updating a element of a list is only allowed for multi-cell > > * appending to a list is only allowed for multi-cell > > * etc. > > > > By saying that the type "shall not support", you actively block future > work and future possibilities... > > > > > > > > I am coming around strongly to the `VECTOR FLOAT[n]` option. > > > > This gives Jonathan the simplest path right now with ths ANN work, while > also ensuring the CQL API gets the best future potential. > > > > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null > and frozen, and that allows both today and in the future, as desired, for > its implementation to be entirely different to `FLOAT[n]`. This addresses > a number of people's concerns that we meet ML's idioms head on. > > > > IMHO it feels like it will fit into the ideal future CQL , where all > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and > maybe VECTOR BYTE[n]). This will also permit in the future > `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays. > > > > I think it is totally reasonable that the ANN patch (and Jonathan) is > not asked to implement on top of, or towards, other array (or other) new > data types. > > > > I also think it is correct that we think about the evolution of CQL's > API, and how it might exist in the future when we have both ml vectors and > general use arrays. > > > -- Jonathan Ellis co-founder, http://www.datastax.com @spyced