Re: [DISCUSS] New data type for vector search

Jonathan Ellis Tue, 02 May 2023 04:43:16 -0700

To make sure I understand correctly -- are you saying that you're fine with
a vector type, but you want to see it implemented as a special case of
arrays, or that you are not fine with a vector type because you would
prefer to only add arrays and that should be "good enough" for ML?


On Mon, May 1, 2023 at 4:27 PM Benedict <bened...@apache.org> wrote:

> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
>
> On 1 May 2023, at 21:58, Josh McKenzie <jmcken...@apache.org> wrote:
>
> 
>
> If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN<FLOAT[n]>`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> need to be fixed, so had to fix those as well…. Total time to do this was
> 10 minutes… and this includes adding a method "public float[]
> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
> small (change VectorType.Serializer.instance.deserialize(buffer) to
> type.composeAsFloats(buffer))….
>
> Did this have any impact to the final ByteBuffer?  Nope, it had identical
> layout for the FloatType case, but works for all types…. I didn’t change
> the fact we store the size (felt this could be removed, but then we could
> never support expanding the vector in the future…)
>
> So, given the fact it takes a few minutes to implement all these
> requirements, I do find it very reasonable to push back and say we should
> make sure the new type is not leaking details from a special ANN index…. We
> have spent more time debating this than it takes to support… we also have
> fuzz testing on trunk so just updating
> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
> type means we get type coverage as well…
>
> I have zero issues helping to review this patch and make sure the testing
> is on-par with existing types (this is a strong requirement for me)
>
>
> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
> >
> >
> > > But suggesting that Jonathan should work on implementing general
> purpose arrays seems to fall outside the scope of this discussion, since
> the result of such work wouldn't even fill the need Jonathan is targeting
> for here.
> >
> > Every comment I have made so far I have argued that the v1 work doesn’t
> need to do some things, but that the limitations proposed so far are not
> real requirements; there is a big difference between what “could be
> allowed” and what is implemented day one… I am pushing back on what “could
> be allowed”, so far every justification has been that it slows down the ANN
> work…
> >
> > Simple examples of this already exists in C* (every example could be
> enhanced logically, we just have yet to put in the work)
> >
> > * updating a element of a list is only allowed for multi-cell
> > * appending to a list is only allowed for multi-cell
> > * etc.
> >
> > By saying that the type "shall not support", you actively block future
> work and future possibilities...
> >
> >
> >
> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
> >
> > This gives Jonathan the simplest path right now with ths ANN work, while
> also ensuring the CQL API gets the best future potential.
> >
> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null
> and frozen, and that allows both today and in the future, as desired, for
> its implementation to be entirely different to `FLOAT[n]`.  This addresses
> a number of people's concerns that we meet ML's idioms head on.
> >
> > IMHO it feels like it will fit into the ideal future CQL , where all
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and
> maybe VECTOR BYTE[n]). This will also permit in the future
> `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
> >
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
> >
> > I also think it is correct that we think about the evolution of CQL's
> API,  and how it might exist in the future when we have both ml vectors and
> general use arrays.
>
>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] New data type for vector search

Reply via email to