Re: [POLL] Vector type for ML

Benedict Thu, 04 May 2023 03:55:00 -0700

I would expect that the type of index would be specified anyway?

I don’t think it’s good API design to have the field define the index you create - only to shape what is permitted.

An HNSW index is very specific and should be asked for specifically, not implicitly, IMO.

On 4 May 2023, at 11:47, Mike Adamson <madam...@datastax.com> wrote:

For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if this is the only condition being applied. Alternatively for arrays we could default to NONNULL and later introduce NULLABLE if we want to permit nulls.

I have a small issue relating to not having a specific VECTOR tag on the data type. The driver behind adding this datatype is the HNSW index that is being added to consume this data. If we have a generic array datatype, what is the expectation going to be for users who create an index on it? The HNSW index will support only floats initially, so we would have to reject any non-float arrays if an attempt was made to create an HNSW index on them. While there is no problem with doing this, there would be a problem if, in the future, we allow indexing of arrays in the same way that we index collections. In this case we would then need to have the user select what type of index they want at creation time.
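[Editor's illustration, not part of the original message.] The rejection rule described above can be sketched in a few lines of Python. This is only a minimal model under stated assumptions: `ArrayType`, `InvalidIndexError`, and `validate_index` are hypothetical names invented for this sketch and do not correspond to any Cassandra API. It assumes the index kind is requested explicitly and that the column type only restricts which kinds are permitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical model of a fixed-length array column type, e.g. FLOAT[3]."""
    element: str  # element type name, e.g. "float" or "int"
    length: int   # fixed number of elements


class InvalidIndexError(ValueError):
    """Raised when the requested index kind is not permitted on the column type."""


def validate_index(column_type: ArrayType, index_kind: str) -> str:
    """Return index_kind if it is allowed on column_type, otherwise raise.

    The index kind is always given by the caller; the column type is only
    used to decide whether that kind is permitted.
    """
    if index_kind == "hnsw":
        # Assumption taken from the thread: HNSW supports only float arrays initially.
        if column_type.element != "float":
            raise InvalidIndexError(
                f"hnsw index requires a float array, got {column_type.element}[{column_type.length}]"
            )
        return index_kind
    raise InvalidIndexError(f"unknown index kind: {index_kind}")
```

With this model, `validate_index(ArrayType("float", 3), "hnsw")` is accepted, while the same call on `ArrayType("int", 3)` raises `InvalidIndexError`.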

Can I add another proposal: that we allow a VECTOR or DENSE keyword (DENSE is a well-known term in the ML space) that could be used when the array is going to be used for ML workloads? This would be optional and would function similarly to FROZEN in that it would limit the functionality of the array to ML usage.

On Thu, 4 May 2023 at 09:45, Benedict <bened...@apache.org> wrote:
Hurrah for initial agreement.

For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if this is the only condition being applied. Alternatively for arrays we could default to NONNULL and later introduce NULLABLE if we want to permit nulls.

If the word vector is to be used it makes more sense to make it look like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not redundant.

So, I vote:

1) (NON NULL) FLOAT[N]
2) FLOAT[N] (Non null by default)
3) VECTOR<FLOAT, N>

On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:

Did we agree on a CQL syntax?
I don’t believe there has been a poll on CQL syntax… my understanding from reading all the threads is that there are ~4-5 options and none are -1ed, so I believe we are waiting for majority rule on this?

Re-reading that thread, IIUC the valid choices remaining are…

1. VECTOR FLOAT[n]
2. FLOAT VECTOR[n]
3. VECTOR<FLOAT,n>
4. VECTOR[n]<FLOAT>
5. ARRAY<FLOAT, n>
6. NON-NULL FROZEN<FLOAT[n]>

Yes, I'm putting my preference (1) first ;) because (banging on) if the future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, then a VECTOR keyword that, for general CQL users, just means "non-null and frozen" gels best with them.

Options (5) and (6) are for those that feel we can and should provide this type without introducing the vector keyword.

--
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com
