Hi all,

Splitting this out per the suggestion in the initial VS thread so we can work on driver support in parallel with the server-side changes.

I propose adding a new data type for vector search indexes: `FLOAT VECTOR[N_DIMENSIONS]`.

In the initial commits and thread, this was `DENSE FLOAT32`. Nobody really loved that, so we considered a bunch of alternatives, including:

- `FLOAT[N]`: This minimal option resembles C and Java array syntax, which would make it familiar for many users. However, this syntax raises the question of why arrays cannot be created for other types. Additionally, the expectation for an array is to provide random access to its contents, which is not supported for vectors.
- `DENSE FLOAT[N]`: This option clarifies that we are supporting dense vectors, not sparse ones. However, since Lucene had sparse vector support in the past but removed it for lack of compelling use cases, it is unlikely that it will be added back, making the "DENSE" qualifier less relevant.
- `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the reasons mentioned above.
- `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less natural word order.
- `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again this would imply that random access is supported, which we want to avoid.
- `VECTOR[N]`: This syntax is not very clear about the vector's contents and could make it difficult to add other vector types, such as byte vectors (already supported by Lucene), in the future.

Finally, the original qualifier of 32 in `FLOAT32` was intended to allow consistency if we add other float types like FLOAT16 or FLOAT64, both of which are sometimes used in ML. However, we already have a CQL data type for a 64-bit float (`DOUBLE`), so it would make more sense to add future variants (which remain hypothetical at this point) along that line instead.

Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance of clarity, conciseness, and extensibility.
It is more natural in its word order than the original proposal and avoids unnecessary qualifiers, while still being clear about the data type it represents. This syntax is also straightforwardly extensible should we choose to support other vector types in the future.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced