[DISCUSS] New data type for vector search

Jonathan Ellis Wed, 26 Apr 2023 07:32:26 -0700

Hi all,

Splitting this out per the suggestion in the initial VS thread so we can
work on driver support in parallel with the server-side changes.


I propose adding a new data type for vector search indexes:

FLOAT VECTOR[N_DIMENSIONS]

In the initial commits and thread, this was DENSE FLOAT32. Nobody really
loved that, so we considered a bunch of alternatives, including:

- `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
would make it familiar for many users. However, this syntax raises the
question of why arrays cannot be created for other types.  Additionally,
the expectation for an array is to provide random access to its contents,
which is not supported for vectors.
- `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
vectors, not sparse ones. However, since Lucene had sparse vector support
in the past but removed it for lack of compelling use cases, it is unlikely
that it will be added back, making the "DENSE" qualifier less relevant.
- `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
reasons mentioned above.
- `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a
less natural word order.
- `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
this would imply that random access is supported, which we want to avoid.
- `VECTOR[N]`: This syntax is not very clear about the vector's contents
and could make it difficult to add other vector types, such as byte vectors
(already supported by Lucene), in the future.

Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
consistency if we add other float types like FLOAT16 or FLOAT64, both of
which are sometimes used in ML. However, we already have a CQL data type
for a 64-bit float (`DOUBLE`), so it would make more sense to add future
variants (which remain hypothetical at this point) along that line instead.

Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best
balance of clarity, conciseness, and extensibility. It is more natural in
its word order than the original proposal and avoids unnecessary
qualifiers, while still being clear about the data type it represents.
Finally, this syntax is straightforwardly extensible should we choose to
support other vector types in the future.
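
For anyone starting on driver support, here is a minimal sketch of how a
client might recognize the proposed type string. It assumes the type is
spelled as an element type, the keyword VECTOR, and a bracketed dimension
count. The names `VectorType` and `parse_vector_type` are hypothetical and
do not come from the server or any existing driver.

```python
import re
from typing import NamedTuple, Optional


class VectorType(NamedTuple):
    """Hypothetical client-side description of a vector column type."""
    element_type: str
    dimensions: int


# Assumed shape: <ELEMENT_TYPE> VECTOR[<N>], case-insensitive.
_VECTOR_RE = re.compile(
    r"^\s*(?P<elem>[A-Za-z]+)\s+VECTOR\s*\[\s*(?P<dims>\d+)\s*\]\s*$",
    re.IGNORECASE,
)


def parse_vector_type(type_string: str) -> Optional[VectorType]:
    """Return a VectorType for strings like 'FLOAT VECTOR[3]', else None."""
    match = _VECTOR_RE.match(type_string)
    if match is None:
        return None
    dims = int(match.group("dims"))
    if dims < 1:
        raise ValueError("vector dimension count must be positive")
    return VectorType(match.group("elem").upper(), dims)
```

Because the element type comes first, a string such as `BYTE VECTOR[8]`
would parse the same way, which is the kind of extensibility argued for
above. The rejected spellings (`VECTOR[N]`, `VECTOR FLOAT[N]`, and so on)
return None in this sketch.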

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
