Re: [POLL] Vector type for ML

David Capwell Fri, 05 May 2023 09:43:15 -0700

Went through and created a spreadsheet of the current votes. For Patrick and Mike, 
I don’t see a clear vote, so I put a ? where I “think” your preference is. For 
Mick, I only put one vote as the list looked like a summary, but you mentioned 
the first was your preference.


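Syntax                        | Jonathan | David   | Josh     | Caleb     | Patrick | Brandon  | Mike    | Benedict | Mick
                              | Ellis    | Capwell | McKenzie | Rackliffe | McFadin | Williams | Adamson |          | Semb Wever
------------------------------+----------+---------+----------+-----------+---------+----------+---------+----------+-----------
VECTOR<type, dimension>       | 1        | 2       | 2        |           |         | 1        | ?       | 3        |
DENSE VECTOR<type, dimension> | 2        | 1       |          |           | ?       |          | ?       |          |
type[dimension]               | 3        | 3       | 3        | 1         |         | 3        |         | 2        |
DENSE_VECTOR<type, dimension> |          |         | 1        |           |         |          |         |          |
NON NULL <type>[dimension]    |          | 1       |          |           |         |          |         | 1        |
VECTOR type[n]                |          |         |          |           |         | 2        |         |          | 1
ARRAY<type, n>                |          |         |          |           |         |          |         |          |
NON-NULL FROZEN<type[n]>      |          |         |          |           |         |          |         |          |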

1 = top pick
2 = second pick
3 = third pick

Let me know if I am missing anyone, or if I have bad data
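
For anyone who wants to re-tally as votes change, here is a minimal sketch of one way to score the table. The 3/2/1 point weighting and the names VOTES, POINTS and tally are assumptions made for illustration; nobody in this thread has proposed a scoring rule, so treat the ordering it prints as a rough aid, not a result. Cells marked ? and blank cells are left out.

```python
# Votes transcribed from the table above: option -> {voter: rank}.
# "?" cells (unclear preference) and blank cells are intentionally omitted.
VOTES = {
    "VECTOR<type, dimension>": {
        "Jonathan": 1, "David": 2, "Josh": 2, "Brandon": 1, "Benedict": 3,
    },
    "DENSE VECTOR<type, dimension>": {"Jonathan": 2, "David": 1},
    "type[dimension]": {
        "Jonathan": 3, "David": 3, "Josh": 3, "Caleb": 1, "Brandon": 3, "Benedict": 2,
    },
    "DENSE_VECTOR<type, dimension>": {"Josh": 1},
    "NON NULL <type>[dimension]": {"David": 1, "Benedict": 1},
    "VECTOR type[n]": {"Brandon": 2, "Mick": 1},
    "ARRAY<type, n>": {},
    "NON-NULL FROZEN<type[n]>": {},
}

# Hypothetical weighting (an assumption, not from the thread):
# top pick = 3 points, second pick = 2, third pick = 1.
POINTS = {1: 3, 2: 2, 3: 1}


def tally(votes):
    """Return (option, top_pick_count, weighted_score) rows, best score first."""
    rows = []
    for option, ranks in votes.items():
        top_picks = sum(1 for rank in ranks.values() if rank == 1)
        score = sum(POINTS[rank] for rank in ranks.values())
        rows.append((option, top_picks, score))
    # Sort by score, then by number of top picks; ties keep table order.
    return sorted(rows, key=lambda row: (-row[2], -row[1]))


if __name__ == "__main__":
    for option, top_picks, score in tally(VOTES):
        print(f"{option:32} top picks: {top_picks}  score: {score}")
```

Swapping POINTS for a different weighting, or counting only top picks, is a one-line change if the list settles on another rule.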

> On May 5, 2023, at 9:23 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> +10 for not inflicting unwieldy keywords on ML users.
> 
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE 
> keyword is that I don't see a foreseeable future where we also support sparse 
> vectors, so it would end up being unnecessary extra verbosity.  So my 
> preference would be
> 
> 1. VECTOR<type, dimension>
> 2. DENSE VECTOR<type, dimension> (space instead of underscore, SQL isn't 
> afraid of spaces)
> 3. type[dimension]
> 
> On Fri, May 5, 2023 at 10:54 AM Patrick McFadin <pmcfa...@gmail.com 
> <mailto:pmcfa...@gmail.com>> wrote:
>> I hope we are willing to consider developers that use our system because if 
>> I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the 
>> response would be:
>> 
>> Did you tell me to go write a distributed map-reduce job in Erlang? I 
>> believe I did, Bob.  
>> 
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org 
>> <mailto:jmcken...@apache.org>> wrote:
>>> Idiomatically, to my mind, there's a question of "what space are we 
>>> thinking about this datatype in"?
>>> 
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or 
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>> 
>>> Given many models are exploring quantizing to int8 and other data types, 
>>> there's definitely the "support other data types easily in the future" 
>>> piece to me we need to keep in mind.
>>> 
>>> So with the above and the "meet the user where they are and don't make them 
>>> understand more of Cassandra than absolutely critical to use it", I lean:
>>> 
>>> 1. DENSE_VECTOR<type, dimension>
>>> 2. VECTOR<type, dimension>
>>> 3. type[dimension]
>>> 
>>> This leaves the path open for us to expand on it in the future with sparse 
>>> support and allows us to introduce some semantics that indicate idioms 
>>> around nullability for the users coming from a different space.
>>> 
>>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires 
>>> understanding idioms of how Cassandra thinks about data (nulls mean 
>>> different things to us, we have differences between frozen and non-frozen 
>>> due to constraints in our storage engine and materialization of data, etc) 
>>> that get in the way of users doing things in the pattern they're familiar 
>>> with without learning more about the DB than they're probably looking to 
>>> learn. Historically this has been a challenge for us in adoption; the 
>>> classic "Why can't I just write and delete and write as much as I want? Why 
>>> are deletes filling up my disk?" problem comes to mind.
>>> 
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN<TYPE[n]>
>>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>>> 
>>> If getting into the "built-in syntactic sugar mapping for communities and 
>>> specific use-cases" is something we're willing to consider.
>>> 
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>> I think we are still discussing implementation here when I'm talking about 
>>>> developer experience. I want developers to adopt this quickly, easily and 
>>>> be successful. Vector search is already a thing. People use it every day. 
>>>> A successful outcome, in my view, is developers picking up this feature 
>>>> without reading a manual. (Because they don't anyway and get in trouble) I 
>>>> did some more extensive research about what other DBs are using for 
>>>> syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>> 
>>>> Pinecone[1] - dense_vector, sparse_vector
>>>> Elastic[2]: dense_vector
>>>> Milvus[3]: float_vector, binary_vector
>>>> pgvector[4]: vector
>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>> 
>>>> Based on that I'm advocating a similar syntax:
>>>> 
>>>> - DENSE VECTOR
>>>> or
>>>> - VECTOR
>>>> 
>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>> [2] 
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>> [3] https://milvus.io/docs/create_collection.md
>>>> [4] https://github.com/pgvector/pgvector
>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>> 
>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com 
>>>> <mailto:madam...@datastax.com>> wrote:
>>>> "Then we can have the indexing apparatus only accept frozen<float[n]> for 
>>>> the HNSW case."
>>>> I'm inclined to agree with Benedict that the index will need to be 
>>>> specifically selected by option rather than inferred based on type. As such 
>>>> there is no real reason for the frozen requirement on the type. The HNSW 
>>>> index can be built just as easily from a non-frozen array.
>>>> 
>>>> I am in favour of enforcing non-null on the elements of an array by 
>>>> default. I would prefer that allowing nulls in the array would be a later 
>>>> addition if and when a use case arose for it.
>>>> 
>>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com 
>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>> Even in the ML case, sparse can just mean zeros rather than nulls, and 
>>>> they should compress similarly anyway.
>>>> 
>>>> If we really want null values, I'd rather leave that in collections space.
>>>> 
>>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com 
>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>> I actually still prefer type[dimension], because I think I intuitively 
>>>> read this as a primitive (meaning no null elements) array. Then we can 
>>>> have the indexing apparatus only accept frozen<float[n]> for the HNSW case.
>>>> 
>>>> If that isn't intuitive to anyone else, I don't really have a strong 
>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One 
>>>> should indicate single vs. multi-cell, and the other the presence or 
>>>> absence of nulls/zeros/whatever.
>>>> 
>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com 
>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>> I agree with David's reasoning and the use of DENSE (and maybe eventually 
>>>> SPARSE). This is terminology well established in the data world, and it 
>>>> would lead to much easier adoption from users. VECTOR is close, but I can 
>>>> see having to create a lot of content around "How to use it and not get in 
>>>> trouble." (I have a lot of that content already)
>>>> 
>>>>  - We don't have to explain what it is. A lot of prior art out there 
>>>> already [1][2][3]
>>>>  - We're matching an established term with what users would expect. No 
>>>> surprises. 
>>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>> 
>>>> The implementation is flexible, but the interface should empower our users 
>>>> to be awesome. 
>>>> 
>>>> Patrick
>>>> 
>>>> 1 - 
>>>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>> 2 - 
>>>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>> 3 - 
>>>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>>> 
>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com 
>>>> <mailto:dcapw...@apple.com>> wrote:
>>>> My views have changed over time on syntax and I feel type[dimension] may 
>>>> not be the best, so it has gone lower in my own personal ranking… this is 
>>>> my current preference
>>>> 
>>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>>> 2) VECTOR<type, dimension>
>>>> 3) type[dimension]
>>>> 
>>>> My reasoning for this order
>>>> 
>>>> * type[dimension] looks like syntax sugar for array<type, dimension>, so 
>>>> users may assume list/array semantics, but we limit to non-null elements 
>>>> in a frozen array
>>>> * VECTOR as a prefix feels out of place to me, but VECTOR as a direct type 
>>>> makes more sense… this also leads to a possible future of VECTOR<type> 
>>>> which is the non-fixed length version of this type.  What makes VECTOR 
>>>> different from list/array?  Non-null elements and being frozen.  I don’t feel 
>>>> that VECTOR really tells users to expect non-null or frozen semantics, as 
>>>> there exist different VECTOR types for those reasons (sparse vs dense)… 
>>>> * DENSE may be confusing for people coming from languages where this just 
>>>> means “sequential layout”, which is what our frozen array/list already 
>>>> are… but since the target user is coming from an ML background, this 
>>>> shouldn’t cause much confusion.  DENSE just means FROZEN in Cassandra, 
>>>> with NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE 
>>>> just acts as syntax sugar for frozen<non null type[dimension]>
>>>> 
>>>> 
>>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com 
>>>>> <mailto:dri...@gmail.com>> wrote:
>>>>> 
>>>>> 1. VECTOR<FLOAT,n>
>>>>> 2. VECTOR FLOAT[n]
>>>>> 3. FLOAT[N]   (Non null by default)
>>>>> 
>>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>> 
>>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org 
>>>>> <mailto:bened...@apache.org>> wrote:
>>>>>> 
>>>>>> Hurrah for initial agreement.
>>>>>> 
>>>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], 
>>>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t 
>>>>>> think VECTOR should be used to simply imply non-null, as this would be 
>>>>>> very unintuitive. More logical would be NONNULL, if this is the only 
>>>>>> condition being applied. Alternatively for arrays we could default to 
>>>>>> NONNULL and later introduce NULLABLE if we want to permit nulls.
>>>>>> 
>>>>>> If the word vector is to be used it makes more sense to make it look 
>>>>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not 
>>>>>> redundant.
>>>>>> 
>>>>>> So, I vote:
>>>>>> 
>>>>>> 1) (NON NULL) FLOAT[N]
>>>>>> 2) FLOAT[N]   (Non null by default)
>>>>>> 3) VECTOR<FLOAT, N>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org 
>>>>>> <mailto:m...@apache.org>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Did we agree on a CQL syntax?
>>>>>>> 
>>>>>>> I don’t believe there has been a poll on CQL syntax… my understanding 
>>>>>>> reading all the threads is that there are ~4-5 options and none are 
>>>>>>> -1ed, so believe we are waiting for majority rule on this?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>>> 
>>>>>> 1. VECTOR FLOAT[n]
>>>>>> 2. FLOAT VECTOR[n]
>>>>>> 3. VECTOR<FLOAT,n>
>>>>>> 4. VECTOR[n]<FLOAT>
>>>>>> 5. ARRAY<FLOAT, n>
>>>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>>> 
>>>>>> 
>>>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the 
>>>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR 
>>>>>> keyword is: for general cql users; just meaning "non-null and frozen", 
>>>>>> these gel best together.
>>>>>> 
>>>>>> Options (5) and (6) are for those that feel we can and should provide 
>>>>>> this type without introducing the vector keyword.
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>>  <https://www.datastax.com/>
>>>> Mike Adamson
>>>> Engineering
>>>> +1 650 389 6000 <tel:16503896000> | datastax.com 
>>>> <https://www.datastax.com/>
>>> 
> 
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com <http://www.datastax.com/>
> @spyced
