> The hnsw index can be built just as easily from a non-frozen array.

I have 0 issues removing that limitation =)

> I am in favour of enforcing non-null on the elements of an array by default.

This is why I feel DENSE or NON NULL are the best prefixes, as both imply that 
elements may not be null.  A sparse vector represents missing data with the 
zero of its domain, so for a nullable type that is null, but for an int that is 
0… It's still missing data… whereas Dense does not allow missing data (aka NON 
NULL)

> Given many models are exploring quantizing to int8 and other data types, 
> there's definitely the "support other data types easily in the future" piece 
> to me we need to keep in mind.

I took Jonathann’s patch and enhanced it to work with every Cassandra type, 
current and future… I added random fuzz testing to make sure that type-level 
properties hold for all types (and found bugs in many types… yay?)… 

So in my patch you can do the following

DenseVector<Float>(42) - represents a float based vector, this is working with 
a fixed length type, so uses an encoding that removes lengths (we know this is 4 
bytes… why write that?)
DenseVector<Short>(42) - represents a short based vector, this is working with 
a variable length type (why is ShortType variable length?!?!?!!?!?!?!), so 
encodes the length as part of the format; think frozen list serialization format
DenseVector<Map<DenseVector<Float>, DenseVector<Short>>> - vector of map… if 
you add Short support, then Map is 0 effort, as they require the same things….  
It actually takes more work to not allow map
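
To make the difference between the two encodings concrete, here is a minimal Python sketch. It is not the code from the patch: the function names are hypothetical, and the 4-byte big-endian length prefix for variable length elements is an assumption modelled loosely on the frozen list format mentioned above.

```python
import struct
from typing import List


def encode_fixed(values: List[float]) -> bytes:
    """Fixed length elements (4-byte floats): concatenate, no per-element length."""
    return b"".join(struct.pack(">f", v) for v in values)


def decode_fixed(buf: bytes, dimension: int) -> List[float]:
    """The dimension and element width are known, so the total size is implied."""
    if len(buf) != 4 * dimension:
        raise ValueError("expected %d bytes, got %d" % (4 * dimension, len(buf)))
    return list(struct.unpack(">%df" % dimension, buf))


def encode_variable(elements: List[bytes]) -> bytes:
    """Variable length elements: each one is prefixed with an assumed 4-byte length."""
    out = bytearray()
    for element in elements:
        out += struct.pack(">i", len(element))
        out += element
    return bytes(out)


def decode_variable(buf: bytes, dimension: int) -> List[bytes]:
    """Read exactly `dimension` length-prefixed elements and reject leftovers."""
    elements = []
    offset = 0
    for _ in range(dimension):
        (length,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        if length < 0 or offset + length > len(buf):
            raise ValueError("invalid element length %d" % length)
        elements.append(buf[offset:offset + length])
        offset += length
    if offset != len(buf):
        raise ValueError("trailing bytes after last element")
    return elements
```

With 3 floats the fixed form takes 12 bytes, while 2 two-byte elements in the variable form also take 12 bytes because each carries its own length prefix.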

I block null data, but for numeric types I do not block 0, as 0 is also a valid 
non-null element… (yay math confusion…)… in my definition [0, 0, 0] is a valid 
3 dim vector of int...
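
As a rough illustration of that rule (a sketch only; `validate_dense` is a hypothetical helper, not something from the patch), validation rejects null elements and a wrong dimension but treats 0 like any other value:

```python
from typing import List, Optional, Sequence


def validate_dense(elements: Sequence[Optional[int]], dimension: int) -> List[int]:
    """Reject nulls and wrong dimensions; 0 is an ordinary non-null element."""
    if len(elements) != dimension:
        raise ValueError(
            "expected %d elements, got %d" % (dimension, len(elements))
        )
    for index, element in enumerate(elements):
        if element is None:
            raise ValueError("null element at index %d" % index)
    return list(elements)
```

So `validate_dense([0, 0, 0], 3)` is accepted, while `validate_dense([0, None, 0], 3)` raises.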

> On May 5, 2023, at 8:53 AM, Patrick McFadin <pmcfa...@gmail.com> wrote:
> 
> I hope we are willing to consider developers that use our system because if I 
> had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the 
> response would be:
> 
> Did you tell me to go write a distributed map-reduce job in Erlang? I believe 
> I did, Bob.  
> 
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org> wrote:
>> Idiomatically, to my mind, there's a question of "what space are we thinking 
>> about this datatype in"?
>> 
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone (or 
>> nothing)
>> - In the context of programming languages, it's all over the place
>> 
>> Given many models are exploring quantizing to int8 and other data types, 
>> there's definitely the "support other data types easily in the future" piece 
>> to me we need to keep in mind.
>> 
>> So with the above and the "meet the user where they are and don't make them 
>> understand more of Cassandra than absolutely critical to use it", I lean:
>> 
>> 1. DENSE_VECTOR<type, dimension>
>> 2. VECTOR<type, dimension>
>> 3. type[dimension]
>> 
>> This leaves the path open for us to expand on it in the future with sparse 
>> support and allows us to introduce some semantics that indicate idioms 
>> around nullability for the users coming from a different space.
>> 
>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires 
>> understanding idioms of how Cassandra thinks about data (nulls mean 
>> different things to us, we have differences between frozen and non-frozen 
>> due to constraints in our storage engine and materialization of data, etc) 
>> that get in the way of users doing things in the pattern they're familiar 
>> with without learning more about the DB than they're probably looking to 
>> learn. Historically this has been a challenge for us in adoption; the 
>> classic "Why can't I just write and delete and write as much as I want? Why 
>> are deletes filling up my disk?" problem comes to mind.
>> 
>> I'd also be happy with us supporting:
>> * NON-NULL FROZEN<TYPE[n]>
>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>> 
>> If getting into the "built-in syntactic sugar mapping for communities and 
>> specific use-cases" is something we're willing to consider.
>> 
>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>> I think we are still discussing implementation here when I'm talking about 
>>> developer experience. I want developers to adopt this quickly, easily and 
>>> be successful. Vector search is already a thing. People use it every day. A 
>>> successful outcome, in my view, is developers picking up this feature 
>>> without reading a manual. (Because they don't anyway and get in trouble) I 
>>> did some more extensive research about what other DBs are using for syntax. 
>>> The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>> 
>>> Pinecone[1] - dense_vector, sparse_vector
>>> Elastic[2]: dense_vector
>>> Milvus[3]: float_vector, binary_vector
>>> pgvector[4]: vector
>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>> 
>>> Based on that I'm advocating a similar syntax:
>>> 
>>> - DENSE VECTOR
>>> or
>>> - VECTOR
>>> 
>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>> [3] https://milvus.io/docs/create_collection.md
>>> [4] https://github.com/pgvector/pgvector
>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>> 
>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com> wrote:
>>> Then we can have the indexing apparatus only accept frozen<float[n]> for 
>>> the HNSW case.
>>> I'm inclined to agree with Benedict that the index will need to be 
>>> specifically selected by option rather than inferred based on type. As such 
>>> there is no real reason for the frozen requirement on the type. The hnsw 
>>> index can be built just as easily from a non-frozen array.
>>> 
>>> I am in favour of enforcing non-null on the elements of an array by 
>>> default. I would prefer that allowing nulls in the array would be a later 
>>> addition if and when a use case arose for it.
>>> 
>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> Even in the ML case, sparse can just mean zeros rather than nulls, and they 
>>> should compress similarly anyway.
>>> 
>>> If we really want null values, I'd rather leave that in collections space.
>>> 
>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> I actually still prefer type[dimension], because I think I intuitively read 
>>> this as a primitive (meaning no null elements) array. Then we can have the 
>>> indexing apparatus only accept frozen<float[n]> for the HNSW case.
>>> 
>>> If that isn't intuitive to anyone else, I don't really have a strong 
>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One 
>>> should indicate single vs. multi-cell, and the other the presence or 
>>> absence of nulls/zeros/whatever.
>>> 
>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> I agree with David's reasoning and the use of DENSE (and maybe eventually 
>>> SPARSE). This is terminology well established in the data world, and it 
>>> would lead to much easier adoption from users. VECTOR is close, but I can 
>>> see having to create a lot of content around "How to use it and not get in 
>>> trouble." (I have a lot of that content already)
>>> 
>>>  - We don't have to explain what it is. A lot of prior art out there 
>>> already [1][2][3]
>>>  - We're matching an established term with what users would expect. No 
>>> surprises. 
>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>> 
>>> The implementation is flexible, but the interface should empower our users 
>>> to be awesome. 
>>> 
>>> Patrick
>>> 
>>> 1 - https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>> 2 - https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>> 3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>> 
>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com> wrote:
>>> My views have changed over time on syntax and I feel type[dimension] may 
>>> not be the best, so it has gone lower in my own personal ranking… this is 
>>> my current preference
>>> 
>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>> 2) VECTOR<type, dimension>
>>> 3) type[dimension]
>>> 
>>> My reasoning for this order
>>> 
>>> * type[dimension] looks like syntax sugar for array<type, dimension>, so 
>>> users may assume list/array semantics, but we limit to non-null elements in 
>>> a frozen array
>>> * VECTOR as a prefix feels out of place, but VECTOR as a direct type 
>>> makes more sense… this also leads to a possible future of VECTOR<type> 
>>> which is the non-fixed length version of this type.  What makes VECTOR 
>>> different from list/array?  Non-null elements, and it is frozen.  I don’t feel 
>>> that VECTOR really tells users to expect non-null or frozen semantics, as 
>>> there exist different VECTOR types for those reasons (sparse vs dense)… 
>>> * DENSE may be confusing for people coming from languages where this just 
>>> means “sequential layout”, which is what our frozen array/list already are… 
>>> but since the target user is coming from an ML background, this shouldn’t 
>>> cause much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL 
>>> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as 
>>> syntax sugar for frozen<non null type[dimension]>
>>> 
>>> 
>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>> 
>>>> 1. VECTOR<FLOAT,n>
>>>> 2. VECTOR FLOAT[n]
>>>> 3. FLOAT[N]   (Non null by default)
>>>> 
>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>> 
>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>>>> 
>>>>> Hurrah for initial agreement.
>>>>> 
>>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], 
>>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t 
>>>>> think VECTOR should be used to simply imply non-null, as this would be 
>>>>> very unintuitive. More logical would be NONNULL, if this is the only 
>>>>> condition being applied. Alternatively for arrays we could default to 
>>>>> NONNULL and later introduce NULLABLE if we want to permit nulls.
>>>>> 
>>>>> If the word vector is to be used it makes more sense to make it look like 
>>>>> a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not 
>>>>> redundant.
>>>>> 
>>>>> So, I vote:
>>>>> 
>>>>> 1) (NON NULL) FLOAT[N]
>>>>> 2) FLOAT[N]   (Non null by default)
>>>>> 3) VECTOR<FLOAT, N>
>>>>> 
>>>>> 
>>>>> 
>>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Did we agree on a CQL syntax?
>>>>>> 
>>>>>> I don’t believe there has been a poll on CQL syntax… my understanding 
>>>>>> reading all the threads is that there are ~4-5 options and none are -1ed, 
>>>>>> so I believe we are waiting for majority rule on this?
>>>>> 
>>>>> 
>>>>> 
>>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>> 
>>>>> 1. VECTOR FLOAT[n]
>>>>> 2. FLOAT VECTOR[n]
>>>>> 3. VECTOR<FLOAT,n>
>>>>> 4. VECTOR[n]<FLOAT>
>>>>> 5. ARRAY<FLOAT, n>
>>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>> 
>>>>> 
>>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the 
>>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR 
>>>>> keyword is: for general cql users; just meaning "non-null and frozen", 
>>>>> these gel best together.
>>>>> 
>>>>> Options (5) and (6) are for those that feel we can and should provide 
>>>>> this type without introducing the vector keyword.
>>>>> 
>>>>> 
