Went through and created a spreadsheet of the current votes. For Patrick and Mike I don’t see a clear vote, so I put a ? where I “think” your preference is. For Mick I only put one vote, as the list looked like a summary, but you mentioned the first was your preference.
Voters: Jonathan Ellis, David Capwell, Josh McKenzie, Caleb Rackliffe, Patrick McFadin, Brandon Williams, Mike Adamson, Benedict, Mick Semb Wever

Syntax                            Votes
VECTOR<type, dimension>           1 2 2 1 ? 3
DENSE VECTOR<type, dimension>     2 1 ? ?
type[dimension]                   3 3 3 1 3 2
DENSE_VECTOR<type, dimension>     1
NON NULL <type>[dimension]        1 1
VECTOR type[n]                    2 1
ARRAY<type, n>
NON-NULL FROZEN<type[n]>

1 = top pick
2 = second pick
3 = third pick

Let me know if I am missing anyone, or if I have bad data

> On May 5, 2023, at 9:23 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> +10 for not inflicting unwieldy keywords on ML users.
>
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE
> keyword is that I don't see a foreseeable future where we also support sparse
> vectors, so it would end up being unnecessary extra verbosity. So my
> preference would be
>
> 1. VECTOR<type, dimension>
> 2. DENSE VECTOR<type, dimension> (space instead of underscore, SQL isn't
> afraid of spaces)
> 3. type[dimension]
>
> On Fri, May 5, 2023 at 10:54 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>> I hope we are willing to consider developers that use our system because if
>> I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the
>> response would be:
>>
>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>> believe I did, Bob.
>>
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> Idiomatically, to my mind, there's a question of "what space are we
>>> thinking about this datatype in"?
>>>
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>>
>>> Given many models are exploring quantizing to int8 and other data types,
>>> there's definitely the "support other data types easily in the future"
>>> piece to me we need to keep in mind.
>>>
>>> So with the above and the "meet the user where they are and don't make them
>>> understand more of Cassandra than absolutely critical to use it", I lean:
>>>
>>> 1. DENSE_VECTOR<type, dimension>
>>> 2. VECTOR<type, dimension>
>>> 3. type[dimension]
>>>
>>> This leaves the path open for us to expand on it in the future with sparse
>>> support and allows us to introduce some semantics that indicate idioms
>>> around nullability for the users coming from a different space.
>>>
>>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires
>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>> different things to us, we have differences between frozen and non-frozen
>>> due to constraints in our storage engine and materialization of data, etc)
>>> that get in the way of users doing things in the pattern they're familiar
>>> with without learning more about the DB than they're probably looking to
>>> learn. Historically this has been a challenge for us in adoption; the
>>> classic "Why can't I just write and delete and write as much as I want? Why
>>> are deletes filling up my disk?" problem comes to mind.
>>>
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN<TYPE[n]>
>>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>>>
>>> If getting into the "built-in syntactic sugar mapping for communities and
>>> specific use-cases" is something we're willing to consider.
>>>
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>> I think we are still discussing implementation here when I'm talking about
>>>> developer experience. I want developers to adopt this quickly, easily and
>>>> be successful. Vector search is already a thing. People use it every day.
>>>> A successful outcome, in my view, is developers picking up this feature
>>>> without reading a manual. (Because they don't anyway and get in trouble) I
>>>> did some more extensive research about what other DBs are using for
>>>> syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>>
>>>> Pinecone[1]: dense_vector, sparse_vector
>>>> Elastic[2]: dense_vector
>>>> Milvus[3]: float_vector, binary_vector
>>>> pgvector[4]: vector
>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>
>>>> Based on that I'm advocating a similar syntax:
>>>>
>>>> - DENSE VECTOR
>>>> or
>>>> - VECTOR
>>>>
>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>> [3] https://milvus.io/docs/create_collection.md
>>>> [4] https://github.com/pgvector/pgvector
>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>
>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com> wrote:
>>>> Then we can have the indexing apparatus only accept frozen<float[n]> for
>>>> the HNSW case.
>>>> I'm inclined to agree with Benedict that the index will need to be
>>>> specifically selected by option rather than inferred based on type. As such
>>>> there is no real reason for the frozen requirement on the type. The hnsw
>>>> index can be built just as easily from a non-frozen array.
>>>>
>>>> I am in favour of enforcing non-null on the elements of an array by
>>>> default. I would prefer that allowing nulls in the array would be a later
>>>> addition if and when a use case arose for it.
>>>>
>>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>>> they should compress similarly anyway.
>>>>
>>>> If we really want null values, I'd rather leave that in collections space.
>>>>
>>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>> I actually still prefer type[dimension], because I think I intuitively
>>>> read this as a primitive (meaning no null elements) array. Then we can
>>>> have the indexing apparatus only accept frozen<float[n]> for the HNSW case.
>>>>
>>>> If that isn't intuitive to anyone else, I don't really have a strong
>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>>>> should indicate single vs. multi-cell, and the other the presence or
>>>> absence of nulls/zeros/whatever.
>>>>
>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> I agree with David's reasoning and the use of DENSE (and maybe eventually
>>>> SPARSE). This is terminology well established in the data world, and it
>>>> would lead to much easier adoption from users. VECTOR is close, but I can
>>>> see having to create a lot of content around "How to use it and not get in
>>>> trouble." (I have a lot of that content already)
>>>>
>>>> - We don't have to explain what it is. A lot of prior art out there
>>>> already [1][2][3]
>>>> - We're matching an established term with what users would expect. No
>>>> surprises.
>>>> - Shorter ramp-up time for users. Cassandra is being modernized.
>>>>
>>>> The implementation is flexible, but the interface should empower our users
>>>> to be awesome.
>>>>
>>>> Patrick
>>>>
>>>> 1 - https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>> 2 - https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>> 3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>>>
>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com> wrote:
>>>> My views have changed over time on syntax and I feel type[dimension] may
>>>> not be the best, so it has gone lower in my own personal ranking… this is
>>>> my current preference
>>>>
>>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>>> 2) VECTOR<type, dimension>
>>>> 3) type[dimension]
>>>>
>>>> My reasoning for this order
>>>>
>>>> * type[dimension] looks like syntax sugar for array<type, dimension>, so
>>>> users may assume list/array semantics, but we limit to non-null elements
>>>> in a frozen array
>>>> * VECTOR as a prefix feels out of place, but VECTOR as a direct type
>>>> makes more sense… this also leads to a possible future of VECTOR<type>
>>>> which is the non-fixed length version of this type. What makes VECTOR
>>>> different from list/array?
>>>> non-null elements and is frozen. I don’t feel
>>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>>> there exist different VECTOR types for those reasons (sparse vs dense)…
>>>> * DENSE may be confusing for people coming from languages where this just
>>>> means “sequential layout”, which is what our frozen array/list already
>>>> are… but since the target user is coming from an ML background, this
>>>> shouldn’t offer much confusion. DENSE just means FROZEN in Cassandra,
>>>> with NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE
>>>> just acts as syntax sugar for frozen<non null type[dimension]>
>>>>
>>>>
>>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>>>
>>>>> 1. VECTOR<FLOAT,n>
>>>>> 2. VECTOR FLOAT[n]
>>>>> 3. FLOAT[N] (Non null by default)
>>>>>
>>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>>
>>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>>>>>
>>>>>> Hurrah for initial agreement.
>>>>>>
>>>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>>>>> think VECTOR should be used to simply imply non-null, as this would be
>>>>>> very unintuitive. More logical would be NONNULL, if this is the only
>>>>>> condition being applied. Alternatively for arrays we could default to
>>>>>> NONNULL and later introduce NULLABLE if we want to permit nulls.
>>>>>>
>>>>>> If the word vector is to be used it makes more sense to make it look
>>>>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
>>>>>> redundant.
>>>>>>
>>>>>> So, I vote:
>>>>>>
>>>>>> 1) (NON NULL) FLOAT[N]
>>>>>> 2) FLOAT[N] (Non null by default)
>>>>>> 3) VECTOR<FLOAT, N>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Did we agree on a CQL syntax?
>>>>>>>
>>>>>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>>>>>> reading all the threads is that there are ~4-5 options and none are
>>>>>>> -1ed, so believe we are waiting for majority rule on this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>>>
>>>>>> 1. VECTOR FLOAT[n]
>>>>>> 2. FLOAT VECTOR[n]
>>>>>> 3. VECTOR<FLOAT,n>
>>>>>> 4. VECTOR[n]<FLOAT>
>>>>>> 5. ARRAY<FLOAT, n>
>>>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>>>
>>>>>>
>>>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR
>>>>>> keyword is: for general cql users; just meaning "non-null and frozen",
>>>>>> these gel best together.
>>>>>>
>>>>>> Options (5) and (6) are for those that feel we can and should provide
>>>>>> this type without introducing the vector keyword.
>>>>>>
>>>>
>>>> --
>>>> Mike Adamson
>>>> Engineering
>>>> +1 650 389 6000 | datastax.com
>>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
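
For anyone who wants to double-check the sheet at the top of this message, here is a minimal Python sketch that counts how many 1st, 2nd and 3rd picks each syntax received. The rows are transcribed from the sheet as posted; the names ROWS, tally and first_pick_leaders are illustrative only, and counting marks per rank is just an assumed way to read the sheet, not a voting rule anyone in this thread agreed to. Entries marked ? are counted separately as unclear.

```python
from collections import Counter

# Rows transcribed from the sheet at the top of this message.
# "?" marks a guessed preference; it is tracked separately, not as a ranked pick.
ROWS = {
    "VECTOR<type, dimension>": ["1", "2", "2", "1", "?", "3"],
    "DENSE VECTOR<type, dimension>": ["2", "1", "?", "?"],
    "type[dimension]": ["3", "3", "3", "1", "3", "2"],
    "DENSE_VECTOR<type, dimension>": ["1"],
    "NON NULL <type>[dimension]": ["1", "1"],
    "VECTOR type[n]": ["2", "1"],
    "ARRAY<type, n>": [],
    "NON-NULL FROZEN<type[n]>": [],
}


def tally(rows):
    """Return {syntax: Counter of marks} so each rank can be looked up by key."""
    return {syntax: Counter(marks) for syntax, marks in rows.items()}


def first_pick_leaders(rows):
    """Return the syntaxes (sorted by name) with the most '1' marks."""
    counts = tally(rows)
    best = max(c["1"] for c in counts.values())
    return sorted(s for s, c in counts.items() if c["1"] == best)


if __name__ == "__main__":
    for syntax, c in tally(ROWS).items():
        print(f"{syntax:32} 1st={c['1']} 2nd={c['2']} 3rd={c['3']} unclear={c['?']}")
    print("Most first picks:", ", ".join(first_pick_leaders(ROWS)))
```

Run as a script, it prints one line per syntax plus the syntaxes with the most first picks. It only summarizes the sheet and does not decide the outcome.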