Re: [DISCUSS] NULL handling and the unfrozen collection issue

Benjamin Lerer Thu, 16 May 2024 07:34:42 -0700

I found some other confusing behavior in LWT around null value and empty
multicell collection. I opened CASSANDRA-19637
<https://issues.apache.org/jira/browse/CASSANDRA-19637>for those interested.


Le jeu. 4 avr. 2024 à 18:34, Caleb Rackliffe <calebrackli...@gmail.com> a
écrit :

> The easiest way to check out how Accord uses IS NULL and IS NOT NULL is to
> look at the examples in the cep-15-accord branch:
>
>
> https://github.com/apache/cassandra/blob/cep-15-accord/test/distributed/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java
>
> tl;dr We did indeed try to go with an approach that more closely matches
> SQL, although there may still be some edges we didn't test.
>
> I'd have no problem w/ moving to 3-value logic everywhere, I guess, but
> "everywhere" could just mean filtering queries and Accord. (i.e. If we want
> to deprecate LWT col = NULL syntax, do we really want people rewriting
> those LWTs...or just moving to the new Accord syntax, which obviously
> supports it? We should "spend" our user query rewrite budget wisely.)
>
> On Thu, Apr 4, 2024 at 4:53 AM Benjamin Lerer <ble...@apache.org> wrote:
>
>> Now, in Cassandra setting a column to null means deleting it and if *all*
>>> columns in a row are null the row is deleted. This might be another edge
>>> case...
>>
>>
>> It is slightly more complicated than that as the primary key columns
>> count in (*all* columns)
>>
>> For example if you have the following table: CREATE TABLE tlb (pk int, c
>> int, v int, PRIMARY KEY (pk, c)) and the following row: INSERT INTO tlb
>> (pk, c, v) VALUES (1, 1, 1)
>> deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not
>> delete the row as the primary key columns have a timestamp and therefore do
>> exist. So the row will still exist with a null value for column v.
>>
>> If the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk =
>> 1 AND c = 1) things will be different as an UPDATE statement do not create
>> a timestamp for the clustering columns. By consequence, if V is deleted
>> (set to null) the row will not have any columns anymore and will be deleted.
>>
>> The issue here is that in practice we never consider partition keys or
>> clustering columns as null if the database returns a row for it. Whether
>> the primary key columns have a timestamp or not.
>> I believe that we should ignore that fact too as far as IS NULL/IS NOT
>> NULL are concerned. If a row exists, its primary columns should be
>> considered as not null. Otherwise we are getting on a really slippery
>> slope. The INSERT/UPDATE logic is confusing enough in my opinion without
>> adding another layer to it.
>>
>> One other issue that we have though is that the code for != LWT does not
>> work with the Three-Valued logic. If you have: [...] IF col != 1  and col
>> is null then in the TVL the value should be UNKNOWN therefore the condition
>> should not match.
>> It feels to me that we should probably keep that behavior for backward
>> compatibility reasons but probably change the behavior in Accord if it is
>> not already done.
>>
>>
>>
>> Le jeu. 21 mars 2024 à 01:10, German Eichberger via dev <
>> dev@cassandra.apache.org> a écrit :
>>
>>> Hi,
>>>
>>> +1 I like doing it the SQL way. This makes sense to me.
>>>
>>> Now, in Cassandra setting a column to null means deleting it and if
>>> *all* columns in a row are null the row is deleted. This might be
>>> another edge case...
>>>
>>> German
>>> ------------------------------
>>> *From:* Benjamin Lerer <b.le...@gmail.com>
>>> *Sent:* Wednesday, March 20, 2024 9:15 AM
>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>> *Subject:* [EXTERNAL] [DISCUSS] NULL handling and the unfrozen
>>> collection issue
>>>
>>> You don't often get email from b.le...@gmail.com. Learn why this is
>>> important <https://aka.ms/LearnAboutSenderIdentification>
>>> Hi everybody,
>>>
>>> CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
>>> community to handle NULL including for things like unfrozen (multi-cell)
>>> collections and I would like to make a proposal for moving forward with
>>> NULL related issues.
>>>
>>> We have currently 2 tickets open about NULL handling (I might have
>>> missed others):
>>>
>>>    1. CASSANDRA-10715
>>>    <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
>>>    Filtering on NULL
>>>    2. CASSANDRA-17762
>>>    <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col
>>>    = NULL is inconsistent with SQL NULL
>>>
>>> We also had previously some discussion on which we touched the subject:
>>>
>>>    - [DISCUSS] LWT UPDATE semantics with + and - when null
>>>    - CEP-15 multi key transaction syntax
>>>
>>> In all those tickets and discussions the consensus was to have a
>>> behavior similar to SQL.
>>>
>>> For null comparisons, SQL uses the three-value logic (
>>> https://modern-sql.com/concept/three-valued-logic) introducing the need
>>> for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
>>> predicate supported in LWT conditions (CASSANDRA-17762
>>> <https://issues.apache.org/jira/browse/CASSANDRA-17762>).
>>>
>>> So far, as Cassandra was only using inclusive operators, comparisons
>>> were behaving in an expected way. According to three-valued logic NULL
>>> CONTAINS 'foo' should return UNKNOWN and the filtering behavior should
>>> exclude everything which is not true.Therefore the row should not be
>>> returned as expected. With exclusive operators things are more tricky.
>>> NULL NOT CONTAINS 'foo' will also return UNKNOWN causing the row to not be
>>> returned which might not match people's expectations.
>>> This behavior can be even more confusing once you take into account
>>> empty and null collections. NOT CONTAINS on an empty collection will return
>>> true while it will return UNKNOWN on a NULL collection. Unfortunately, for
>>> unfrozen (multicell) collections we are unable to differentiate between an
>>> empty and null collection and therefore always treat empty collections as
>>> NULL.
>>> For predicates such as map[myKey] != 'foo' when myKey is not present the
>>> result can also be surprising as it will end up comparing NULL to 'foo'
>>> returning once more UNKNOWN and ignoring the row.
>>> In order to respect the SQL three-valued logic and be able to allow the
>>> user to fetch all the rows which do not contains a specific value we would
>>> need support IS NULL, IS NOT NULL and OR to allow query like:
>>> WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] !=
>>> foo
>>>
>>> Supporting the three-valued logic makes sense to me even if some
>>> behavior might end up being confusing. In which case we can easily fix
>>> CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.
>>>
>>> What is people's opinion? Should we go for the three-valued logic
>>> everywhere? Should we try something else?
>>>
>>>
>>>
>>>
>>>
>>>

Re: [DISCUSS] NULL handling and the unfrozen collection issue

Reply via email to