Re: Range deletes, wide partitions, and reverse iterators

Stefano Ortolani Tue, 16 May 2017 09:40:51 -0700

Little update: the following query also times out, which is weird since the
range tombstone should have been read by then...


SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
AND timeid < the_oldest_deleted_timeid
ORDER BY timeid DESC;



On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com>
wrote:

> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data, but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth opening a JIRA ticket to see what the devs think about it, and whether
>> optimizing this case would make sense.
>>
>> Hannu
>>
>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is
>> the newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>> of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition level statistics of cell ages would need
>>> to be kept in the column index for the skipping and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>>
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>> tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is
>>>> within the range of the tombstone but is newer than the tombstone and
>>>> therefore it might still be returned. Scanning through deleted data can
>>>> be avoided by reading the partition in reverse (if all the deleted data is
>>>> in the beginning of the partition). Eventually you will still end up
>>>> reading a lot of tombstones but you will get a lot of live data first and
>>>> the implicit query limit of 10000 probably is reached before you get to the
>>>> tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected hence
>>>> the message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as
>>>> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>  hash blob,
>>>> >>  timeid timeuuid,
>>>> >>  PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>>> _half_ of that partition by executing the query below, and restart the 
>>>> node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled, the following query times out (takes
>>>> more than 10 seconds to complete):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no
>>>> deleted data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just
>>>> because the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading by ASC does not make use of the range
>>>> tombstone until C* reads the
>>>> >> last sstable (which actually contains the range tombstone and is
>>>> flushed at node restart), and it wastes time reading all rows that are
>>>> actually not live anymore.
>>>> >>
>>>> >> Is this expected? Should the range tombstone actually help in these
>>>> cases?
>>>> >>
>>>> >> Thanks a lot!
>>>> >> Stefano
>>>> >
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>>>> >
>>>>
>>>
>>>
>>>
>>
>>
>
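
For readers of the archive, here is a toy model of the explanation given earlier in the thread: a scan that cannot skip a tombstoned range has to visit every shadowed row before it finds live data, while a reverse scan reaches the row limit first. This is only a sketch under stated assumptions, not Cassandra's actual read path (which merges several sstables and consults a column index). The names `Row`, `RangeTombstone`, and `scan` are hypothetical, integers stand in for `timeuuid` values, and the tombstone is assumed to win timestamp ties.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Row:
    timeid: int    # stand-in for the timeuuid clustering key
    write_ts: int  # write timestamp of the row


@dataclass(frozen=True)
class RangeTombstone:
    end: int  # covers every clustering key with timeid < end
    ts: int   # deletion timestamp


def scan(rows: List[Row], tombstone: RangeTombstone, limit: int,
         reverse: bool = False) -> Tuple[List[Row], int]:
    """Linear scan in clustering order; returns (live rows, rows visited)."""
    ordered = sorted(rows, key=lambda r: r.timeid, reverse=reverse)
    live: List[Row] = []
    visited = 0
    for row in ordered:
        if len(live) >= limit:
            break  # the limit is satisfied, stop reading
        visited += 1
        # Assumption: a row is shadowed when it falls inside the tombstone
        # range and is not newer than the tombstone.
        shadowed = row.timeid < tombstone.end and row.write_ts <= tombstone.ts
        if not shadowed:
            live.append(row)
    return live, visited


# 1000 rows, the oldest half deleted by a newer range tombstone.
rows = [Row(timeid=i, write_ts=1) for i in range(1000)]
tombstone = RangeTombstone(end=500, ts=2)

asc_live, asc_visited = scan(rows, tombstone, limit=100)
desc_live, desc_visited = scan(rows, tombstone, limit=100, reverse=True)

print(asc_visited)   # 600: 500 shadowed rows are read before 100 live ones
print(desc_visited)  # 100: the limit is reached before any deleted row
```

In this model the ascending scan does six times the work for the same number of returned rows, which matches the ASC vs. DESC difference reported in the original message. Skipping the shadowed range outright would require knowing that no row inside it is newer than the tombstone, which is the per-partition statistic discussed above.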
