No, because C* has reverse iterators.
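
A minimal sketch of the idea, as a toy model rather than Cassandra's actual read path: assume the partition is stored as blocks of rows sorted by clustering key, and that an iterator can walk those blocks from either end. The names `IndexedPartition`, `blocks_read`, `deleted_before` and `first_n` are made up for illustration. The `deleted_before` filter models a reader that drops deleted rows one by one instead of skipping the whole deleted range, which is the behaviour the thread is trying to explain.

```python
class IndexedPartition:
    """Toy wide partition: rows sorted by clustering key, split into blocks."""

    def __init__(self, keys, block_size=4):
        rows = sorted(keys)
        self.blocks = [rows[i:i + block_size]
                       for i in range(0, len(rows), block_size)]
        self.blocks_read = 0  # number of simulated block reads

    def _read_block(self, idx):
        self.blocks_read += 1
        return self.blocks[idx]

    def iterate(self, reverse=False, deleted_before=None):
        """Yield live keys in ASC order, or DESC order when reverse=True.

        Keys below `deleted_before` are treated as covered by a range
        tombstone: their blocks are still read, but the keys are dropped.
        """
        order = range(len(self.blocks))
        if reverse:
            order = reversed(order)  # start from the last block
        for idx in order:
            block = self._read_block(idx)
            for key in (reversed(block) if reverse else block):
                if deleted_before is None or key >= deleted_before:
                    yield key


def first_n(iterator, n):
    """Take at most n rows, then stop pulling from the iterator."""
    out = []
    for row in iterator:
        out.append(row)
        if len(out) == n:
            break
    return out
```

In this model, taking the first few rows with `reverse=True` touches only the last block, while an ASC read over a partition whose oldest half is deleted has to walk every deleted block before it reaches live data.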

On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth <> wrote:

> If the data is stored in ASC order and the query asks for DESC, then wouldn’t
> it read the whole partition first and then pick the data in reverse order?
> On May 16, 2017, at 10:03 AM, Stefano Ortolani <> wrote:
> Hi Hannu,
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <> wrote:
>> Well, as mentioned, Cassandra probably doesn’t have the logic and data to
>> skip larger regions of deleted data based on a range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>> Hannu
>> On 16 May 2017, at 17:33, Stefano Ortolani <> wrote:
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by the aforementioned behavior.
>> I would expect them to handle this case more gracefully.
>> Cheers,
>> Stefano
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <> wrote:
>>> Hannu,
>>> How can you read a partition in reverse?
>>> Sent from my iPhone
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <> wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 10000 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <>
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query times out (takes
>>> more than 10 seconds to
>>> >> complete):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just
>>> because the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> tombstone until C* reads the
>>> >> last sstable (which actually contains the range tombstone and is
>>> flushed at node restart), and it wastes time reading all rows that are
>>> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano