Little update: the following query also times out, which is weird since the range tombstone should have been read by then...
SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
AND timeid < the_oldest_deleted_timeid
ORDER BY timeid DESC;

On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com> wrote:

> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth a JIRA ticket and see what the devs think about it. If optimizing
>> this case would make sense.
>>
>> Hannu
>>
>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is
>> the newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>> of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition level statistics of cell ages would need
>>> to be kept in the column index for the skipping and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>>
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>> > tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is
>>>> > within the range of the tombstone but is newer than the tombstone and
>>>> > therefore it might still be returned. Scanning through deleted data can
>>>> > be avoided by reading the partition in reverse (if all the deleted data is
>>>> > in the beginning of the partition). Eventually you will still end up
>>>> > reading a lot of tombstones but you will get a lot of live data first and
>>>> > the implicit query limit of 10000 probably is reached before you get to the
>>>> > tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>> >> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected hence
>>>> >> the message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as
>>>> >> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>     hash blob,
>>>> >>     timeid timeuuid,
>>>> >>     PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>     AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>>> >> a really wide partition (> 512 MB) for `hash = x`.
>>>> >> I then delete the oldest _half_ of that partition by executing the
>>>> >> query below, and restart the node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled the following query times out (takes
>>>> >> more than 10 seconds to succeed):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no
>>>> >> deleted data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just
>>>> >> because the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading by ASC does not make use of the range
>>>> >> tombstone until C* reads the last sstable (which actually contains
>>>> >> the range tombstone and is flushed at node restart), and it wastes
>>>> >> time reading all rows that are actually not live anymore.
>>>> >>
>>>> >> Is this expected? Should the range tombstone actually help in these
>>>> >> cases?
>>>> >>
>>>> >> Thanks a lot!
>>>> >> Stefano
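
For readers following the thread, here is a toy Python sketch of the behaviour being discussed. It is not Cassandra's read path: the function names (`scan`, `scan_with_seek`), the integer clustering keys and timestamps, and the row counts are all made up for illustration. It only models the two ideas from the messages above: a reader that filters deleted rows one by one touches every deleted row on an ASC scan but none on a DESC scan with a limit, and seeking past the tombstone range is only safe when no row inside the range is newer than the tombstone.

```python
from bisect import bisect_left


def scan(rows, tombstone_end, tombstone_ts, limit, reverse=False):
    """Filter rows one at a time; return (live_keys, rows_touched).

    rows: list of (clustering_key, write_timestamp), sorted by key.
    The (hypothetical) range tombstone covers keys < tombstone_end and
    shadows rows whose write timestamp is not newer than tombstone_ts.
    """
    ordered = reversed(rows) if reverse else rows
    live, touched = [], 0
    for key, write_ts in ordered:
        touched += 1
        deleted = key < tombstone_end and write_ts <= tombstone_ts
        if not deleted:
            live.append(key)
            if len(live) == limit:
                break
    return live, touched


def scan_with_seek(rows, tombstone_end, tombstone_ts, limit):
    """ASC scan that jumps over the tombstone range instead of filtering it.

    Only valid if every row in the range is shadowed, which is the extra
    per-partition information the thread suspects is not available.
    """
    keys = [key for key, _ in rows]
    start = bisect_left(keys, tombstone_end)
    if any(ts > tombstone_ts for _, ts in rows[:start]):
        raise ValueError("a row in the range is newer than the tombstone")
    return scan(rows[start:], tombstone_end, tombstone_ts, limit)


# 1000 rows written at timestamp 1; the oldest half deleted at timestamp 2.
rows = [(key, 1) for key in range(1000)]
asc = scan(rows, tombstone_end=500, tombstone_ts=2, limit=100)
desc = scan(rows, tombstone_end=500, tombstone_ts=2, limit=100, reverse=True)
seek = scan_with_seek(rows, tombstone_end=500, tombstone_ts=2, limit=100)
print(asc[1], desc[1], seek[1])
```

In this model the ASC scan touches the 500 shadowed rows before it returns its first live row, while the DESC scan and the seeking scan touch only the rows they return. Whether Cassandra can do the equivalent of `scan_with_seek` is exactly the open question in the thread.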