Re: Re: Purging tombstones from a particular row in SSTable

2016-07-30 Thread DuyHai Doan
Looks like skipping SSTables based on max SSTable timestamp is possible if
you have the partition deletion info:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L538-L550

But it doesn't say anything about iterating over all the cells in a single
partition when there is a partition tombstone; I need to dig further
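
For what it's worth, the decision in that linked code can be sketched roughly
like this (plain Python, not Cassandra's actual Java; the function and
parameter names are invented for illustration):

```python
# Hypothetical sketch of the SSTable-skip idea discussed above (not Cassandra's
# real code): an sstable can be skipped entirely when every cell it could
# contain is older than the partition-level deletion we already know about.

def can_skip_sstable(sstable_max_timestamp: int,
                     partition_deletion_timestamp: int) -> bool:
    """Return True if nothing in the sstable can survive the partition tombstone."""
    return sstable_max_timestamp < partition_deletion_timestamp

# Partition deleted at t=100; sstable's newest cell written at t=90 -> skippable.
print(can_skip_sstable(90, 100))   # True
print(can_skip_sstable(110, 100))  # False
```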




On Sat, Jul 30, 2016 at 2:03 AM, Eric Stevens  wrote:

> I haven't tested that specifically, but I haven't bumped into any
> particular optimization that allows it to skip reading an sstable where the
> entire relevant partition has been row-tombstoned.  It's possible that
> something like that could happen by examining min/max timestamps on
> sstables, and not reading from any sstable with a partition-level tombstone
> where the max timestamp is less than the timestamp of the partition
> tombstone.  However, that presumes it has read the tombstones from each
> sstable before it reads the occluded data, which I don't think is
> likely.
>
> Such an optimization could be there, but I haven't noticed it if it is,
> though I'm certainly not an expert (more of a well informed novice).  If
> someone wants to set me straight on this point I'd love to know about it.
>
> On Fri, Jul 29, 2016 at 2:37 PM DuyHai Doan  wrote:
>
>> @Eric
>>
>> Very interesting example. But then what about the case of row (should I
>> say partition?) tombstones?
>>
>> Suppose that in your example, I issued a DELETE FROM foo WHERE pk='a'
>>
>> With the same SELECT statement as before, would C* be clever enough to
>> skip reading the whole partition entirely (let's limit the example to a
>> single SSTable)?
>>
>> On Fri, Jul 29, 2016 at 7:00 PM, Eric Stevens  wrote:
>>
>>> > Sai was describing a timeout, not a failure due to the 100 K
>>> tombstone limit from cassandra.yaml. But I still might be missing things
>>> about tombstones.
>>>
>>> The trouble with tombstones is not the tombstones themselves; it's that
>>> Cassandra has a lot of deleted data to read through in sstables
>>> in order to satisfy a query.  Although, if you range-constrain your clustering
>>> key in your query, the read path can optimize that read to start somewhere
>>> near the correct head of your selected data, that is _not_ true for
>>> tombstoned data.
>>>
>>> Consider this exercise:
>>> CREATE TABLE foo (
>>>   pk text,
>>>   ck int,
>>>   PRIMARY KEY ((pk), ck)
>>> );
>>> INSERT INTO foo (pk, ck) VALUES ('a', 1);
>>> ...
>>> INSERT INTO foo (pk, ck) VALUES ('a', 10);
>>>
>>> $ nodetool flush
>>>
>>> DELETE FROM foo WHERE pk='a' AND ck < 10;
>>>
>>> We've now written a single "tiny" (bytes-wise) range tombstone.
>>>
>>> Now try to select from that table:
>>> SELECT * FROM foo WHERE pk='a' AND ck > 5 LIMIT 1;
>>> pk | ck
>>> -- | --
>>> a  | 10
>>>
>>> This has to read from the first sstable, skipping over 4 records
>>> before it can locate the first non-tombstoned cell.
>>>
>>> The problem isn't the size of the tombstone, tombstones themselves are
>>> cheaper (again, bytes-wise) than standard columns because they don't
>>> involve any value for the cell.  The problem is that the read path cannot
>>> anticipate in advance what cells are going to be occluded by the tombstone,
>>> and in order to satisfy the query it needs to read then discard a large
>>> number of deleted cells.
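>>>
>>> To make that concrete, here is a toy model (plain Python, invented purely
>>> for illustration) of a reader that must touch every occluded cell before
>>> it can return the first live one:

```python
# Toy model of the read path described above: cells are stored in clustering
# order, a range tombstone occludes some of them, and the reader has to scan
# past every occluded cell before it can return a live result.

def first_live_cell(cells, tombstone_range, start_after):
    """Scan cells > start_after; return (first live cell, cells touched)."""
    lo, hi = tombstone_range          # tombstone occludes lo <= ck < hi
    scanned = 0
    for ck in cells:
        if ck <= start_after:
            continue                   # range-constrained start: cheap skip
        scanned += 1                   # tombstoned cells still cost a read
        if not (lo <= ck < hi):        # not occluded by the tombstone
            return ck, scanned
    return None, scanned

# Cells ck=1..10, tombstone covering ck < 10 (i.e. [1, 10)), query ck > 5:
cell, touched = first_live_cell(range(1, 11), (1, 10), start_after=5)
print(cell, touched)  # 10 5 -> one live cell found after touching 5 cells
```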
>>>
>>> The reason the thresholds exist in cassandra.yaml is to help guide users
>>> away from performance anti-patterns that come from selects which include a
>>> large number of tombstoned cells.
>>>
>>> On Thu, Jul 28, 2016 at 11:08 PM Alain RODRIGUEZ 
>>> wrote:
>>>
 Hi,

 @Eric

> Large range tombstones can occupy just a few bytes but can occlude
> millions of records, and have the corresponding performance impact on
> reads.  It's really not the size of the tombstone on disk that matters,
> but the number of records it occludes.


 Sai was describing a timeout, not a failure due to the 100 K tombstone
 limit from cassandra.yaml. But I still might be missing things about
 tombstones.

 The read queries are continuously failing though because of the
> tombstones. "Request did not complete within rpc_timeout."
>

 So that is what looks weird to me. Reading 220 KB, even with
 tombstones, should probably not take that long... Or am I wrong or missing
 something?

 Your talk looks like cool stuff :-).

 @Sai

> The issue here was that the tombstones were not in the SSTable, but rather
> in the Memtable


 This sounds weird to me as well, given that memory is faster than
 disk and that memtables hold mutable data (so less stuff to read from
 there). Flushing might have triggered a compaction that removed the
 tombstones, though.

 This still sounds very weird to me but I am glad you solved your issue
 (temporary at least).

 C*heers,
 --

Re: Re: Purging tombstones from a particular row in SSTable

2016-07-30 Thread Eric Stevens
Aah yep, you're right, that appears to be exactly the optimization I was
thinking would be possible, but hadn't encountered.

Of course, the caveat is that it only works if you are deleting entire
partitions and then continuing to write to that partition after the fact.  If
you do a partial delete (some part of your clustering key is in your DELETE
... WHERE clause), this optimization won't apply.
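
In other words (a made-up helper, just to state the rule as code): the skip
only applies when the DELETE's WHERE clause restricts nothing beyond the
partition key:

```python
# Illustrative only: the optimization discussed above applies to a
# whole-partition delete (WHERE restricts only the partition key columns),
# not to a partial delete that also restricts clustering columns.

def delete_creates_partition_tombstone(where_columns, partition_key_columns):
    """True if the DELETE targets the entire partition."""
    return set(where_columns) == set(partition_key_columns)

# DELETE FROM foo WHERE pk='a'            -> partition tombstone, skip applies
print(delete_creates_partition_tombstone({"pk"}, {"pk"}))        # True
# DELETE FROM foo WHERE pk='a' AND ck<10  -> range tombstone, no skip
print(delete_creates_partition_tombstone({"pk", "ck"}, {"pk"}))  # False
```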

This has been around for a while, it seems it was introduced all the way
back in 1.1 with https://issues.apache.org/jira/browse/CASSANDRA-4116

