Looks like skipping SSTables based on max SSTable timestamp is possible if you have the partition deletion info:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L538-L550

But it doesn't say anything about iterating over all the cells of a single partition when there is a partition tombstone. I need to dig further.

On Sat, Jul 30, 2016 at 2:03 AM, Eric Stevens <migh...@gmail.com> wrote:

> I haven't tested that specifically, but I haven't bumped into any particular optimization that allows it to skip reading an sstable where the entire relevant partition has been row-tombstoned. It's possible that something like that could happen by examining min/max timestamps on sstables, and not reading from any sstable whose max timestamp is less than the timestamp of the partition-level tombstone. However, that presumes it can have read the tombstones from each sstable before it read the occluded data, which I don't think is likely.
>
> Such an optimization could be there, but I haven't noticed it if it is, though I'm certainly not an expert (more of a well-informed novice). If someone wants to set me straight on this point I'd love to know about it.
>
> On Fri, Jul 29, 2016 at 2:37 PM DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> @Eric
>>
>> Very interesting example. But then what about row (should I say partition?) tombstones?
>>
>> Suppose that in your example I issued DELETE FROM foo WHERE pk='a'.
>>
>> With the same SELECT statement as before, would C* be clever enough to skip reading the whole partition entirely (let's limit the example to a single SSTable)?
>>
>> On Fri, Jul 29, 2016 at 7:00 PM, Eric Stevens <migh...@gmail.com> wrote:
>>
>>> > Sai was describing a timeout, not a failure due to the 100 K tombstone limit from cassandra.yaml. But I still might be missing things about tombstones.
>>>
>>> The trouble with tombstones is not the tombstones themselves; rather, it's that Cassandra has a lot of deleted data to read through in sstables in order to satisfy a query. Although the read path can optimize a read to start somewhere near the correct head of your selected data if you range-constrain your clustering key in your query, that is _not_ true for tombstoned data.
>>>
>>> Consider this exercise:
>>>
>>> CREATE TABLE foo (
>>>   pk text,
>>>   ck int,
>>>   PRIMARY KEY ((pk), ck)
>>> );
>>> INSERT INTO foo (pk, ck) VALUES ('a', 1);
>>> ...
>>> INSERT INTO foo (pk, ck) VALUES ('a', 100000);
>>>
>>> $ nodetool flush
>>>
>>> DELETE FROM foo WHERE pk='a' AND ck < 100000;
>>>
>>> We've now written a single "tiny" (bytes-wise) range tombstone.
>>>
>>> Now try to select from that table:
>>>
>>> SELECT * FROM foo WHERE pk='a' AND ck > 50000 LIMIT 1;
>>>
>>>  pk | ck
>>> ----+--------
>>>   a | 100000
>>>
>>> This has to read from the flushed sstable, skipping over 49999 records before it can locate the first non-tombstoned cell.
>>>
>>> The problem isn't the size of the tombstone; tombstones themselves are cheaper (again, bytes-wise) than standard columns because they don't carry any value for the cell. The problem is that the read path cannot anticipate in advance which cells are going to be occluded by the tombstone, and in order to satisfy the query it needs to read and then discard a large number of deleted cells.
>>>
>>> The reason the thresholds exist in cassandra.yaml is to help guide users away from the performance anti-patterns that come from selects which include a large number of tombstoned cells.
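>>>
>>> For reference, those guardrails are the two tombstone thresholds in cassandra.yaml; something like this shows them (the config path is a guess and varies between package and tarball installs):
>>>
>>> grep -E 'tombstone_(warn|failure)_threshold' /etc/cassandra/cassandra.yaml
>>> # typical defaults:
>>> #   tombstone_warn_threshold: 1000      -> log a warning for the query
>>> #   tombstone_failure_threshold: 100000 -> abort the query (TombstoneOverwhelmingException)
>>>
>>> A query can also simply time out, as in Sai's case, without tripping the failure threshold; the cost is the scan itself.
>>>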
>>> On Thu, Jul 28, 2016 at 11:08 PM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> @Eric
>>>>
>>>>> Large range tombstones can occupy just a few bytes but can occlude millions of records, and have the corresponding performance impact on reads. It's really not the size of the tombstone on disk that matters, but the number of records it occludes.
>>>>
>>>> Sai was describing a timeout, not a failure due to the 100 K tombstone limit from cassandra.yaml. But I still might be missing things about tombstones.
>>>>
>>>>> The read queries are continuously failing though because of the tombstones. "Request did not complete within rpc_timeout."
>>>>
>>>> So that is what looks weird to me. Reading 220 KB, even if it holds tombstones, should probably not take that long... Or am I wrong or missing something?
>>>>
>>>> Your talk looks like cool stuff :-).
>>>>
>>>> @Sai
>>>>
>>>>> The issue here was that the tombstones were not in the SSTable, but rather in the Memtable.
>>>>
>>>> This sounds weird to me as well, knowing that memory is faster than disk and that memtables hold mutable data (so there is less stuff to read from there). Flushing might have triggered a compaction that removed the tombstones, though.
>>>>
>>>> This still sounds very weird to me, but I am glad you solved your issue (temporarily at least).
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - al...@thelastpickle.com
>>>> France
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>> 2016-07-29 3:25 GMT+02:00 Eric Stevens <migh...@gmail.com>:
>>>>
>>>>> Tombstones will not get removed even after gc_grace if bloom filters indicate that there is data overlapping the tombstone's partition in a different sstable. This is because compaction can't be certain that the tombstone doesn't overlap data in that other sstable. If you're writing to one end of a partition while deleting off the other end (for example, you've engaged in the queue anti-pattern), your tombstones will essentially never go away.
>>>>>
>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>
>>>>> Large range tombstones can occupy just a few bytes but can occlude millions of records, and have the corresponding performance impact on reads. It's really not the size of the tombstone on disk that matters, but the number of records it occludes.
>>>>>
>>>>> To get rid of those tombstones, you must either do a full compaction (while also not writing to the partitions being considered, and after you've forced a cluster-wide flush, and after the tombstones are gc_grace old, and assuming size-tiered and not leveled compaction), or, probably easier, do something like sstable2json, remove the tombstones by hand, then json2sstable and replace the offending sstable. Note that you really have to be certain of what you're doing here or you'll end up resurrecting deleted records.
>>>>>
>>>>> If these all sound like bad options, it's because they are, and you don't have a lot of options without changing your schema to eventually stop writing to (and especially reading from) partitions which you also do deletes on. https://issues.apache.org/jira/browse/CASSANDRA-7019 proposes to offer a better alternative, but it's still in progress.
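>>>>>
>>>>> If anyone does go down that sstable2json route, the rough shape of it for a 2.0-era install is something like the sketch below. Keyspace, table, and file names are made up for illustration, and I haven't verified the exact paths on your install; work on copies, make sure the tombstones are past gc_grace, and only swap files while the node is stopped:
>>>>>
>>>>> # flush memtables so everything we care about is on disk
>>>>> nodetool flush my_ks my_cf
>>>>>
>>>>> # dump the offending sstable to JSON (2.0.x file naming shown)
>>>>> sstable2json /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-42-Data.db > /tmp/my_cf.json
>>>>>
>>>>> # hand-edit /tmp/my_cf.json to remove the tombstoned rows/cells you want gone
>>>>>
>>>>> # rebuild an sstable from the edited JSON
>>>>> json2sstable -K my_ks -c my_cf /tmp/my_cf.json /tmp/rebuilt/my_ks-my_cf-jb-42-Data.db
>>>>>
>>>>> # stop the node, swap the rebuilt sstable in for the original, start the node again
>>>>>
>>>>> Again, this only makes sense if you're certain those tombstones aren't shadowing live data in another sstable, otherwise you'll resurrect deleted records.
>>>>>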
>>>>> Shameless plug: I'm talking about my company's alternative to tombstones and TTLs at this year's Cassandra Summit:
>>>>> http://myeventagenda.com/sessions/1CBFC920-807D-41C1-942C-8D1A7C10F4FA/5/5#sessionID=165
>>>>>
>>>>> On Thu, Jul 28, 2016 at 11:07 AM sai krishnam raju potturi <pskraj...@gmail.com> wrote:
>>>>>
>>>>>> thanks a lot Alain. That was really great info.
>>>>>>
>>>>>> The issue here was that the tombstones were not in the SSTable, but rather in the Memtable. We had to do a nodetool flush and run a nodetool compact to get rid of the tombstones, a million of them. The size of the largest SSTable was actually 48 MB.
>>>>>>
>>>>>> This link was helpful in getting the count of tombstones in an sstable, which was 0 in our case:
>>>>>> https://gist.github.com/JensRantil/063b7c56ca4a8dfe1c50
>>>>>>
>>>>>> The application team did not have a good data model. They are working on a new one.
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> On Wed, Jul 27, 2016 at 7:17 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just released a detailed post about tombstones today that might be of some interest to you:
>>>>>>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
>>>>>>>
>>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> I believe you might be missing some other, bigger SSTable that also holds a lot of tombstones. Finding the biggest sstable and reading the tombstone ratio from it might be more relevant.
>>>>>>>
>>>>>>> You should also give "unchecked_tombstone_compaction" set to true a try, rather than tuning the other options so aggressively. The "single SSTable compaction" section of my post might help you with this issue:
>>>>>>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#single-sstable-compaction
>>>>>>>
>>>>>>> Other thoughts:
>>>>>>>
>>>>>>> Also, if you use TTLs and time series, using TWCS instead of STCS could be more efficient at evicting tombstones.
>>>>>>>
>>>>>>>> we have a columnfamily that has around 1000 rows, with one row that is really huge (a million columns)
>>>>>>>
>>>>>>> I am sorry to say that this model does not look that great. Imbalances might become an issue, as a few nodes will handle a lot more load than the rest. Also, even if this is getting improved in newer versions of Cassandra, wide rows are something you want to avoid while using 2.0.14 (which has been unsupported for about a year now). I know it is not always easy and never a good time, but maybe you should consider upgrading both your model and your version of Cassandra (whether or not you manage to solve this issue with "unchecked_tombstone_compaction").
>>>>>>>
>>>>>>> Good luck,
>>>>>>>
>>>>>>> C*heers,
>>>>>>> -----------------------
>>>>>>> Alain Rodriguez - al...@thelastpickle.com
>>>>>>> France
>>>>>>>
>>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> 2016-07-28 0:00 GMT+02:00 sai krishnam raju potturi <pskraj...@gmail.com>:
>>>>>>>
>>>>>>>> The read queries are continuously failing though because of the tombstones. "Request did not complete within rpc_timeout."
>>>>>>>>
>>>>>>>> thanks
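>>>>>>>>
>>>>>>>> (A quick way to act on the "check the biggest sstable" advice above: list the data files by size and look at the droppable-tombstone estimate on the largest one. The paths below are placeholders, and sstablemetadata ships under tools/bin in some 2.0 packages, so its location may differ on your install.)
>>>>>>>>
>>>>>>>> # largest data files first
>>>>>>>> ls -lS /var/lib/cassandra/data/my_ks/my_cf/*-Data.db | head -5
>>>>>>>>
>>>>>>>> # per-sstable stats; look for the "Estimated droppable tombstones" line
>>>>>>>> sstablemetadata /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-42-Data.db | grep -i tombstone
>>>>>>>>
>>>>>>>> (A 48 MB sstable with a high droppable ratio is a better single-sstable compaction candidate than the 220 KB one.)
>>>>>>>>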
>>>>>>>> On Wed, Jul 27, 2016 at 5:51 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>
>>>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>>>>>
>>>>>>>>> From: sai krishnam raju potturi <pskraj...@gmail.com>
>>>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>>>> Date: Wednesday, July 27, 2016 at 2:43 PM
>>>>>>>>> To: Cassandra Users <user@cassandra.apache.org>
>>>>>>>>> Subject: Re: Re : Purging tombstones from a particular row in SSTable
>>>>>>>>>
>>>>>>>>> and also, the sstable in question is only about 220 KB in size.
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 5:41 PM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> it's set to 1800, Vinay.
>>>>>>>>>
>>>>>>>>> bloom_filter_fp_chance=0.010000 AND
>>>>>>>>> caching='KEYS_ONLY' AND
>>>>>>>>> comment='' AND
>>>>>>>>> dclocal_read_repair_chance=0.100000 AND
>>>>>>>>> gc_grace_seconds=1800 AND
>>>>>>>>> index_interval=128 AND
>>>>>>>>> read_repair_chance=0.000000 AND
>>>>>>>>> replicate_on_write='true' AND
>>>>>>>>> populate_io_cache_on_flush='false' AND
>>>>>>>>> default_time_to_live=0 AND
>>>>>>>>> speculative_retry='99.0PERCENTILE' AND
>>>>>>>>> memtable_flush_period_in_ms=0 AND
>>>>>>>>> compaction={'min_sstable_size': '1024', 'tombstone_threshold': '0.01', 'tombstone_compaction_interval': '1800', 'class': 'SizeTieredCompactionStrategy'} AND
>>>>>>>>> compression={'sstable_compression': 'LZ4Compressor'};
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 5:34 PM, Vinay Kumar Chella <vinaykumar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> What is your gc_grace_seconds set to?
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 1:13 PM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> thanks Vinay and DuyHai.
>>>>>>>>>
>>>>>>>>> we are using version 2.0.14. I did a "user defined compaction" following the instructions in the link below. The tombstones still persist even after that.
>>>>>>>>>
>>>>>>>>> https://gist.github.com/jeromatron/e238e5795b3e79866b83
>>>>>>>>>
>>>>>>>>> Also, we changed tombstone_compaction_interval to 1800 and tombstone_threshold to 0.1, but it did not help.
>>>>>>>>>
>>>>>>>>> thanks
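>>>>>>>>>
>>>>>>>>> (To act on the unchecked_tombstone_compaction suggestion above: the change is a single ALTER TABLE. The keyspace/table names below are placeholders, and since the compaction map is replaced wholesale, carry over any existing options, such as your tombstone_threshold, that you want to keep.)
>>>>>>>>>
>>>>>>>>> cat > /tmp/single_sstable_tombstones.cql <<'EOF'
>>>>>>>>> ALTER TABLE my_ks.my_cf WITH compaction = {
>>>>>>>>>   'class': 'SizeTieredCompactionStrategy',
>>>>>>>>>   'min_sstable_size': '1024',
>>>>>>>>>   'unchecked_tombstone_compaction': 'true'
>>>>>>>>> };
>>>>>>>>> EOF
>>>>>>>>> cqlsh -f /tmp/single_sstable_tombstones.cql
>>>>>>>>>
>>>>>>>>> (As I understand it, this only lets the single-sstable tombstone compaction be triggered despite the overlap pre-check; tombstones still need to be older than gc_grace_seconds, 1800 here, to actually be purged.)
>>>>>>>>>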
>>>>>>>>> On Wed, Jul 27, 2016 at 4:05 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> This feature is also exposed directly in nodetool as of Cassandra 3.4:
>>>>>>>>>
>>>>>>>>> nodetool compact --user-defined <SSTable file>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 9:58 PM, Vinay Chella <vche...@netflix.com> wrote:
>>>>>>>>>
>>>>>>>>> You can run a file-level compaction using JMX to get rid of the tombstones in one SSTable. Ensure you set gc_grace_seconds such that:
>>>>>>>>>
>>>>>>>>> current time >= deletion (tombstone) time + gc_grace_seconds
>>>>>>>>>
>>>>>>>>> File-level compaction:
>>>>>>>>>
>>>>>>>>> /usr/bin/java -jar cmdline-jmxclient-0.10.3.jar - localhost:${PORT} org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction="'${KEYSPACE}','${SSTABLEFILENAME}'"
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 11:59 AM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> hi;
>>>>>>>>>
>>>>>>>>> we have a columnfamily that has around 1000 rows, with one row that is really huge (a million columns). 95% of that row consists of tombstones. Since there exists just one SSTable, no compaction is going to kick in. Is there any way we can get rid of the tombstones in that row?
>>>>>>>>>
>>>>>>>>> Neither user-defined compaction nor nodetool compact had any effect. Any ideas, folks?
>>>>>>>>>
>>>>>>>>> thanks