I think Aaron Morton may have talked about a couple of optimizations that went into 3.0 for these use cases. I don't have a link handy (on my phone, typing quickly) but I think it's on The Last Pickle blog.

On Fri, Jul 29, 2016 at 1:37 PM DuyHai Doan <doanduy...@gmail.com> wrote:
> @Eric

> Very interesting example. But then what about the case of row (should I say partition?) tombstones?

> Suppose that in your example, I issued a DELETE FROM foo WHERE pk='a'

> With the same SELECT statement as before, would C* be clever enough to skip reading the whole partition entirely (let's limit the example to a single SSTable)?

> On Fri, Jul 29, 2016 at 7:00 PM, Eric Stevens <migh...@gmail.com> wrote:

>> > Sai was describing a timeout, not a failure due to the 100 K tombstone limit from cassandra.yaml. But I still might be missing things about tombstones.

>> The trouble with tombstones is not the tombstones themselves; rather, it's that Cassandra has a lot of deleted data to read through in sstables in order to satisfy a query. Although if you range-constrain your clustering key in your query, the read path can optimize that read to start somewhere near the correct head of your selected data, that is _not_ true for tombstoned data.

>> Consider this exercise:

>> CREATE TABLE foo (
>>   pk text,
>>   ck int,
>>   PRIMARY KEY ((pk), ck)
>> );
>> INSERT INTO foo (pk, ck) VALUES ('a', 1);
>> ...
>> INSERT INTO foo (pk, ck) VALUES ('a', 100000);

>> $ nodetool flush

>> DELETE FROM foo WHERE pk='a' AND ck <= 99999;

>> We've now written a single "tiny" (bytes-wise) range tombstone.

>> Now try to select from that table:

>> SELECT * FROM foo WHERE pk='a' AND ck > 50000 LIMIT 1;

>>  pk | ck
>> ----+--------
>>   a | 100000

>> This has to read from the first sstable, skipping over 49999 records before it can locate the first non-tombstoned cell.

>> The problem isn't the size of the tombstone; tombstones themselves are cheaper (again, bytes-wise) than standard columns because they don't involve any value for the cell. The problem is that the read path cannot anticipate in advance which cells are going to be occluded by the tombstone, and in order to satisfy the query it needs to read and then discard a large number of deleted cells.

>> The reason the thresholds exist in cassandra.yaml is to help guide users away from performance anti-patterns that come from selects which include a large number of tombstoned cells.
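A rough way to see this scanning cost for yourself (a minimal sketch, assuming cqlsh against the foo table above with tracing enabled):

cqlsh> TRACING ON;
cqlsh> SELECT * FROM foo WHERE pk='a' AND ck > 50000 LIMIT 1;

The trace output normally includes a line roughly of the form "Read 1 live and 49999 tombstoned cells"; those same counters feed the tombstone_warn_threshold and tombstone_failure_threshold settings in cassandra.yaml that Eric mentions.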
>> On Thu, Jul 28, 2016 at 11:08 PM Alain RODRIGUEZ <arodr...@gmail.com> wrote:

>>> Hi,

>>> @Eric

>>>> Large range tombstones can occupy just a few bytes but can occlude millions of records, and have the corresponding performance impact on reads. It's really not the size of the tombstone on disk that matters, but the number of records it occludes.

>>> Sai was describing a timeout, not a failure due to the 100 K tombstone limit from cassandra.yaml. But I still might be missing things about tombstones.

>>>> The read queries are continuously failing though because of the tombstones. "Request did not complete within rpc_timeout."

>>> So that is what looks weird to me. Reading 220 kb, even holding tombstones, should probably not take that long... Or am I wrong or missing something?

>>> Your talk looks like cool stuff :-).

>>> @Sai

>>>> The issue here was that tombstones were not in the SSTable, but rather in the Memtable

>>> This sounds weird to me as well, knowing that memory is faster than disk and that memtables hold mutable data (so less stuff to read from there). Flushing might have triggered some compaction, removing tombstones, though.

>>> This still sounds very weird to me, but I am glad you solved your issue (temporarily at least).

>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France

>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com

>>> 2016-07-29 3:25 GMT+02:00 Eric Stevens <migh...@gmail.com>:

>>>> Tombstones will not get removed even after gc_grace if bloom filters indicate that there is overlapping data with the tombstone's partition in a different sstable. This is because compaction can't be certain that the tombstone doesn't overlap data in that other sstable. If you're writing to one end of a partition while deleting off the other end (for example, you've engaged in the queue anti-pattern), your tombstones will essentially never go away.

>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.

>>>> Large range tombstones can occupy just a few bytes but can occlude millions of records, and have the corresponding performance impact on reads. It's really not the size of the tombstone on disk that matters, but the number of records it occludes.

>>>> You must either do a full compaction (while also not writing to the partitions being considered, and after you've forced a cluster-wide flush, and after the tombstones are gc_grace old, and assuming size-tiered and not leveled compaction) to get rid of those tombstones, or, probably easier, do something similar to sstable2json, remove the tombstones by hand, then json2sstable and replace the offending sstable. Note that you really have to be certain of what you're doing here or you'll end up resurrecting deleted records.

>>>> If these all sound like bad options, it's because they are, and you don't have a lot of options without changing your schema to eventually stop writing to (and especially reading from) partitions which you also do deletes on. https://issues.apache.org/jira/browse/CASSANDRA-7019 proposes to offer a better alternative, but it's still in progress.
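For anyone weighing the sstable2json route Eric describes, a rough outline of the workflow (a sketch only, using hypothetical keyspace/table/file names and the sstable2json/json2sstable tools shipped with Cassandra 2.0):

$ nodetool flush my_ks my_cf
$ sstable2json /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-1-Data.db > my_cf.json
# hand-edit my_cf.json to remove the tombstoned cells and range tombstone entries, then:
$ json2sstable -K my_ks -c my_cf my_cf.json /tmp/my_ks-my_cf-jb-1-Data.db
# stop the node, swap the generated sstable in place of the old one, and start it again

As Eric notes, if a removed tombstone still shadows data in another sstable or on another replica, editing it out by hand will resurrect that data, so treat this as a last resort.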
>>>> Shameless plug: I'm talking about my company's alternative to tombstones and TTLs at this year's Cassandra Summit: http://myeventagenda.com/sessions/1CBFC920-807D-41C1-942C-8D1A7C10F4FA/5/5#sessionID=165

>>>> On Thu, Jul 28, 2016 at 11:07 AM sai krishnam raju potturi <pskraj...@gmail.com> wrote:

>>>>> thanks a lot Alain. That was really great info.

>>>>> The issue here was that tombstones were not in the SSTable, but rather in the Memtable. We had to do a nodetool flush and run a nodetool compact to get rid of the tombstones, a million of them. The size of the largest SSTable was actually 48MB.

>>>>> This link was helpful in getting the count of tombstones in an sstable, which was 0 in our case: https://gist.github.com/JensRantil/063b7c56ca4a8dfe1c50

>>>>> The application team did not have a good data model. They are working on a new data model.

>>>>> thanks

>>>>> On Wed, Jul 27, 2016 at 7:17 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

>>>>>> Hi,

>>>>>> I just released a detailed post about tombstones today that might be of some interest for you: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.

>>>>>> +1

>>>>>> I believe you might be missing some other, bigger SSTable having a lot of tombstones as well. Finding the biggest sstable and reading the tombstone ratio from there might be more relevant.

>>>>>> You should also give "unchecked_tombstone_compaction" set to true a try rather than tuning other options so aggressively. The "single SSTable compaction" section of my post might help you on this issue: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#single-sstable-compaction

>>>>>> Other thoughts:

>>>>>> Also, if you use TTLs and time series, using TWCS instead of STCS could be more efficient at evicting tombstones.

>>>>>>> we have a columnfamily that has around 1000 rows, with one row that is really huge (a million columns)

>>>>>> I am sorry to say that this model does not look that great. Imbalances might become an issue as a few nodes will handle a lot more load than the rest of the nodes. Also, even if this is getting improved in newer versions of Cassandra, wide rows are something you want to avoid while using 2.0.14 (which has not been supported for about a year now). I know it is not always easy and never the right time, but maybe you should consider upgrading both your model and your version of Cassandra (regardless of whether or not you manage to solve this issue with "unchecked_tombstone_compaction").

>>>>>> Good luck,

>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - al...@thelastpickle.com
>>>>>> France

>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
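Concretely, the single-SSTable tombstone compaction Alain recommends can be enabled with something like this (a sketch using a placeholder keyspace/table; note that altering the compaction map replaces it entirely, so the class and any sub-options you rely on have to be restated):

ALTER TABLE my_ks.my_cf WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'unchecked_tombstone_compaction': 'true'
};

With this set, Cassandra will run single-SSTable compactions against sstables whose estimated droppable tombstone ratio exceeds tombstone_threshold, without first checking whether the tombstones overlap data in other sstables.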
>>>>>> 2016-07-28 0:00 GMT+02:00 sai krishnam raju potturi <pskraj...@gmail.com>:

>>>>>>> The read queries are continuously failing though because of the tombstones. "Request did not complete within rpc_timeout."

>>>>>>> thanks

>>>>>>> On Wed, Jul 27, 2016 at 5:51 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

>>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.

>>>>>>>> From: sai krishnam raju potturi <pskraj...@gmail.com>
>>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>>> Date: Wednesday, July 27, 2016 at 2:43 PM
>>>>>>>> To: Cassandra Users <user@cassandra.apache.org>
>>>>>>>> Subject: Re: Re : Purging tombstones from a particular row in SSTable

>>>>>>>> and also the sstable in question is only about 220 kb in size.

>>>>>>>> thanks

>>>>>>>> On Wed, Jul 27, 2016 at 5:41 PM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:

>>>>>>>> it's set to 1800, Vinay.

>>>>>>>> bloom_filter_fp_chance=0.010000 AND
>>>>>>>> caching='KEYS_ONLY' AND
>>>>>>>> comment='' AND
>>>>>>>> dclocal_read_repair_chance=0.100000 AND
>>>>>>>> gc_grace_seconds=1800 AND
>>>>>>>> index_interval=128 AND
>>>>>>>> read_repair_chance=0.000000 AND
>>>>>>>> replicate_on_write='true' AND
>>>>>>>> populate_io_cache_on_flush='false' AND
>>>>>>>> default_time_to_live=0 AND
>>>>>>>> speculative_retry='99.0PERCENTILE' AND
>>>>>>>> memtable_flush_period_in_ms=0 AND
>>>>>>>> compaction={'min_sstable_size': '1024', 'tombstone_threshold': '0.01', 'tombstone_compaction_interval': '1800', 'class': 'SizeTieredCompactionStrategy'} AND
>>>>>>>> compression={'sstable_compression': 'LZ4Compressor'};

>>>>>>>> thanks

>>>>>>>> On Wed, Jul 27, 2016 at 5:34 PM, Vinay Kumar Chella <vinaykumar...@gmail.com> wrote:

>>>>>>>> What is your gc_grace_seconds set to?

>>>>>>>> On Wed, Jul 27, 2016 at 1:13 PM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:

>>>>>>>> thanks Vinay and DuyHai.

>>>>>>>> we are using version 2.0.14. I did a "user defined compaction" following the instructions in the link below, but the tombstones still persist even after that: https://gist.github.com/jeromatron/e238e5795b3e79866b83

>>>>>>>> Also, we changed tombstone_compaction_interval to 1800 and tombstone_threshold to 0.1, but it did not help.

>>>>>>>> thanks

>>>>>>>> On Wed, Jul 27, 2016 at 4:05 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

>>>>>>>> This feature is also exposed directly in nodetool from Cassandra 3.4 onwards:

>>>>>>>> nodetool compact --user-defined <SSTable file>

>>>>>>>> On Wed, Jul 27, 2016 at 9:58 PM, Vinay Chella <vche...@netflix.com> wrote:

>>>>>>>> You can run a file-level compaction using JMX to get rid of the tombstones in one SSTable. Ensure gc_grace_seconds is set such that

>>>>>>>> current time >= deletion (tombstone) time + gc_grace_seconds

>>>>>>>> File-level compaction:

>>>>>>>> /usr/bin/java -jar cmdline-jmxclient-0.10.3.jar - localhost:${port} org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction="'${KEYSPACE}','${SSTABLEFILENAME}'"
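Before forcing that compaction, it can be worth confirming that the SSTable actually carries a meaningful proportion of droppable tombstones. A quick check (a sketch assuming the sstablemetadata tool from Cassandra's tools/bin directory, a hypothetical file name, and an illustrative ratio in the output):

$ tools/bin/sstablemetadata /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-1-Data.db | grep -i tombstone
Estimated droppable tombstones: 0.95

If that ratio is near zero, a forced user-defined compaction will have little to purge, however it is triggered.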
>>>>>>>> On Wed, Jul 27, 2016 at 11:59 AM, sai krishnam raju potturi <pskraj...@gmail.com> wrote:

>>>>>>>> hi;

>>>>>>>> we have a columnfamily that has around 1000 rows, with one row that is really huge (a million columns). 95% of the row consists of tombstones. Since there exists just one SSTable, there is going to be no compaction kicked in. Is there any way we can get rid of the tombstones in that row?

>>>>>>>> Neither user-defined compaction nor nodetool compact had any effect. Any ideas, folks?

>>>>>>>> thanks