Thanks Nate. I hadn't noticed that and it definitely explains it. It'd be nice to see that called out much more clearly. As we found out, the implications can be severe!
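For anyone else who hits this, here's a rough sketch of the kind of statement Nate is describing - restating the whole compaction map rather than only the option you mean to change. (It uses the table name and values from our episode below, not the exact command we ran; adjust for your own schema, e.g. from the 'show schema' output.)

    ALTER TABLE events
      WITH gc_grace_seconds = 60
      AND compaction = {'class': 'LeveledCompactionStrategy',
                        'sstable_size_in_mb': 256};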
-Josh

On Thursday, December 5, 2013 at 11:30 AM, Nate McCall wrote:

> Per the 256mb to 5mb change, check the very last section of this page:
> http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/alter_table_r.html
>
> "Changing any compaction or compression option erases all previous
> compaction or compression settings."
>
> In other words, you have to include the whole 'WITH' clause each time - in
> the future just grab the output from 'show schema' and add/modify as needed.
>
> I did not know this either until it happened to me as well - could probably
> stand to be a little bit more front-and-center, IMO.
>
>
> On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak <j...@keen.io> wrote:
> > We recently had a little Cassandra party I wanted to share and see if
> > anyone has notes to compare. Or can tell us what we did wrong or what we
> > could do better. :) Apologies in advance for the length of the narrative
> > here.
> >
> > Task at hand: Delete about 50% of the rows in a large column family (~8TB)
> > to reclaim some disk. These rows are used only for intermediate storage.
> >
> > Sequence of events:
> >
> > - Issue the actual deletes. This, obviously, was super-fast.
> > - Nothing happens yet, which makes sense. New tombstones are not
> > immediately compacted b/c of gc_grace_seconds.
> > - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.
> >
> > - Every node started working very hard. We saw disk space start to free
> > up. It was exciting.
> > - Eventually the compactions finished and we had gotten a ton of disk back.
> > - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
> > - We inspected the schema in CQL/OpsCenter etc. and sure enough
> > sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were
> > set at 256Mb, and all other CFs still were.
> >
> > - At 5Mb we had a huge number of SSTables. Our next goal was to get these
> > tables back to 256Mb.
> > - First step was to update the schema back to 256Mb.
> > - Figuring out how to do this in CQL was tricky, because CQL has gone
> > through a lot of changes recently and getting the docs for your version is
> > hard. Eventually we figured it out - ALTER TABLE events WITH
> > compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
> > - Out of our 12 nodes, 9 acknowledged the update. The others still showed
> > the old schema.
> > - The remaining 3 would not take the change. There was no extra load on
> > the systems, and operational status was very clean. All nodes could see
> > each other.
> > - For each of the remaining 3 we tried to update the schema through a
> > local cqlsh session. The same ALTER TABLE would just hang forever.
> > - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE
> > again. It worked this time. We finally had schema agreement.
> >
> > - Starting with just 1 node, we kicked off upgradesstables, hoping it
> > would rebuild the 5Mb tables into 256Mb tables.
> > - Nothing happened. This was (afaik) because the sstable size change
> > doesn't represent a new version of schema for the sstables, so existing
> > tables are ignored.
> > - We discovered the "-a" option for upgradesstables, which tells it to
> > skip the version check and just do all the tables anyway.
> > - We ran upgradesstables -a and things started happening. After a few
> > hours the pending compactions finished.
> > - Sadly, this node was now using 3x the disk it previously had.
> > Some sstables were now 256Mb, but not all. There were tens of thousands
> > of ~20Mb tables.
> > - A direct comparison to other nodes owning the same % of the ring showed
> > both the same number of sstables and the same ratio of 256Mb+ tables to
> > small tables. However, on a 'normal' node the small tables were all 5-6Mb,
> > and on the fat, upgraded node all the tables were 20Mb+. This was why the
> > fat node was taking up 3x disk overall.
> > - I tried to see what was in those 20Mb files relative to the 5Mb ones,
> > but sstable2json failed against our authenticated keyspace. I filed a bug
> > (https://issues.apache.org/jira/browse/CASSANDRA-6450).
> > - We had little choice here. We shut down the fat node, did a manual
> > delete of sstables, brought it back up and did a repair. It came back to
> > the right size.
> >
> > TL;DR / Our big questions are:
> > How could the schema have spontaneously changed from an sstable_size_in_mb
> > of 256Mb to 5Mb?
> > How could schema propagation have failed such that only 9 of 12 nodes got
> > the change even when the cluster was healthy? Why did updating the schema
> > locally hang until restart?
> > What could have happened inside of upgradesstables that created a node
> > with the same ring % but 3x the disk load?
> >
> > We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSDs, 12-node
> > cluster across 2 DCs. No compression, leveled compaction. Happy to provide
> > more details. Thanks in advance for any insights into what happened or any
> > best practices we missed during this episode.
> >
> > Best,
> > Josh
>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
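P.S. For anyone who ends up doing the same rebuild step, the rewrite described above was nodetool's upgradesstables with the include-all flag, along these lines (the keyspace name here is just a placeholder for ours):

    nodetool upgradesstables -a <keyspace> events

Without -a, sstables that are already on the current format version get skipped, which is why our first plain upgradesstables run appeared to do nothing.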