Nice work :)

A
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 9/06/2012, at 1:48 AM, Luke Hospadaruk wrote:

> Follow-up:
> After adding the EBS volumes, I successfully compacted; the node that had ~1.3T is now down to about 400-500GB (some of that is compression savings). You're right about the load – lots of overwrites.
>
> I'm going to get things back off the EBS and add a couple more nodes (I've got 4 right now, maybe move up to 6 or 8 for the time being).
>
> I also plan on copying all my CFs to new ones to undo the major compaction. I've got some fairly minor schema changes in mind, so it's a good time to copy over my data anyway.
>
> Thanks for all the help, it's been very informative.
>
> Luke
>
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes not picking up data on repair, disk loaded unevenly
>
> I am now running major compactions on those nodes (and all is well so far).
> Major compaction in this situation will make things worse. When you end up with one big file you will need that much space again to compact / upgrade / re-write it.
>
> back down to a normal size, can I move all the data back off the ebs volumes? something along the lines of:
> Yup.
>
> Then add some more nodes to the cluster to keep this from happening in the future.
> Yerp. Get everything settled and repair running and it should be a simple operation.
>
> I assume all the files stored in any of the data directories are all uniquely named and cassandra won't really care where they are as long as everything it wants is in its data directories.
> Unique on each node.
>
> So it looks like I never got the tree from node #2 (the node which has particularly out of control disk usage).
> If you look at the logs for node #2 you will probably find an error. Or it may still be running, check nodetool compactionstats.
>
> -Is there any way to force replay of hints to empty this out – just a full cluster restart when everything is working again maybe?
> Normally I would say stop the nodes and delete the hints CFs. As you have deleted CFs from one of the nodes there is a risk of losing data though.
>
> If you have been working at CL QUORUM and have not been getting TimedOutException you can still delete the hints, as the writes they contain should be on at least one other node and they will be repaired by repair.
>
> I have a high replication factor and all my writes have been at cl=ONE (so all the data in the hints should actually exist in a CF somewhere right?).
> There is a chance that a write was only applied locally on the node that you deleted the data from, and it recorded hints to send to the other nodes. It's a remote chance but still there.
>
> how much working space does this need? Problem is that node #2 is so full I'm not sure any major rebuild or compaction will be successful. The other nodes seem to be handling things ok although they are still heavily loaded.
> upgradesstables processes one SSTable at a time, so it only needs enough space to re-write that SSTable.
>
> This is why major compaction hurts in these situations. If you have 1.5T of small files, you may have enough free space to re-write all the files. If you have a single 1.5T file you don't.
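For illustration, a rough sketch of the "move the data back off the EBS volumes" step that gets a "Yup" above. The mount point, data paths and service command are assumptions and will differ per install; the node should be drained and stopped first so no SSTables are being written:

    nodetool -h localhost drain            # flush memtables, stop accepting writes
    sudo service cassandra stop            # service name is an assumption

    # move the keyspace's SSTables from the temporary EBS mount back to the
    # primary data directory ("content" is the keyspace from this thread,
    # the paths are assumed defaults)
    mv /mnt/ebs/cassandra/data/content/* /var/lib/cassandra/data/content/

    # remove the EBS path from data_file_directories in cassandra.yaml,
    # then restart the node
    sudo service cassandra start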
> This cluster has a super high write load currently since I'm still building it out. I frequently update every row in my CFs.
> Sounds like a lot of overwrites. When you get compaction running it may purge a lot of data.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/06/2012, at 2:51 AM, Luke Hospadaruk wrote:
>
> Thanks for the tips.
>
> Some things I found looking around:
>
> Grepping the logs for a specific repair I ran yesterday:
>
> /var/log/cassandra# grep df14e460-af48-11e1-0000-e9014560c7bd system.log
> INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,303 AntiEntropyService.java (line 658) [repair #df14e460-af48-11e1-0000-e9014560c7bd] new session: will sync /4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, /2.xx.xx.xx on range (85070591730234615865843651857942052864,127605887595351923798765477786913079296] for content.[article2]
> INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,304 AntiEntropyService.java (line 837) [repair #df14e460-af48-11e1-0000-e9014560c7bd] requests for merkle tree sent for article2 (to [/4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, /2.xx.xx.xx])
> INFO [AntiEntropyStage:1] 2012-06-05 20:07:01,169 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /4.xx.xx.xx
> INFO [AntiEntropyStage:1] 2012-06-06 04:12:30,633 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /3.xx.xx.xx
> INFO [AntiEntropyStage:1] 2012-06-06 07:02:51,497 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /1.xx.xx.xx
>
> So it looks like I never got the tree from node #2 (the node which has particularly out of control disk usage).
>
> These are running on amazon m1.xlarge instances with all the EBS volumes raided together for a total of 1.7TB.
>
> What version are you using?
> 1.0
>
> Have there been times when nodes were down?
> Yes, but mostly just restarts, and mostly just one node at a time.
>
> Clear as much space as possible from the disk. Check for snapshots in all KS's.
> Already done.
>
> What KS's (including the system KS) are taking up the most space? Are there a lot of hints in the system KS (they are not replicated)?
> -There's just one KS that I'm actually using, which is taking up anywhere from about 650GB on the node I was able to scrub and compact (that sounds like the right size to me) to 1.3T on the node that is hugely bloated.
> -There are pretty huge hints CFs on all but one node (the node I deleted data from, although I did not delete any hints from there). They're between 175GB and 250GB depending on the node.
> -Is there any way to force replay of hints to empty this out – just a full cluster restart when everything is working again maybe?
> -Could I just disable hinted handoff and wipe out those tables? I realize I'll lose those hints, but that doesn't bother me terribly. I have a high replication factor and all my writes have been at cl=ONE (so all the data in the hints should actually exist in a CF somewhere, right?). Perhaps more importantly, if some data has been stalled in a hints table for a week I won't really miss it, since it basically doesn't exist right now. I can re-write any data that got lost (although that's not ideal).
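As a concrete illustration of the "stop the nodes and delete the hints CFs" option discussed above, a sketch assuming the default 1.0 data layout (verify the file names before deleting anything; this throws the queued hints away):

    # with the node stopped, stored hints are just SSTables in the system keyspace
    ls -lh /var/lib/cassandra/data/system/HintsColumnFamily-*

    # deleting them discards the queued hints on this node; with a high RF and
    # writes at CL ONE the data should still be on other replicas, and a later
    # repair covers any gaps
    rm /var/lib/cassandra/data/system/HintsColumnFamily-*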
> Try to get a feel for what CF's are taking up the space or not as the case may be. Look in nodetool cfstats to see how big the rows are.
> The hints table and my tables are the only things taking up any significant space on the system.
>
> If you have enabled compression run nodetool upgradesstables to compress them.
> How much working space does this need? Problem is that node #2 is so full I'm not sure any major rebuild or compaction will be successful. The other nodes seem to be handling things ok although they are still heavily loaded.
>
> In general, try to get free space on the nodes by using compaction, moving files to a new mount etc so that you can get repair to run.
> -I'll try adding an EBS volume or two to the bloated node and see if that allows me to successfully compact/repair.
> -If I add another volume to that node, then run some compactions and such to the point where everything fits on the main volume again, I may just replace that node with a new one. Can I move things off of the EBS volume and then kill it?
>
> Other thoughts/notes:
> This cluster has a super high write load currently since I'm still building it out. I frequently update every row in my CFs.
> I almost certainly need to add more capacity (more nodes). The general plan is to get everything sort of working first though; since repairs and such are currently failing it seems like a bad time to add more nodes.
>
> Thanks,
> Luke
>
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes not picking up data on repair, disk loaded unevenly
>
> You are basically in trouble. If you can nuke it and start again it would be easier. If you want to figure out how to get out of it, keep the cluster up and have a play.
>
> -What I think the solution should be:
> You want to get repair to work before you start deleting data.
>
> At ~840GB I'm probably running close to the max load I should have on a node,
> Roughly 300GB to 400GB is the max load.
>
> On node #1 I was able to successfully run a scrub and major compaction,
> In this situation running a major compaction is not what you want. It creates a huge file that can only be compacted if there is enough space for another huge file. Smaller files only need small space to be compacted.
>
> Is there something I should be looking for in the logs to verify that the repair was successful?
> grep for "repair command"
>
> The shortcut on EC2 is to add an EBS volume, tell cassandra it can store stuff there (in the yaml) and buy some breathing room.
>
> What version are you using?
>
> Have there been times when nodes were down?
>
> Clear as much space as possible from the disk. Check for snapshots in all KS's.
>
> What KS's (including the system KS) are taking up the most space? Are there a lot of hints in the system KS (they are not replicated)?
>
> Try to get a feel for what CF's are taking up the space or not as the case may be. Look in nodetool cfstats to see how big the rows are.
>
> If you have enabled compression run nodetool upgradesstables to compress them.
>
> In general, try to get free space on the nodes by using compaction, moving files to a new mount etc so that you can get repair to run.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
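The "tell cassandra it can store stuff there (in the yaml)" shortcut above comes down to adding the new mount to data_file_directories in cassandra.yaml; a minimal sketch, with the EBS mount point being an assumption:

    data_file_directories:
        - /var/lib/cassandra/data
        - /mnt/ebs/cassandra/data    # temporary EBS volume for breathing room

After a restart the node can write new SSTables to whichever directory has room; once things are compacted back down, the files can be moved off the EBS mount and the extra entry removed again.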
> On 6/06/2012, at 6:53 AM, Luke Hospadaruk wrote:
>
> I have a 4-node cluster with one keyspace (aside from the system keyspace) with the replication factor set to 4. The disk usage between the nodes is pretty wildly different and I'm wondering why. It's becoming a problem because one node is getting to the point where it sometimes fails to compact because it doesn't have enough space.
>
> I've been doing a lot of experimenting with the schema, adding/dropping things, changing settings around (not ideal I realize, but we're still in development).
>
> In an ideal world, I'd launch another cluster (this is all hosted in amazon), copy all the data to that, and just get rid of my current cluster, but the current cluster is in use by some other parties so rebuilding everything is impractical (although possible if it's the only reliable solution).
>
> $ nodetool -h localhost ring
> Address     DC         Rack   Status  State   Load       Owns    Token
>
> 1.xx.xx.xx  Cassandra  rack1  Up      Normal  837.8 GB   25.00%  0
> 2.xx.xx.xx  Cassandra  rack1  Up      Normal  1.17 TB    25.00%  42535295865117307932921825928971026432
> 3.xx.xx.xx  Cassandra  rack1  Up      Normal  977.23 GB  25.00%  85070591730234615865843651857942052864
> 4.xx.xx.xx  Cassandra  rack1  Up      Normal  291.2 GB   25.00%  127605887595351923798765477786913079296
>
> -Problems I'm having:
> Nodes are running out of space and are apparently unable to perform compactions because of it. These machines have 1.7T total space each.
>
> The logs for node #2 have a lot of warnings about insufficient space for compaction. Node #4 was so extremely out of space (cassandra was failing to start because of it) that I removed all the SSTables for one of the less essential column families just to bring it back online.
>
> I have (since I started noticing these issues) enabled compression for all my column families. On node #1 I was able to successfully run a scrub and major compaction, so I suspect that the disk usage for node #1 is about where all the other nodes should be. At ~840GB I'm probably running close to the max load I should have on a node, so I may need to launch more nodes into the cluster, but I'd like to get things straightened out before I introduce more potential issues (token moving, etc).
>
> Node #4 seems not to be picking up all the data it should have (since replication factor == number of nodes, the load should be roughly the same?). I've run repairs on that node to seemingly no avail (after repair finishes, it still has about the same disk usage, which is much too low).
>
> -What I think the solution should be:
> One node at a time:
> 1) nodetool drain the node
> 2) shut down cassandra on the node
> 3) wipe out all the data in my keyspace on the node
> 4) bring cassandra back up
> 5) nodetool repair
>
> -My concern:
> This is basically what I did with node #4 (although I didn't drain, and I didn't wipe the entire keyspace), and it doesn't seem to have regained all the data it's supposed to have after the repair. The column family should have at least 200-300GB of data, and the SSTables in the data directory only total about 11GB. Am I missing something?
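A rough shell rendering of the five-step per-node rebuild Luke lays out above (service command and paths are assumptions; "content" is the keyspace named in this thread, and only that keyspace is wiped, never the system keyspace):

    nodetool -h localhost drain                  # 1) flush and stop accepting writes
    sudo service cassandra stop                  # 2) shut down cassandra on the node
    rm -rf /var/lib/cassandra/data/content/*     # 3) wipe only the application keyspace
    sudo service cassandra start                 # 4) bring cassandra back up
    nodetool -h localhost repair content         # 5) repair, then watch logs/netstats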
> Is there a way to verify that a node _really_ has all the data it's supposed to have?
>
> I don't want to do this process to each node and discover at the end of it that I've lost a ton of data.
>
> Is there something I should be looking for in the logs to verify that the repair was successful? If I do a 'nodetool netstats' during the repair I don't see any streams going in or out of node #4.
>
> Thanks,
> Luke
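On the recurring question of how to tell whether a repair actually did anything, the checks suggested in this thread boil down to something like the following (log path and strings are from the 1.0-era setup described here):

    # did repair sessions start and complete? look for the session ids and the
    # "repair command" lines Aaron mentions
    grep "repair command" /var/log/cassandra/system.log
    grep AntiEntropy /var/log/cassandra/system.log | tail -n 50

    # while a repair is running, merkle tree building should show up as a
    # validation compaction, and out-of-sync ranges as streams between nodes
    nodetool -h localhost compactionstats
    nodetool -h localhost netstats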