Nice work :)

A
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 9/06/2012, at 1:48 AM, Luke Hospadaruk wrote:

> Follow-up:
> After adding the EBS volumes, I successfully compacted; the node that had ~1.3T is now down to about 400-500GB (some of that is compression savings). You're right about the load – lots of overwrites.
>
> I'm going to get things back off the EBS and add a couple more nodes (I've got 4 right now, maybe move up to 6 or 8 for the time being).
>
> I also plan on copying all my CFs to new ones to undo the major compaction. I've got some fairly minor schema changes in mind, so it's a good time to copy over my data anyway.
>
> Thanks for all the help, it's been very informative.
>
> Luke
>
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes not picking up data on repair, disk loaded unevenly
>
> I am now running major compactions on those nodes (and all is well so far).
> Major compaction in this situation will make things worse. When you end up with one big file you will need that much space again to compact / upgrade / re-write it.
>
> back down to a normal size, can I move all the data back off the ebs volumes? something along the lines of:
> Yup.
>
> Then add some more nodes to the cluster to keep this from happening in the future.
> Yerp. Get everything settled and repair running and it should be a simple operation.
>
> I assume all the files stored in any of the data directories are all uniquely named and cassandra won't really care where they are as long as everything it wants is in its data directories.
> Unique on each node.
>
> So it looks like I never got the tree from node #2 (the node which has particularly out of control disk usage).
> If you look at the logs for node #2 you will probably find an error. Or it may still be running, check nodetool compactionstats.
>
> -Is there any way to force replay of hints to empty this out – just a full cluster restart when everything is working again maybe?
> Normally I would say stop the nodes and delete the hints CFs. As you have deleted CFs from one of the nodes there is a risk of losing data though.
>
> If you have been working at CL QUORUM and have not been getting TimedOutException you can still delete the hints, as the writes they contain should be on at least one other node and they will be repaired by repair.
>
> I have a high replication factor and all my writes have been at cl=ONE (so all the data in the hints should actually exist in a CF somewhere right?).
> There is a chance that a write was only applied locally on the node that you deleted the data from, and it recorded hints to send to the other nodes. It's a remote chance but still there.
>
> how much working space does this need? Problem is that node #2 is so full I'm not sure any major rebuild or compaction will be successful. The other nodes seem to be handling things ok although they are still heavily loaded.
> upgradesstables processes one SSTable at a time, so it only needs enough space to re-write that SSTable.
>
> This is why major compaction hurts in these situations. If you have 1.5T of small files, you may have enough free space to re-write all the files. If you have a single 1.5T file you don't.
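For illustration, a rough sketch of the "move the data back off the EBS volumes" step that gets a "Yup" above. The mount point, data paths and service command are assumptions and will differ per install; the node should be drained and stopped first so no SSTables are being written:

    nodetool -h localhost drain            # flush memtables, stop accepting writes
    sudo service cassandra stop            # service name is an assumption

    # move the keyspace's SSTables from the temporary EBS mount back to the
    # primary data directory ("content" is the keyspace from this thread,
    # the paths are assumed defaults)
    mv /mnt/ebs/cassandra/data/content/* /var/lib/cassandra/data/content/

    # remove the EBS path from data_file_directories in cassandra.yaml,
    # then restart the node
    sudo service cassandra start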
> This cluster has a super high write load currently since I'm still building it out. I frequently update every row in my CFs.
> Sounds like a lot of overwrites. When you get compaction running it may purge a lot of data.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/06/2012, at 2:51 AM, Luke Hospadaruk wrote:
>
> Thanks for the tips.
>
> Some things I found looking around:
>
> Grepping the logs for a specific repair I ran yesterday:
>
> /var/log/cassandra# grep df14e460-af48-11e1-0000-e9014560c7bd system.log
> INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,303 AntiEntropyService.java (line 658) [repair #df14e460-af48-11e1-0000-e9014560c7bd] new session: will sync /4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, /2.xx.xx.xx on range (85070591730234615865843651857942052864,127605887595351923798765477786913079296] for content.[article2]
> INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,304 AntiEntropyService.java (line 837) [repair #df14e460-af48-11e1-0000-e9014560c7bd] requests for merkle tree sent for article2 (to [/4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, /2.xx.xx.xx])
> INFO [AntiEntropyStage:1] 2012-06-05 20:07:01,169 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /4.xx.xx.xx
> INFO [AntiEntropyStage:1] 2012-06-06 04:12:30,633 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /3.xx.xx.xx
> INFO [AntiEntropyStage:1] 2012-06-06 07:02:51,497 AntiEntropyService.java (line 190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for article2 from /1.xx.xx.xx
>
> So it looks like I never got the tree from node #2 (the node which has particularly out of control disk usage).
>
> These are running on amazon m1.xlarge instances with all the EBS volumes raided together for a total of 1.7TB.
>
> What version are you using?
> 1.0
>
> Have there been times when nodes were down?
> Yes, but mostly just restarts, and mostly just one node at a time.
>
> Clear as much space as possible from the disk. Check for snapshots in all KS's.
> Already done.
>
> What KS's (including the system KS) are taking up the most space? Are there a lot of hints in the system KS (they are not replicated)?
> -There's just one KS that I'm actually using, which is taking up anywhere from about 650GB on the node I was able to scrub and compact (that sounds like the right size to me) to 1.3T on the node that is hugely bloated.
> -There are pretty huge hints CFs on all but one node (the node I deleted data from, although I did not delete any hints from there). They're between 175GB and 250GB depending on the node.
> -Is there any way to force replay of hints to empty this out – just a full cluster restart when everything is working again maybe?
> -Could I just disable hinted handoff and wipe out those tables? I realize I'll lose those hints, but that doesn't bother me terribly. I have a high replication factor and all my writes have been at cl=ONE (so all the data in the hints should actually exist in a CF somewhere, right?). Perhaps more importantly, if some data has been stalled in a hints table for a week I won't really miss it, since it basically doesn't exist right now. I can re-write any data that got lost (although that's not ideal).
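As a concrete illustration of the "stop the nodes and delete the hints CFs" option discussed above, a sketch assuming the default 1.0 data layout (verify the file names before deleting anything; this throws the queued hints away):

    # with the node stopped, stored hints are just SSTables in the system keyspace
    ls -lh /var/lib/cassandra/data/system/HintsColumnFamily-*

    # deleting them discards the queued hints on this node; with a high RF and
    # writes at CL ONE the data should still be on other replicas, and a later
    # repair covers any gaps
    rm /var/lib/cassandra/data/system/HintsColumnFamily-*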
> Try to get a feel for what CF's are taking up the space or not as the case may be. Look in nodetool cfstats to see how big the rows are.
> The hints table and my tables are the only things taking up any significant space on the system.
>
> If you have enabled compression run nodetool upgradesstables to compress them.
> How much working space does this need? Problem is that node #2 is so full I'm not sure any major rebuild or compaction will be successful. The other nodes seem to be handling things ok although they are still heavily loaded.
>
> In general, try to get free space on the nodes by using compaction, moving files to a new mount etc so that you can get repair to run.
> -I'll try adding an EBS volume or two to the bloated node and see if that allows me to successfully compact/repair.
> -If I add another volume to that node, then run some compactions and such to the point where everything fits on the main volume again, I may just replace that node with a new one. Can I move things off of the EBS volume and then kill it?
>
> Other thoughts/notes:
> This cluster has a super high write load currently since I'm still building it out. I frequently update every row in my CFs.
> I almost certainly need to add more capacity (more nodes). The general plan is to get everything sort of working first though; since repairs and such are currently failing it seems like a bad time to add more nodes.
>
> Thanks,
> Luke
>
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes not picking up data on repair, disk loaded unevenly
>
> You are basically in trouble. If you can nuke it and start again it would be easier. If you want to figure out how to get out of it, keep the cluster up and have a play.
>
> -What I think the solution should be:
> You want to get repair to work before you start deleting data.
>
> At ~840GB I'm probably running close to the max load I should have on a node,
> Roughly 300GB to 400GB is the max load.
>
> On node #1 I was able to successfully run a scrub and major compaction,
> In this situation running a major compaction is not what you want. It creates a huge file that can only be compacted if there is enough space for another huge file. Smaller files only need small space to be compacted.
>
> Is there something I should be looking for in the logs to verify that the repair was successful?
> grep for "repair command"
>
> The shortcut on EC2 is to add an EBS volume, tell cassandra it can store stuff there (in the yaml) and buy some breathing room.
>
> What version are you using?
>
> Have there been times when nodes were down?
>
> Clear as much space as possible from the disk. Check for snapshots in all KS's.
>
> What KS's (including the system KS) are taking up the most space? Are there a lot of hints in the system KS (they are not replicated)?
>
> Try to get a feel for what CF's are taking up the space or not as the case may be. Look in nodetool cfstats to see how big the rows are.
>
> If you have enabled compression run nodetool upgradesstables to compress them.
>
> In general, try to get free space on the nodes by using compaction, moving files to a new mount etc so that you can get repair to run.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
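The "tell cassandra it can store stuff there (in the yaml)" shortcut above comes down to adding the new mount to data_file_directories in cassandra.yaml; a minimal sketch, with the EBS mount point being an assumption:

    data_file_directories:
        - /var/lib/cassandra/data
        - /mnt/ebs/cassandra/data    # temporary EBS volume for breathing room

After a restart the node can write new SSTables to whichever directory has room; once things are compacted back down, the files can be moved off the EBS mount and the extra entry removed again.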
> On 6/06/2012, at 6:53 AM, Luke Hospadaruk wrote:
>
> I have a 4-node cluster with one keyspace (aside from the system keyspace) with the replication factor set to 4. The disk usage between the nodes is pretty wildly different and I'm wondering why. It's becoming a problem because one node is getting to the point where it sometimes fails to compact because it doesn't have enough space.
>
> I've been doing a lot of experimenting with the schema, adding/dropping things, changing settings around (not ideal I realize, but we're still in development).
>
> In an ideal world, I'd launch another cluster (this is all hosted in amazon), copy all the data to that, and just get rid of my current cluster, but the current cluster is in use by some other parties so rebuilding everything is impractical (although possible if it's the only reliable solution).
>
> $ nodetool -h localhost ring
> Address     DC         Rack   Status  State   Load       Owns    Token
>
> 1.xx.xx.xx  Cassandra  rack1  Up      Normal  837.8 GB   25.00%  0
> 2.xx.xx.xx  Cassandra  rack1  Up      Normal  1.17 TB    25.00%  42535295865117307932921825928971026432
> 3.xx.xx.xx  Cassandra  rack1  Up      Normal  977.23 GB  25.00%  85070591730234615865843651857942052864
> 4.xx.xx.xx  Cassandra  rack1  Up      Normal  291.2 GB   25.00%  127605887595351923798765477786913079296
>
> -Problems I'm having:
> Nodes are running out of space and are apparently unable to perform compactions because of it. These machines have 1.7T total space each.
>
> The logs for node #2 have a lot of warnings about insufficient space for compaction. Node #4 was so extremely out of space (cassandra was failing to start because of it) that I removed all the SSTables for one of the less essential column families just to bring it back online.
>
> I have (since I started noticing these issues) enabled compression for all my column families. On node #1 I was able to successfully run a scrub and major compaction, so I suspect that the disk usage for node #1 is about where all the other nodes should be. At ~840GB I'm probably running close to the max load I should have on a node, so I may need to launch more nodes into the cluster, but I'd like to get things straightened out before I introduce more potential issues (token moving, etc).
>
> Node #4 seems not to be picking up all the data it should have (since replication factor == number of nodes, the load should be roughly the same?). I've run repairs on that node to seemingly no avail (after repair finishes, it still has about the same disk usage, which is much too low).
>
> -What I think the solution should be:
> One node at a time:
> 1) nodetool drain the node
> 2) shut down cassandra on the node
> 3) wipe out all the data in my keyspace on the node
> 4) bring cassandra back up
> 5) nodetool repair
>
> -My concern:
> This is basically what I did with node #4 (although I didn't drain, and I didn't wipe the entire keyspace), and it doesn't seem to have regained all the data it's supposed to have after the repair. The column family should have at least 200-300GB of data, and the SSTables in the data directory only total about 11GB. Am I missing something?
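A rough shell rendering of the five-step per-node rebuild Luke lays out above (service command and paths are assumptions; "content" is the keyspace named in this thread, and only that keyspace is wiped, never the system keyspace):

    nodetool -h localhost drain                  # 1) flush and stop accepting writes
    sudo service cassandra stop                  # 2) shut down cassandra on the node
    rm -rf /var/lib/cassandra/data/content/*     # 3) wipe only the application keyspace
    sudo service cassandra start                 # 4) bring cassandra back up
    nodetool -h localhost repair content         # 5) repair, then watch logs/netstats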
> Is there a way to verify that a node _really_ has all the data it's supposed to have?
>
> I don't want to do this process to each node and discover at the end of it that I've lost a ton of data.
>
> Is there something I should be looking for in the logs to verify that the repair was successful? If I do a 'nodetool netstats' during the repair I don't see any streams going in or out of node #4.
>
> Thanks,
> Luke
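On the recurring question of how to tell whether a repair actually did anything, the checks suggested in this thread boil down to something like the following (log path and strings are from the 1.0-era setup described here):

    # did repair sessions start and complete? look for the session ids and the
    # "repair command" lines Aaron mentions
    grep "repair command" /var/log/cassandra/system.log
    grep AntiEntropy /var/log/cassandra/system.log | tail -n 50

    # while a repair is running, merkle tree building should show up as a
    # validation compaction, and out-of-sync ranges as streams between nodes
    nodetool -h localhost compactionstats
    nodetool -h localhost netstats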