One last thought: what happens when you ctrl-c a nodetool repair? Does it stop the repair on the server? If not, then I think I have multiple repairs still running. Is there any way to check this?
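In case it's useful to anyone answering, something like the following should show whether anything repair-related is still active on a node. It only relies on the tpstats/netstats commands already discussed below, plus grepping the Cassandra system log; the log path is just what my packaged install uses, adjust as needed:

    # Active/pending work on the AntiEntropyStage usually means validation
    # compactions for a repair are still running
    nodetool -h localhost tpstats | grep -i antientropy

    # Recent AES / tree-request activity in the log
    # (adjust the path to wherever your system.log lives)
    grep -iE "AntiEntropyService|AEService" /var/log/cassandra/system.log | tail -20

    # Streams still flowing between nodes
    nodetool -h localhost netstats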
Thanks

2011/8/16 Philippe <watche...@gmail.com>

> Even more interesting behavior: a repair on a CF has consequences on other
> CFs. I didn't expect that.
>
> There are no writes being issued to the cluster, yet the logs indicate that:
>
>    - SSTableReader has opened dozens and dozens of files, most of them
>    unrelated to the CF being repaired
>    - compactions are taking place continuously on CFs other than the one
>    being repaired, even CFs in other keyspaces
>    - I see "Sending AEService tree" messages for CFs not being repaired.
>
> After a very long time, I got some AES messages indicating that streaming
> from node C had finished, and then many minutes after that, node B. And yet
> the pending stream count on node B hasn't changed.
>
> The *-data.db files for the CF being repaired are about 70MB on disk.
>
> Maybe when a stream is fully received on node B, netstats indicates that no
> streams are pending, but since they are not acknowledged, node A doesn't?
>
> 2011/8/16 Philippe <watche...@gmail.com>
>
>> I'm still trying different stuff. Here are my latest findings, maybe
>> someone will find them useful:
>>
>>    - I have been able to repair some small column families by issuing a
>>    repair [KS] [CF]. When testing on the ring with no writes at all, it
>>    still takes about 2 repairs to get "consistent" logs for all AES requests.
>>    - Launching a repair on the smallest CF of the biggest KS has triggered
>>    a flurry of compactions and streams. Some of those streams are for
>>    other CFs in that keyspace!?
>>    - During repairs (one at a time cluster-wide), I get 25-50% I/O wait and
>>    35-50% CPU usage on a 6-core, SATA-disk setup.
>>
>> What is surprising to me (bug?) is that netstats shows me streams going
>> from node A to node B at 0% progress. But netstats on node B doesn't show me
>> any streams coming in. I'm thinking that repairs may be never-ending and
>> that may be messing up my compactions, hence the huge pile-up of compactions
>> until the disk fills up.
>> I know there's an issue related to failed streams & repairs, could I be
>> hitting it?
>>
>> Thanks
>>
>> 2011/8/14 Philippe <watche...@gmail.com>
>>
>>> @Teijo: thanks for the procedure, I hope I won't have to do that.
>>>
>>> Peter, I'll answer inline. Thanks for the detailed answer.
>>>
>>>> > the number of SSTables for some keyspaces goes dramatically up (from
>>>> > 3 or 4 to several dozens).
>>>>
>>>> Typically with a long-running compaction, such as that triggered by
>>>> repair, that's what happens as flushed memtables accumulate. In
>>>> particular for memtables with frequent flushes.
>>>>
>>>> Are you running with concurrent compaction enabled?
>>>
>>> Yes, it is enabled. On my 0.8 cluster, cassandra.yaml has this (it's
>>> commented out). BTW, I have 6 cores on each server.
>>>
>>> #concurrent_compactors: 1
>>>
>>>> > the commit log keeps increasing in size, I'm at 4.3G now, it went up
>>>> > to 40G when the compaction was throttled at 16MB/s. On the other nodes
>>>> > it's around 1GB at most
>>>>
>>>> Hmmmm. The commit log should not be retained longer than what is
>>>> required for memtables to be flushed. Is it possible you have had an
>>>> out-of-disk condition and flushing has stalled? Are you seeing flushes
>>>> happening in the log?
>>>
>>> No, I don't believe there was ever an out-of-disk condition. Yes, it is
>>> flushing for the first couple of hours.
>>> Then, when repair seems locked up, my log is mostly filled with lines
>>> such as this:
>>> INFO [ScheduledTasks:1] 2011-08-14 23:15:47,267 StatusLogger.java (line
>>> 88) [My_Keyspace].[My_Columnfamily] 45,105541 50/50 20/20
>>> Why is that?
>>>
>>>> > the data directory is bigger than on the other nodes. I've seen it go
>>>> > up to 480GB when the compaction was throttled at 16MB/s
>>>>
>>>> How much data are you writing? Is it at all plausible that the huge
>>>> spike is a reflection of lots of overwriting writes that aren't being
>>>> compacted?
>>>
>>> No, there's no bulk loading going on at the moment and I'm pretty sure
>>> there wasn't when it spiked up to that load.
>>> I've never measured the load because it's a mix of counter increments and
>>> new counters all the time. It's not that much, though.
>>>
>>>> Normally when disk space spikes with repair it's due to other nodes
>>>> streaming huge amounts (maybe all of their data) to the node, leading
>>>> to a temporary spike. But if your "real" size is expected to be 60,
>>>> 480 sounds excessive. Are you sure other nodes aren't running repairs
>>>> at the same time and magnifying each other's data load spikes?
>>>
>>> Yes, the two other nodes were running repairs. I had them scheduled at
>>> 8-hour intervals but they must have ended up overlapping.
>>> When data is streamed from one node to another, does that data go into
>>> the commit log as a regular write?
>>> How much of a negative impact can that have on the repair going on on
>>> this node?
>>>
>>>> > What's even weirder is that currently I have 9 compactions running but
>>>> > CPU is throttled at 1/number of cores half the time (while > 80% the
>>>> > rest of the time). Could this be because other repairs are happening
>>>> > in the ring?
>>>>
>>>> You mean compaction is taking less CPU than it "should"?
>>>
>>> Yes.
>>>
>>>> No, this should not be due to other nodes repairing. However, it sounds
>>>> to me like you are bottlenecking on I/O and the repairs and
>>>
>>> Yes, I/O is really high on the node right now. Around 50% I/O wait.
>>>
>>>> compactions are probably proceeding extremely slowly, probably being
>>>> completely drowned out by live traffic (which is probably having an
>>>> abnormally high performance impact due to the data size spike).
>>>
>>> Yes, the live traffic is 3 to 10x slower during repair. Ouch... I hope I
>>> won't have to do this too often while in production!
>>>
>>>> What's your read concurrency configured on the node? What does "iostat
>>>> -x -k 1" show in the average queue size column?
>>>
>>> Average queue size on the disk (RAID-1 + separate LVM volumes for data,
>>> commit log, caches, logs) varies between 2 and 90. I'd say the average is
>>> around 30-40. Very high variation.
>>>
>>>> Is "nodetool -h localhost tpstats" showing that ReadStage is usually
>>>> "full" (@ your limit)?
>>>
>>> No backlog at all in tpstats.
>>>
>>> I've figured out how AES is logging its actions, and it looks like it
>>> really is going through every CF in every keyspace and doing a tree
>>> request for every token range.
>>> So it really looks like it's just taking forever to compact stuff as
>>> it's repairing.
>>> I saw in another email that repairing was taking 2-3 min/GB... it looks
>>> like a lot more for my ring. Anybody else have numbers?
>>>
>>> Thanks
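PS: doing the arithmetic on that 2-3 min/GB figure, and assuming my "real" per-node size is around the 60GB Peter mentioned:

    60 GB * 2-3 min/GB = 120-180 min, i.e. roughly 2-3 hours per node

so three nodes staggered 8 hours apart should not normally overlap, which is another reason I suspect these repairs are hanging rather than just running slowly.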
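PPS: for completeness, this is roughly how I've been watching the average queue size Peter asked about. "sda" is just a placeholder for whichever device backs the data volume, and the avgqu-sz column index depends on the sysstat version, so check the header line first:

    # full extended stats for one device, refreshed every 5 seconds
    iostat -x -k 5 sda

    # or pull out only the avgqu-sz figure (column 9 on my version)
    iostat -x -k 5 sda | awk '/^sda/ {print $9}'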