So trying to understand why I'm seeing one or two nodes getting hammered when I have a batch job running. By "hammered" I mean Disk IO load is at or near 100% and CPU load is around 10.0-17.0 while the vast majority of my nodes have a CPU load between 1-2. The node(s) hammered seems to move around during the job. A few minutes on one node, then it stabilizes and the problem moves to a different one or two. Sometimes the load across all my nodes is back to 1-2 even though the job is running just as quickly as before.
The job at hand, takes a row KeyA in one CF (like InterfaceDaily), and writes it as a single column in another CF (like InterfaceDailyCompressed) to row KeyB and then deletes the original row KeyA. Using the random partitioner so I would think that read & writes would be well distributed over my 9 nodes (RF=3). The amount of data read and written is pretty small (2304 bytes written read from 288 cols x 8 bytes each). My job is multi-threaded, but I've reduced the number of threads from 20 to 5 and while things seem a little better, it's hardly 4x better and things are taking about 4x as long. Client is using Hector 1.0.5 and servers are running Cassandra 1.0.9. On the effected node will start generating logs like this: INFO [ScheduledTasks:1] 2012-06-22 17:55:18,626 MessagingService.java (line 607) 616 READ messages dropped in last 5000ms INFO [ScheduledTasks:1] 2012-06-22 17:55:18,627 StatusLogger.java (line 50) Pool Name Active Pending Blocked INFO [ScheduledTasks:1] 2012-06-22 17:55:18,627 StatusLogger.java (line 65) ReadStage 16 4628 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,627 StatusLogger.java (line 65) RequestResponseStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,627 StatusLogger.java (line 65) ReadRepairStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) MutationStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) ReplicateOnWriteStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) GossipStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) AntiEntropyStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) MigrationStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) StreamStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,628 StatusLogger.java (line 65) MemtablePostFlusher 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 65) FlushWriter 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 65) MiscStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 65) InternalResponseStage 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 65) HintedHandoff 0 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 70) CompactionManager 0 0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 82) MessagingService n/a 0,0 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 86) ColumnFamily Memtable ops,data Row cache size/cap Key cache size/cap INFO [ScheduledTasks:1] 2012-06-22 17:55:18,629 StatusLogger.java (line 89) system.NodeIdInfo 0,0 0/0 1/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.IndexInfo 0,0 0/0 0/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.LocationInfo 0,0 0/0 1/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.Versions 0,0 0/0 0/3 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.Migrations 0,0 0/0 0/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.HintsColumnFamily 0,0 0/0 1/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) system.Schema 0,0 0/0 1/1 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) spider3.MetroFabricDailyAgg 124884,59346265 0/0 8228/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) spider3.FabricDailyAggCompressed 0,0 0/0 0/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) spider3.IbxFabricDailyAgg 116794,45801943 0/0 16311/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,630 StatusLogger.java (line 89) spider3.DeviceDailyAgg 121624,73123817 0/0 61037/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.IbxFabricDailyAggCompressed 6016,8819380 0/0 1317/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.Meta 2094,16880922 0/0 8768/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.FabricDailyAgg 0,0 0/0 0/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.InterfaceDailyCompressed 50900,57986845 0/0 200000/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.DeviceDailyAggCompressed 35282,59557567 0/0 7629/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.TrunklessDeviceDailyAgg 69603,47157326 0/0 31600/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.LastStat 32,936460 0/0 19777/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.RegionFabricDailyAggCompressed 686,1191802 0/0 147/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.TrunklessDeviceDailyAggCompressed 14144,23056640 0/0 3058/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,631 StatusLogger.java (line 89) spider3.RegionFabricDailyAgg 94085,12703937 0/0 1627/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) spider3.DeviceFabricDailyAgg 10779,5941110 0/0 31033/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) spider3.InterfaceDaily 280817,145511376 0/0 200000/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) spider3.DeviceFabricDailyAggCompressed 15179,20592356 0/0 3229/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) spider3.MetroFabricDailyAggCompressed 3032,4568080 0/0 642/200000 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.rollups60 218594,158987534 0/0 50/50 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.events_timeline 0,0 0/0 0/5 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.rollups300 91185,71822435 0/0 32/50 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.pdps 312736,62115027 0/0 0/300 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.events 0,0 0/0 6/50 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,632 StatusLogger.java (line 89) OpsCenter.settings 0,0 0/0 0/50 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,633 StatusLogger.java (line 89) OpsCenter.rollups7200 3908,3024712 0/0 0/50 INFO [ScheduledTasks:1] 2012-06-22 17:55:18,633 StatusLogger.java (line 89) OpsCenter.rollups86400 796,608956 0/0 0/50 Looking at the logs, I don't see any compactions or repairs going on. Anyways, how should I debug this further to see what is causing this? -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin "carpe diem quam minimum credula postero"