On 11/6/2010 1:48 PM, Jonathan Ellis wrote:
> On Fri, Nov 5, 2010 at 8:03 PM, Chip Salzenberg <rev.c...@gmail.com> wrote:
>> In the below "nodetool ring" output, machine 18 was told to loadbalance over
>> an hour ago.  It won't actually leave the ring.  When I first told it to
>> loadbalance, the cluster was under heavy write load; I've turned off the
>> write load, but the node won't actually leave, still.  Help?
>
> What version is the cluster on?
You mean, the Cassandra version?  0.7 beta3.

> Did any of the nodes log any dropped messages?

I didn't keep timestamps of the maintenance steps, so I can't be sure which log entries correspond to which failure states.  I did find dropped-message log entries on node X.22, though.  Here's the batch that happened at more or less the time things went wrong:

 WARN [ScheduledTasks:1] 2010-11-05 17:15:03,294 MessagingService.java (line 515) Dropped 9122 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:05,434 MessagingService.java (line 515) Dropped 16658 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:07,084 MessagingService.java (line 515) Dropped 2167 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:09,371 MessagingService.java (line 515) Dropped 28011 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:11,111 MessagingService.java (line 515) Dropped 1139 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:13,330 MessagingService.java (line 515) Dropped 1203 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:15,241 MessagingService.java (line 515) Dropped 4494 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:16,925 MessagingService.java (line 515) Dropped 2277 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:18,839 MessagingService.java (line 515) Dropped 17376 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:23,385 MessagingService.java (line 515) Dropped 18714 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:25,261 MessagingService.java (line 515) Dropped 18952 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:29,006 MessagingService.java (line 515) Dropped 25137 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:30,859 MessagingService.java (line 515) Dropped 1 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:34,418 MessagingService.java (line 515) Dropped 2580 messages in the last 1000ms
 WARN [ScheduledTasks:1] 2010-11-05 17:15:35,816 MessagingService.java (line 515) Dropped 4317 messages in the last 1000ms

I looked for similar messages on node X.21 but didn't find any.

It seems that node states can become weird or wedged -- bordering on internally inconsistent -- and that cleanup operations on the order of "shut down the node manually and force-remove it from the ring" are commonplace.  I hope I'm missing something.  Am I to understand that ring maintenance requests can simply fail when partially complete, in the same manner as a regular insert might, perhaps due to inter-node RPC overflow?

> Any other error or warning messages?

"Cannot provide an optimal BloomFilter" several times, and "Schema definitions were defined both locally and in cassandra.yaml" on startup.

>> (It also collected 3.6G of load even though automatic bootstrapping is
>> disabled -- but this node had belonged to the cluster before, so maybe
>> cleaning out /var/lib/cassandra/* wasn't enough to prevent the node from
>> rejoining and taking data responsibility?)
>
> Assuming that contains both commitlog and data directories, that
> should do it.  You can tell by what it logs when it first starts up,
> if it's asking other nodes to send it data.

It would appear, then, that Cassandra isn't designed to be operated and understood without constant log watching of all nodes.
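For reference, the directories the "cleaning out /var/lib/cassandra/*" step has to cover are whatever cassandra.yaml points at.  In 0.7 the relevant settings look roughly like the stock defaults below -- this is from memory of the shipped config, not from this cluster, so check your own yaml (saved_caches is easy to overlook when wiping a node):

```yaml
# Fully resetting a node means clearing everything these settings
# point at, not just the data directory.
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
```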
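Since the dropped-message warnings all share one format, at least the "log watching" part can be scripted.  Here's a minimal sketch that totals the drops per minute of log time -- the parser and the log path are mine, not anything Cassandra ships, and the regex assumes the log4j pattern shown in the excerpt above:

```python
import re
from collections import defaultdict

# Matches the WARN lines quoted above:
#   "WARN [thread] 2010-11-05 17:15:03,294 MessagingService.java (line 515)
#    Dropped 9122 messages in the last 1000ms"
# Group 1 is the timestamp truncated to the minute; group 2 is the count.
DROP_RE = re.compile(
    r"WARN .*?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d+ "
    r"MessagingService\.java \(line \d+\) Dropped (\d+) messages"
)

def dropped_per_minute(lines):
    """Total the dropped-message counts per minute of log time."""
    totals = defaultdict(int)
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)

# Typical use (the log path is an assumption, not from this thread):
#   with open("/var/log/cassandra/system.log") as f:
#       print(dropped_per_minute(f))
```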