Hello, I've been fighting with my cluster for a couple of days now. I'm running 0.8.1.3, using Hector and load balancing requests across all nodes. My question is: how do I get my node back under control so that it runs like the other two?
It's a 3-node, RF=3 cluster with reads & writes at CL=QUORUM; I only have counter columns inside super columns. There are 6 keyspaces, each with about 10 column families, and I'm using the BOP. Before the sequence of events described below, I was writing at CL=ALL and reading at CL=ONE.

I've launched repairs multiple times and they have failed for various reasons, one of them being hitting the limit on the number of open files; I've raised it to 32768 now. I've probably also launched repairs while a repair was already running on the node. At some point compactions were throttled to 16 MB/s; I've since removed that limit.

The problem is that one of my nodes is now impossible to repair (no such problem with the two others). Its load is about 90 GB; it should be a balanced ring, but the other nodes are at 60 GB. Each repair generates thousands of pending compactions of various types (SSTable build, minor, major & validation): the count spikes up to 4000, levels off, then spikes up to 8000. Previously I hit Linux limits and had to restart the node, but the repairs don't look like they've been improving anything time after time. At the same time:
- the number of SSTables for some keyspaces goes dramatically up (from 3 or 4 to several dozen);
- the commit log keeps increasing in size: I'm at 4.3 GB now, and it went up to 40 GB when compaction was throttled at 16 MB/s; on the other nodes it's around 1 GB at most;
- the data directory is bigger than on the other nodes.
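Since several of the repair failures traced back to the open-files limit, here's the minimal check I use to confirm the raised limit actually applies to the running JVM (limits.conf changes only take effect after a restart). This is just a sketch assuming a Linux /proc filesystem; the pgrep pattern is my guess at how the daemon shows up, not an official tool:

```shell
# Print the soft "Max open files" limit and the descriptors actually in
# use for a given pid (reads /proc; hypothetical helper, not a Cassandra tool).
fd_usage() {
    local pid="$1"
    awk '/Max open files/ { print "limit:", $4 }' "/proc/$pid/limits"
    echo "in use: $(ls "/proc/$pid/fd" | wc -l)"
}

# Against the Cassandra JVM (assumes pgrep -f matches the daemon class name):
# fd_usage "$(pgrep -f CassandraDaemon | head -n1)"
```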
I've seen the data directory go up to 480 GB when compaction was throttled at 16 MB/s.

Compaction stats:

pending tasks: 5954
compaction type   keyspace               column family       bytes compacted   bytes total   progress
Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_17   569432689         596621002     95.44%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20   2751906910        5806164726    47.40%
Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_20   2570106876        2776508919    92.57%
Validation        ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_19   3010471905        6517183774    46.19%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_15   4132              303015882     0.00%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_18   36302803          595278385     6.10%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_17   24671866          70959088      34.77%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20   15515781          692029872     2.24%
Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20   1607953684        6606774662    24.34%
Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_20   895043380         2776306015    32.24%

My current lsof count for the cassandra user:

root@xxx:/logs/cassandra# lsof -u cassandra | wc -l
13191

What's even weirder: I currently have 9 compactions running, but CPU sits at 1/number-of-cores half the time (and above 80% the rest of the time). Could this be because other repairs are happening elsewhere in the ring?
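On that CPU observation: with n cores, a single saturated thread shows up in vmstat's "us" column as roughly 100/n percent, so "1/number of cores" would look like compaction running on one thread only. A quick sketch of that arithmetic (the 6-core figure is just an assumption for illustration):

```shell
# One pegged core out of n shows up as roughly 100/n % user CPU in vmstat.
single_core_pct() {
    awk -v n="$1" 'BEGIN { printf "%.0f%%\n", 100 / n }'
}
single_core_pct 6   # ~17%, the level the flat stretches in my vmstat sit at
```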
Example (vmstat 2):

 r  b swpd   free  buff    cache  si  so    bi     bo    in    cs us sy id wa
 7  2    0 177632  1596 13868416   0   0  9060     61  5963  5968 40  7 53  0
 7  0    0 165376  1600 13880012   0   0 41422     28 14027  4608 81 17  1  0
 8  0    0 159820  1592 13880036   0   0 26830     22 10161 10398 76 19  4  1
 6  0    0 161792  1592 13882312   0   0 20046     42  7272  4599 81 17  2  0
 2  0    0 164960  1564 13879108   0   0 17404  26559  6172  3638 79 18  2  0
 2  0    0 162344  1564 13867888   0   0     6      0  2014  2150 40  2 58  0
 1  1    0 159864  1572 13867952   0   0     0  41668   958   581 27  0 72  1
 1  0    0 161972  1572 13867952   0   0     0     89   661   443 17  0 82  1
 1  0    0 162128  1572 13867952   0   0     0     20   482   398 17  0 83  0
 2  0    0 162276  1572 13867952   0   0     0    788   485   395 18  0 82  0
 1  0    0 173896  1572 13867952   0   0     0     29   547   461 17  0 83  0
 1  0    0 163052  1572 13867920   0   0     0      0   741   620 18  1 81  0
 1  0    0 162588  1580 13867948   0   0     0     32   523   387 17  0 82  0
13  0    0 168272  1580 13877140   0   0 12872    269  8056  6725 56  9 34  0
44  1    0 202536  1612 13835956   0   0 26606    530  7946  3887 79 19  2  0
48  1    0 406640  1612 13631740   0   0 22006    310  8605  3705 80 18  2  0
 9  1    0 340300  1620 13697560   0   0 19530    103  8101  3984 84 14  1  0
 2  0    0 297768  1620 13738036   0   0 12438     10  4115  2628 57  9 34  0

Thanks