I am always wondering why people run clusters with number of nodes == RF. I thought you needed number of nodes > RF to get any sensible behaviour... but I am no expert at all.
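For reference, the quorum arithmetic as I understand it (someone correct me if this is off):

    QUORUM = floor(RF / 2) + 1
    RF = 3  ->  QUORUM = floor(3 / 2) + 1 = 2

So even with number of nodes == RF == 3, a read or write at QUORUM only needs 2 of the 3 replicas to answer, which is why such clusters can still behave sensibly with one node down.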
- Stephen
---
Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 14 Aug 2011 11:30, "Philippe" <watche...@gmail.com> wrote:
> Hello, I've been fighting with my cluster for a couple of days now. Running
> 0.8.1.3, using Hector and load-balancing requests across all nodes.
> My question is: how do I get my node back under control so that it runs
> like the other two nodes?
>
> It's a 3-node, RF=3 cluster with reads & writes at CL=QUORUM, and I only have
> counter columns inside super columns. There are 6 keyspaces, each with about
> 10 column families. I'm using the BOP. Before the sequence of events
> described below, I was writing at CL=ALL and reading at CL=ONE. I've
> launched repairs multiple times and they have failed for various reasons,
> one of them being hitting the limit on the number of open files. I've raised
> it to 32768 now. I've probably launched repairs while a repair was already
> running on the node. At some point compactions were throttled to 16 MB/s;
> I've removed this limit.
>
> The problem is that one of my nodes is now impossible to repair (no such
> problem with the other two). The load is about 90 GB; it should be a
> balanced ring, but the other nodes are at 60 GB. Each repair basically
> generates thousands of pending compactions of various types (SSTable build,
> minor, major & validation): it spikes up to 4000, levels off, then spikes up
> to 8000. Previously I hit Linux limits and had to restart the node, but it
> doesn't look like the repairs have been improving anything time after time.
> At the same time,
>
> - the number of SSTables for some keyspaces goes dramatically up (from 3
>   or 4 to several dozen).
> - the commit log keeps increasing in size; I'm at 4.3 GB now, and it went up
>   to 40 GB when compaction was throttled at 16 MB/s. On the other nodes it's
>   around 1 GB at most.
> - the data directory is bigger than on the other nodes. I've seen it go up
>   to 480 GB when compaction was throttled at 16 MB/s.
>
> Compaction stats:
> pending tasks: 5954
> compaction type  keyspace              column family      bytes compacted  bytes total  progress
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_17        569432689    596621002    95.44%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20       2751906910   5806164726    47.40%
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_20       2570106876   2776508919    92.57%
> Validation       ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_19       3010471905   6517183774    46.19%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_15             4132    303015882     0.00%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_18         36302803    595278385     6.10%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_17         24671866     70959088    34.77%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20         15515781    692029872     2.24%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20       1607953684   6606774662    24.34%
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_20        895043380   2776306015    32.24%
>
> My current lsof count for the cassandra user is
> root@xxx:/logs/cassandra# lsof -u cassandra | wc -l
> 13191
>
> What's even weirder is that currently I have 9 compactions running, but CPU
> is throttled at 1/number of cores half the time (while > 80% the rest of the
> time). Could this be because other repairs are happening in the ring?
> Example (vmstat 2):
>  r  b  swpd    free  buff     cache  si  so     bi     bo     in     cs  us  sy  id  wa
>  7  2     0  177632  1596  13868416   0   0   9060     61   5963   5968  40   7  53   0
>  7  0     0  165376  1600  13880012   0   0  41422     28  14027   4608  81  17   1   0
>  8  0     0  159820  1592  13880036   0   0  26830     22  10161  10398  76  19   4   1
>  6  0     0  161792  1592  13882312   0   0  20046     42   7272   4599  81  17   2   0
>  2  0     0  164960  1564  13879108   0   0  17404  26559   6172   3638  79  18   2   0
>  2  0     0  162344  1564  13867888   0   0      6      0   2014   2150  40   2  58   0
>  1  1     0  159864  1572  13867952   0   0      0  41668    958    581  27   0  72   1
>  1  0     0  161972  1572  13867952   0   0      0     89    661    443  17   0  82   1
>  1  0     0  162128  1572  13867952   0   0      0     20    482    398  17   0  83   0
>  2  0     0  162276  1572  13867952   0   0      0    788    485    395  18   0  82   0
>  1  0     0  173896  1572  13867952   0   0      0     29    547    461  17   0  83   0
>  1  0     0  163052  1572  13867920   0   0      0      0    741    620  18   1  81   0
>  1  0     0  162588  1580  13867948   0   0      0     32    523    387  17   0  82   0
> 13  0     0  168272  1580  13877140   0   0  12872    269   8056   6725  56   9  34   0
> 44  1     0  202536  1612  13835956   0   0  26606    530   7946   3887  79  19   2   0
> 48  1     0  406640  1612  13631740   0   0  22006    310   8605   3705  80  18   2   0
>  9  1     0  340300  1620  13697560   0   0  19530    103   8101   3984  84  14   1   0
>  2  0     0  297768  1620  13738036   0   0  12438     10   4115   2628  57   9  34   0
>
> Thanks
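For anyone following along, this is roughly what I would check for the open-file limit and compaction throttle mentioned above, as a rough sketch only (the pid is a placeholder, and this assumes stock 0.8.x tooling):

    # confirm the raised open-file limit actually applies to the running Cassandra process
    cat /proc/<cassandra-pid>/limits | grep 'open files'

    # current count of files held open by the cassandra user
    lsof -u cassandra | wc -l

    # compaction backlog, streaming/repair activity and thread-pool backlogs on this node
    nodetool -h localhost compactionstats
    nodetool -h localhost netstats
    nodetool -h localhost tpstats

    # compaction throttling on 0.8 is set in cassandra.yaml:
    #   compaction_throughput_mb_per_sec: 16   (0 disables throttling)

I am not sure any of that explains the CPU pattern, but it should at least show whether the node is still hitting file-handle or throttle limits.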