I am always wondering why people run clusters with number of nodes == RF. I thought you needed number of nodes > RF to get any sensible behaviour... but I am no expert at all.
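For reference, the quorum arithmetic as I understand it (someone correct me if this is off):

    QUORUM = floor(RF / 2) + 1
    RF = 3  ->  QUORUM = floor(3 / 2) + 1 = 2

So even with number of nodes == RF == 3, a read or write at QUORUM only needs 2 of the 3 replicas to answer, which is why such clusters can still behave sensibly with one node down.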
- Stephen
---
Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 14 Aug 2011 11:30, "Philippe" <watche...@gmail.com> wrote:
> Hello, I've been fighting with my cluster for a couple of days now. Running
> 0.8.1.3, using Hector and load-balancing requests across all nodes.
> My question is: how do I get my node back under control so that it runs
> like the other two nodes?
>
> It's a 3-node, RF=3 cluster with reads & writes at CL=QUORUM, and I only have
> counter columns inside super columns. There are 6 keyspaces, each with about
> 10 column families. I'm using the BOP. Before the sequence of events
> described below, I was writing at CL=ALL and reading at CL=ONE. I've
> launched repairs multiple times and they have failed for various reasons,
> one of them being hitting the limit on the number of open files. I've raised
> it to 32768 now. I've probably launched repairs while a repair was already
> running on the node. At some point compactions were throttled to 16 MB/s;
> I've removed this limit.
>
> The problem is that one of my nodes is now impossible to repair (no such
> problem with the other two). The load is about 90 GB; it should be a
> balanced ring, but the other nodes are at 60 GB. Each repair basically
> generates thousands of pending compactions of various types (SSTable build,
> minor, major & validation): it spikes up to 4000, levels off, then spikes up
> to 8000. Previously I hit Linux limits and had to restart the node, but it
> doesn't look like the repairs have been improving anything time after time.
> At the same time,
>
> - the number of SSTables for some keyspaces goes dramatically up (from 3
>   or 4 to several dozen).
> - the commit log keeps increasing in size; I'm at 4.3 GB now, and it went up
>   to 40 GB when compaction was throttled at 16 MB/s. On the other nodes it's
>   around 1 GB at most.
> - the data directory is bigger than on the other nodes. I've seen it go up
>   to 480 GB when compaction was throttled at 16 MB/s.
>
> Compaction stats:
> pending tasks: 5954
> compaction type  keyspace              column family      bytes compacted  bytes total  progress
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_17        569432689    596621002    95.44%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20       2751906910   5806164726    47.40%
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_20       2570106876   2776508919    92.57%
> Validation       ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_19       3010471905   6517183774    46.19%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_15             4132    303015882     0.00%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_18         36302803    595278385     6.10%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_17         24671866     70959088    34.77%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20         15515781    692029872     2.24%
> Minor            ROLLUP_CDMA_COVERAGE  PUBLIC_MONTHLY_20       1607953684   6606774662    24.34%
> Validation       ROLLUP_WIFI_COVERAGE  PUBLIC_MONTHLY_20        895043380   2776306015    32.24%
>
> My current lsof count for the cassandra user is
> root@xxx:/logs/cassandra# lsof -u cassandra | wc -l
> 13191
>
> What's even weirder is that currently I have 9 compactions running, but CPU
> is throttled at 1/number of cores half the time (while > 80% the rest of the
> time). Could this be because other repairs are happening in the ring?
> Example (vmstat 2):
>  r  b  swpd    free  buff     cache  si  so     bi     bo     in     cs  us  sy  id  wa
>  7  2     0  177632  1596  13868416   0   0   9060     61   5963   5968  40   7  53   0
>  7  0     0  165376  1600  13880012   0   0  41422     28  14027   4608  81  17   1   0
>  8  0     0  159820  1592  13880036   0   0  26830     22  10161  10398  76  19   4   1
>  6  0     0  161792  1592  13882312   0   0  20046     42   7272   4599  81  17   2   0
>  2  0     0  164960  1564  13879108   0   0  17404  26559   6172   3638  79  18   2   0
>  2  0     0  162344  1564  13867888   0   0      6      0   2014   2150  40   2  58   0
>  1  1     0  159864  1572  13867952   0   0      0  41668    958    581  27   0  72   1
>  1  0     0  161972  1572  13867952   0   0      0     89    661    443  17   0  82   1
>  1  0     0  162128  1572  13867952   0   0      0     20    482    398  17   0  83   0
>  2  0     0  162276  1572  13867952   0   0      0    788    485    395  18   0  82   0
>  1  0     0  173896  1572  13867952   0   0      0     29    547    461  17   0  83   0
>  1  0     0  163052  1572  13867920   0   0      0      0    741    620  18   1  81   0
>  1  0     0  162588  1580  13867948   0   0      0     32    523    387  17   0  82   0
> 13  0     0  168272  1580  13877140   0   0  12872    269   8056   6725  56   9  34   0
> 44  1     0  202536  1612  13835956   0   0  26606    530   7946   3887  79  19   2   0
> 48  1     0  406640  1612  13631740   0   0  22006    310   8605   3705  80  18   2   0
>  9  1     0  340300  1620  13697560   0   0  19530    103   8101   3984  84  14   1   0
>  2  0     0  297768  1620  13738036   0   0  12438     10   4115   2628  57   9  34   0
>
> Thanks
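For anyone following along, this is roughly what I would check for the open-file limit and compaction throttle mentioned above, as a rough sketch only (the pid is a placeholder, and this assumes stock 0.8.x tooling):

    # confirm the raised open-file limit actually applies to the running Cassandra process
    cat /proc/<cassandra-pid>/limits | grep 'open files'

    # current count of files held open by the cassandra user
    lsof -u cassandra | wc -l

    # compaction backlog, streaming/repair activity and thread-pool backlogs on this node
    nodetool -h localhost compactionstats
    nodetool -h localhost netstats
    nodetool -h localhost tpstats

    # compaction throttling on 0.8 is set in cassandra.yaml:
    #   compaction_throughput_mb_per_sec: 16   (0 disables throttling)

I am not sure any of that explains the CPU pattern, but it should at least show whether the node is still hitting file-handle or throttle limits.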