Hi!

*Problem*
I have one node that seems to be in bad shape, with a lot of dropped reads
over a long period of time.

*My cluster*
I have a 3-node cluster on Amazon m1.large instances, running the DataStax
AMI with Cassandra 1.0.8.
RF=3, RCL=WCL=QUORUM.
I use Hector, which should be round-robining requests between the nodes.
The cluster is not under much load.
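
For reference, my Hector setup is roughly like the sketch below (Hector
1.0-era API from memory; the cluster, host, and keyspace names are
placeholders, not my exact code):

// Rough sketch of the client setup; names are illustrative.
import me.prettyprint.cassandra.connection.RoundRobinBalancingPolicy;
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ClientSetup {
    public static Keyspace connect() {
        // All three nodes are listed; the round-robin policy should spread
        // requests evenly across them.
        CassandraHostConfigurator hostConfig =
                new CassandraHostConfigurator("node1:9160,node2:9160,node3:9160");
        hostConfig.setLoadBalancingPolicy(new RoundRobinBalancingPolicy());

        Cluster cluster = HFactory.getOrCreateCluster("TokCluster", hostConfig);

        // RCL = WCL = QUORUM
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

        return HFactory.createKeyspace("tok", cluster, ccl);
    }
}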

*Info*
Using OpsCenter I can see that:

    The number of read/write requests is distributed evenly between the nodes.
    Disk latency (both read and write) and disk throughput are much worse on
one of the nodes.

*This is also visible in iostats*
"Good node"
Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb              0.58     0.03    42.81    1.31   2710.90   104.62    63.82     0.02    5.96   0.48   2.14
xvdc              0.57     0.00    42.85    1.30   2712.72   104.83    63.81     0.20    4.60   0.48   2.12

Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb              5.60     0.10   456.50    0.40  32729.60    28.00    71.70    19.65   43.00   0.36  16.50
xvdc              4.10     0.00   460.00    0.80  33342.40    60.80    72.49    17.55   38.09   0.35  16.00

Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb              4.70     0.10   608.20    1.10  39217.60    77.70    64.49    26.04   42.73   0.39  23.50
xvdc              5.70     0.00   606.80    0.60  38645.60    24.00    63.66    22.89   37.69   0.38  23.10


"Bad Node"
Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb              0.67     0.03    51.72    1.02   3330.21    80.62    64.67     0.06    1.19   0.60   3.16
xvdc              0.67     0.00    51.66    1.02   3329.23    80.85    64.73     0.15    2.84   0.60   3.17

Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb             16.50     0.10  1484.70    0.80  88937.60    52.90    59.91   115.07   77.11   0.58  86.00
xvdc             16.20     0.00  1492.80    0.60  89701.60    43.20    60.09   102.80   69.06   0.58  86.10

Device:         rrqm/s   wrqm/s      r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdb             14.00     0.10  1260.00    0.70  81632.00    33.70    64.78    76.96   61.56   0.54  68.10
xvdc             15.50     0.10  1257.60    0.90  80932.00    63.20    64.36    88.94   70.90   0.53  67.10


*Question*
This does not make sense to me: why would one node do so many more
reads/writes, reading more sectors with higher utilization and wait times?
Could it be an Amazon issue? I don't think so.
It could of course be the result of flushing and compaction, but the problem
persists for a long time, even when no compaction is happening.
What would you do to further explore or fix the problem?
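
(For reference, the way I check for compaction, and the other standard
nodetool output I can compare between the bad node and the good ones, looks
roughly like this; the host is a placeholder:)

nodetool -h <bad-node-ip> compactionstats   (confirm no compaction is running)
nodetool -h <bad-node-ip> tpstats           (dropped READ messages, pending stages)
nodetool -h <bad-node-ip> cfstats           (per-CF read latency, SSTable counts)
nodetool -h <bad-node-ip> ring              (token ownership / load balance)
nodetool -h <bad-node-ip> netstats          (streaming and hinted handoff activity)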


Thank you very much!!
Tamar Fraenkel
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956

