Hi!

*Problem*
I have one node that seems to be in a bad state, with lots of dropped reads over a long period.

*My cluster*
I have a 3-node cluster on Amazon m1.large, using the DataStax AMI with Cassandra 1.0.8.
RF=3, RCL=WCL=QUORUM.
I use Hector, which should be doing round robin of the requests between the nodes.
The cluster is not under much load.

*Info*
Using OpsCenter I can see that:
The number of read / write requests is distributed evenly between the nodes.
Disk latency (both read and write) and disk throughput are much worse on one of the nodes.

*This is also visible in iostat*

"Good node"

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb       0.58    0.03    42.81   1.31   2710.90  104.62    63.82     0.02   5.96  0.48   2.14
xvdc       0.57    0.00    42.85   1.30   2712.72  104.83    63.81     0.20   4.60  0.48   2.12

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb       5.60    0.10   456.50   0.40  32729.60   28.00    71.70    19.65  43.00  0.36  16.50
xvdc       4.10    0.00   460.00   0.80  33342.40   60.80    72.49    17.55  38.09  0.35  16.00

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb       4.70    0.10   608.20   1.10  39217.60   77.70    64.49    26.04  42.73  0.39  23.50
xvdc       5.70    0.00   606.80   0.60  38645.60   24.00    63.66    22.89  37.69  0.38  23.10

"Bad node"

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb       0.67    0.03    51.72   1.02   3330.21   80.62    64.67     0.06   1.19  0.60   3.16
xvdc       0.67    0.00    51.66   1.02   3329.23   80.85    64.73     0.15   2.84  0.60   3.17

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb      16.50    0.10  1484.70   0.80  88937.60   52.90    59.91   115.07  77.11  0.58  86.00
xvdc      16.20    0.00  1492.80   0.60  89701.60   43.20    60.09   102.80  69.06  0.58  86.10

Device:  rrqm/s  wrqm/s      r/s    w/s    rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm  %util
xvdb      14.00    0.10  1260.00   0.70  81632.00   33.70    64.78    76.96  61.56  0.54  68.10
xvdc      15.50    0.10  1257.60   0.90  80932.00   63.20    64.36    88.94  70.90  0.53  67.10

*Question*
This does not make sense to me: why would one node do many more reads / writes, reading more sectors, with higher utilization and wait time? Could it be an Amazon issue? I don't think so.
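
To put rough numbers on the comparison, here is a minimal Python sketch that averages a few columns from the interval samples quoted above. The helper names (`parse_iostat`, `mean`) and the constants are hypothetical, chosen only for this illustration, and the script assumes the 11-column `iostat -x` layout shown in this post.

```python
# Column names of `iostat -x` exactly as they appear in the samples in this post.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

# Second and third reports from each node (values copied from the post).
GOOD_NODE = """
xvdb 5.60 0.10 456.50 0.40 32729.60 28.00 71.70 19.65 43.00 0.36 16.50
xvdc 4.10 0.00 460.00 0.80 33342.40 60.80 72.49 17.55 38.09 0.35 16.00
xvdb 4.70 0.10 608.20 1.10 39217.60 77.70 64.49 26.04 42.73 0.39 23.50
xvdc 5.70 0.00 606.80 0.60 38645.60 24.00 63.66 22.89 37.69 0.38 23.10
"""

BAD_NODE = """
xvdb 16.50 0.10 1484.70 0.80 88937.60 52.90 59.91 115.07 77.11 0.58 86.00
xvdc 16.20 0.00 1492.80 0.60 89701.60 43.20 60.09 102.80 69.06 0.58 86.10
xvdb 14.00 0.10 1260.00 0.70 81632.00 33.70 64.78 76.96 61.56 0.54 68.10
xvdc 15.50 0.10 1257.60 0.90 80932.00 63.20 64.36 88.94 70.90 0.53 67.10
"""


def parse_iostat(text):
    """Return a list of (device, {column: value}) for each device row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and repeated "Device:" header rows.
        if len(fields) != len(COLUMNS) + 1 or fields[0].startswith("Device"):
            continue
        values = [float(f) for f in fields[1:]]
        rows.append((fields[0], dict(zip(COLUMNS, values))))
    return rows


def mean(rows, column):
    """Average one column over all parsed device rows."""
    return sum(stats[column] for _, stats in rows) / len(rows)


good_rows = parse_iostat(GOOD_NODE)
bad_rows = parse_iostat(BAD_NODE)

# Print the per-node averages and the bad/good ratio for a few columns.
for column in ("r/s", "rsec/s", "avgqu-sz", "await", "%util"):
    good = mean(good_rows, column)
    bad = mean(bad_rows, column)
    print(f"{column:>9}: good={good:10.2f}  bad={bad:10.2f}  ratio={bad / good:5.2f}")
```

The first report from each node is left out on purpose: the first report printed by `iostat` covers the time since boot rather than the sampling interval, so it is not comparable with the later ones.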
This may of course be the result of flushing and compactions, but it persists for a long time, even when no compaction is happening.

What would you do to further explore or fix the problem?

Thank you very much!!

*Tamar Fraenkel*
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956