Forgot to mention: all 9 nodes are on Cassandra 1.2.9. Also, tpstats on the high-CPU node shows:
Pool Name               Active   Pending     Completed   Blocked   All time blocked
ReadStage                   32      6600    3420385815         0                  0
RequestResponseStage         0         0    2094235864         0                  0
MutationStage                0         0    3102461222         0                  0
ReadRepairStage              0         0        438089         0                  0
*ReplicateOnWriteStage       0         0     253180440         0           23703996*
GossipStage                  0         0       5917301         0                  0
AntiEntropyStage             0         0          1486         0                  0
MigrationStage               0         0           143         0                  0
MemtablePostFlusher          0         0         39070         0                  0
FlushWriter                  0         0          7452         0                927
MiscStage                    0         0           257         0                  0
commitlog_archiver           0         0             0         0                  0
AntiEntropySessions          0         0             1         0                  0
InternalResponseStage        0         0            62         0                  0
HintedHandoff                0         0          1961         0                  0

Message type        Dropped
RANGE_SLICE            1681
READ_REPAIR            3921
BINARY                    0
READ                4103953
MUTATION            2651071
_TRACE                    0
REQUEST_RESPONSE       3229

On Fri, Nov 1, 2013 at 3:37 PM, Rakesh Rajan <rakes...@gmail.com> wrote:

> @Tyler / @Rob,
>
> As Ashish mentioned earlier, we have 9 nodes on AWS - 6 in East Coast and
> 3 in Singapore. All 9 nodes use EC2Snitch. The current ring (across all
> nodes in both DCs) looks like this:
>
> ip11 - East Coast - m1.xlarge / us-east-1b      - Size: 83 GB - Token: 0
> ip21 - Singapore  - m1.xlarge / ap-southeast-1a - Size: 88 GB - Token: 1001
> ip12 - East Coast - m1.xlarge / us-east-1b      - Size: 45 GB - Token: 28356863910078205288614550619314017621
> ip13 - East Coast - m1.xlarge / us-east-1c      - Size: 93 GB - Token: 56713727820156410577229101238628035241
> ip22 - Singapore  - m1.xlarge / ap-southeast-1b - Size: 73 GB - Token: 56713727820156410577229101238628036241
> ip14 - East Coast - m1.xlarge / us-east-1c      - Size: 20 GB - Token: 85070591730234615865843651857942052863
> ip15 - East Coast - m1.xlarge / us-east-1d      - Size: 89 GB - Token: 113427455640312821154458202477256070484
> ip23 - Singapore  - m1.xlarge / ap-southeast-1b - Size: 56 GB - Token: 113427455640312821154458202477256071484
> ip16 - East Coast - m1.xlarge / us-east-1d      - Size: 25 GB - Token: 141784319550391026443072753096570088105
>
> Regarding the alternating-racks solution, I have the following questions:
>
> 1) By alternating racks, do you mean alternating racks among the nodes
> within a single DC, or across both DCs? AWS East Coast has 4 AZs and
> Singapore has 2 AZs. So is the final solution something like this:
>
> ip11 - East Coast - m1.xlarge / us-east-1b        - Token: 0
> ip21 - Singapore  - m1.xlarge / ap-southeast-1a   - Token: 1001
> ip12 - East Coast - m1.xlarge / us-east-*1c*      - Token: 28356863910078205288614550619314017621
> ip13 - East Coast - m1.xlarge / us-east-*1d*      - Token: 56713727820156410577229101238628035241
> ip22 - Singapore  - m1.xlarge / ap-southeast-1b   - Token: 56713727820156410577229101238628036241
> ip14 - East Coast - m1.xlarge / us-east-*1a*      - Token: 85070591730234615865843651857942052863
> ip15 - East Coast - m1.xlarge / us-east-*1b*      - Token: 113427455640312821154458202477256070484
> ip23 - Singapore  - m1.xlarge / ap-southeast-*1a* - Token: 113427455640312821154458202477256071484
> ip16 - East Coast - m1.xlarge / us-east-*1c*      - Token: 141784319550391026443072753096570088105
>
> Is this what you had suggested?
>
> 2) How does dynamic_snitch_badness_threshold: 0.1 affect the CPU load?
> On the node (ip11) that had high CPU (system load > 30), I checked the
> score attribute (via the JMX bean
> org.apache.cassandra.db:type=DynamicEndpointSnitch) and saw the following:
>
> East Coast:
> *ip11 = 1.6813321647677475*
> ip12 = 1.0003505696757231
> ip13 = 1.1324160525509974
> ip14 = 1.000350569675723
> ip15 = 1.0007011393514456
> ip16 = 1.0005258545135842
> Singapore:
> ip21 = 1.095880806310253
> ip22 = 1.4100000000000001
> ip23 = 1.0953549517966696
>
> So the ip11 node does indeed have the higher score - but I'm not sure why
> traffic is still going to that replica as opposed to some other node?
>
> Thanks!
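For anyone who wants to pull those snitch scores without a GUI console, here is a minimal Java JMX sketch - assuming JMX is listening on Cassandra's default port 7199 with no authentication. The MBean name is the one quoted above, and its "Scores" attribute is the endpoint-to-score map; the class name is just for illustration.

    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SnitchScores {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            // Default Cassandra JMX endpoint; adjust port/credentials as needed.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName snitch = new ObjectName(
                        "org.apache.cassandra.db:type=DynamicEndpointSnitch");
                // "Scores" maps each endpoint to the latency score the
                // dynamic snitch uses when ranking replicas.
                Map<?, ?> scores = (Map<?, ?>) mbs.getAttribute(snitch, "Scores");
                for (Map.Entry<?, ?> e : scores.entrySet()) {
                    System.out.println(e.getKey() + " = " + e.getValue());
                }
            } finally {
                jmxc.close();
            }
        }
    }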
> On Fri, Nov 1, 2013 at 3:13 PM, Ashish Tyagi <tyagi.i...@gmail.com> wrote:
>
>> Hi Evan,
>>
>> The clients connect to all nodes. We tried shutting down the Thrift
>> server on the affected node. Loads did not come down.
>>
>> On Fri, Nov 1, 2013 at 12:59 AM, Evan Weaver <e...@fauna.org> wrote:
>>
>>> Are all your clients only connecting to your first node? I would
>>> probably strace it and compare the trace to one from a lightly loaded
>>> node.
>>>
>>> On Thu, Oct 31, 2013 at 7:12 PM, Ashish Tyagi <tyagi.i...@gmail.com> wrote:
>>>
>>> > We have a 9-node cluster. 6 nodes are in one datacenter and 3 nodes
>>> > in the other. All machines are Amazon m1.xlarge instances.
>>> >
>>> > Datacenter: DC1
>>> > ==========
>>> > Address  Rack  Status  State   Load      Owns    Token
>>> > ip11     1b    Up      Normal  76.46 GB  16.67%  0
>>> > ip12     1b    Up      Normal  44.66 GB  16.67%  28356863910078205288614550619314017621
>>> > ip13     1c    Up      Normal  85.94 GB  16.67%  56713727820156410577229101238628035241
>>> > ip14     1c    Up      Normal  17.55 GB  16.67%  85070591730234615865843651857942052863
>>> > ip15     1d    Up      Normal  80.74 GB  16.67%  113427455640312821154458202477256070484
>>> > ip16     1d    Up      Normal  20.88 GB  16.67%  141784319550391026443072753096570088105
>>> >
>>> > Datacenter: DC2
>>> > ==========
>>> > Address  Rack  Status  State   Load      Owns    Token
>>> > ip21     1a    Up      Normal  78.32 GB  0.00%   1001
>>> > ip22     1b    Up      Normal  71.23 GB  0.00%   56713727820156410577229101238628036241
>>> > ip23     1b    Up      Normal  53.49 GB  0.00%   113427455640312821154458202477256071484
>>> >
>>> > The problem is that the node with IP address ip11 often has 5-10 times
>>> > more load than any other node. Most of the operations are on counters.
>>> > The primary column family (which receives most writes) has a
>>> > replication factor of 2 in datacenter DC1 and also in datacenter DC2.
>>> > The traffic is write-heavy (reads are less than 10% of total requests).
>>> > We are using size-tiered compaction. Both writes and reads happen with
>>> > a consistency level of LOCAL_QUORUM.
>>> >
>>> > More information:
>>> >
>>> > 1. cassandra.yaml - http://pastebin.com/u344fA6z
>>> > 2. Jmap heap when node under high load - http://pastebin.com/ib3D0Pa
>>> > 3. Nodetool tpstats - http://pastebin.com/s0AS7bGd
>>> > 4. Cassandra-env.sh - http://pastebin.com/ubp4cGUx
>>> > 5. GC log lines - http://pastebin.com/Y0TKphsm
>>> >
>>> > Am I doing anything wrong? Any pointers will be appreciated.
>>> >
>>> > Thanks in advance,
>>> > Ashish
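For reference, the DC1 tokens in the ring above follow the usual evenly spaced RandomPartitioner layout (token_i = i * 2^127 / N), with the DC2 tokens offset by roughly 1000 so no two nodes in different DCs share a token. A rough Java sketch of that calculation - the class and method names are just for illustration, and the rounding may differ by one from the tokens actually in use:

    import java.math.BigInteger;

    public class TokenLayout {
        // RandomPartitioner's token space is [0, 2^127).
        static final BigInteger RING = BigInteger.valueOf(2).pow(127);

        // Evenly spaced initial tokens for a DC with `nodes` nodes, shifted
        // by a small `offset` so nodes in different DCs never collide.
        static void printTokens(String dc, int nodes, long offset) {
            for (int i = 0; i < nodes; i++) {
                BigInteger token = RING.multiply(BigInteger.valueOf(i))
                        .divide(BigInteger.valueOf(nodes))
                        .add(BigInteger.valueOf(offset));
                System.out.println(dc + " node " + i + ": " + token);
            }
        }

        public static void main(String[] args) {
            printTokens("DC1", 6, 0);    // East Coast: 0, 28356863910..., ...
            printTokens("DC2", 3, 1001); // Singapore: offset by ~1000
        }
    }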