A thread dump on one of the machines showing suspiciously high CPU might help figure out what is taking all that CPU.
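Something along these lines usually narrows it down (a rough sketch only -- the PID and the hex thread id are placeholders, and it assumes a JDK with jstack on the path):

    top -H -p <cassandra-pid>        # find the TID of the thread burning CPU
    printf '%x\n' <tid>              # convert that decimal TID to hex
    jstack <cassandra-pid> > /tmp/threads.txt
    grep -A 20 'nid=0x<hex-tid>' /tmp/threads.txt   # stack of the hot thread

If jstack won't attach because the node is struggling, kill -3 <cassandra-pid> writes the same thread dump to the JVM's stdout (wherever your startup script sends it).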
On Wed, Aug 21, 2013 at 8:57 AM, Keith Wright <kwri...@nanigans.com> wrote:

> Some last-minute info on this to hopefully enlighten. We are doing ~200
> reads and writes across our 7-node SSD cluster right now (we can usually do
> closer to 20K reads at least) and seeing CPU load as follows for the nodes
> (with some ParNew numbers to give an idea of GC):
>
> 001 – 1200% (ParNew at 120 ms/sec)
> 002 – 6% (ParNew at 0)
> 003 – 600% (ParNew at 45 ms/sec)
> 004 – 900%
> 005 – 500%
> 006 – 10%
> 007 – 130%
>
> There are no compactions running on 001; however, I did see a broken pipe
> error in the logs there (see below). Netstats for 001 shows nothing
> pending. It appears that all of the load/latency is related to one column
> family. You can see cfstats & cfhistograms output below; note that we are
> using LCS. I have brought the odd cfhistograms behavior to the thread
> before and am not sure what's going on there. We are in a production-down
> situation right now, so any help would be much appreciated!
>
> Column Family: global_user
> SSTable count: 7546
> SSTables in each level: [2, 10, 106/100, 453, 6975, 0, 0]
> Space used (live): 83848742562
> Space used (total): 83848742562
> Number of Keys (estimate): 549792896
> Memtable Columns Count: 526746
> Memtable Data Size: 117408252
> Memtable Switch Count: 0
> Read Count: 11673
> Read Latency: 1950.062 ms.
> Write Count: 118588
> Write Latency: 0.080 ms.
> Pending Tasks: 0
> Bloom Filter False Positives: 4322
> Bloom Filter False Ratio: 0.84066
> Bloom Filter Space Used: 383507440
> Compacted row minimum size: 73
> Compacted row maximum size: 2816159
> Compacted row mean size: 324
>
> [kwright@lxpcas001 ~]$ nodetool cfhistograms users global_user
> users/global_user histograms
> Offset      SSTables  Write Latency  Read Latency    Row Size  Column Count
> 1               8866              0             0           0          3420
> 2               1001              0             0           0      99218975
> 3               1249              0             0           0     319713048
> 4               1074              0             0           0      25073893
> 5                132              0             0           0      15359199
> 6                  0              0             0           0      27794925
> 7                  0             12             0           0       7954974
> 8                  0             23             0           0       7733934
> 10                 0            184             0           0      13276275
> 12                 0            567             0           0       9077508
> 14                 0           1098             0           0       5879292
> 17                 0           2722             0           0       5693471
> 20                 0           4379             0           0       3204131
> 24                 0           8928             0           0       2614995
> 29                 0          13525             0           0       1824584
> 35                 0          16759             0           0       1265911
> 42                 0          17048             0           0        868075
> 50                 0          14162             5           0        596417
> 60                 0          11806            15           0        467747
> 72                 0           8569           108           0        354276
> 86                 0           7042           276         227        269987
> 103                0           5936           372        2972        218931
> 124                0           4538           577         157        181360
> 149                0           2981          1076     7388090        144298
> 179                0           1929          1529    90535838        116628
> 215                0           1081          1450   182701876         93378
> 258                0            499          1125   141393480         74052
> 310                0            124           756    18883224         58617
> 372                0             31           460    24599272         45453
> 446                0             25           247    23516772         34310
> 535                0             10           146    13987584         26168
> 642                0             20           194    12091458         19965
> 770                0              8           196     9269197         14649
> 924                0              9           340     8082898         11015
> 1109               0              9           225     4762865          8058
> 1331               0              9           154     3330110          5866
> 1597               0              8           144     2367615          4275
> 1916               0              1           188     1633608          3087
> 2299               0              4           216     1139820          2196
> 2759               0              5           201      819019          1456
> 3311               0              4           194      600522          1135
> 3973               0              6           181      454566           786
> 4768               0             13           136      353886           587
> 5722               0              6           152      280630           400
> 6866               0              5            80      225545           254
> 8239               0              6           112      183285           138
> 9887               0              0            68      149820           109
> 11864              0              5            99      121722            66
> 14237              0             57            86       98352            50
> 17084              0             18            99       79085            35
> 20501              0              1            93       62423            11
> 24601              0              0            61       49471             9
> 29521              0              0            69       37395             5
> 35425              0              4            56       28611             6
> 42510              0              0            57       21876             1
> 51012              0              9            60       16105             0
> 61214              0              0            52       11996             0
> 73457              0              0            50        8791             0
> 88148              0              0            38        6430             0
> 105778             0              0            25        4660             0
> 126934             0              0            15        3308             0
> 152321             0              0             2        2364             0
> 182785             0              0             0        1631             0
> 219342             0              0             0        1156             0
> 263210             0              0             0         887             0
> 315852             0              0             0         618             0
> 379022             0              0             0         427             0
> 454826             0              0             0         272             0
> 545791             0              0             0         168             0
> 654949             0              0             0         115             0
> 785939             0              0             0          61             0
> 943127             0              0             0          58             0
> 1131752            0              0             0          34             0
> 1358102            0              0             0          19             0
> 1629722            0              0             0           9             0
> 1955666            0              0             0           4             0
> 2346799            0              0             0           5             0
> 2816159            0              0             0           2             0
> 3379391            0              0             0           0             0
> 4055269            0              0             0           0             0
> 4866323            0              0             0           0             0
> 5839588            0              0             0           0             0
> 7007506            0              0             0           0             0
> 8409007            0              0             0           0             0
> 10090808           0              0             0           0             0
> 12108970           0              0             0           0             0
> 14530764           0              0             0           0             0
> 17436917           0              0             0           0             0
> 20924300           0              0             0           0             0
> 25109160           0              0             0           0             0
>
> ERROR [WRITE-/10.8.44.98] 2013-08-21 06:50:25,450 OutboundTcpConnection.java (line 197) error writing to /10.8.44.98
> java.lang.RuntimeException: java.io.IOException: Broken pipe
>     at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:59)
>     at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
>     at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
>     at org.apache.cassandra.db.Row$RowSerializer.serialize(Row.java:62)
>     at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:78)
>     at org.apache.cassandra.db.ReadResponseSerializer.serialize(ReadResponse.java:69)
>     at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:131)
>     at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:221)
>     at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:186)
>     at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
> Caused by: java.io.IOException: Broken pipe
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
>     at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>     at java.nio.channels.Channels.writeFully(Channels.java:98)
>     at java.nio.channels.Channels.access$000(Channels.java:61)
>     at java.nio.channels.Channels$1.write(Channels.java:174)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>     at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
>     at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
>     at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>     at org.apache.cassandra.utils.ByteBufferUtil.write(ByteBufferUtil.java:328)
>     at org.apache.cassandra.utils.ByteBufferUtil.writeWithLength(ByteBufferUtil.java:315)
>     at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:55)
>     ... 9 more
>
>
> From: Keith Wright <kwri...@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, August 21, 2013 2:35 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Nodes get stuck
>
> Still looking for help! We have stopped almost ALL traffic to the cluster,
> and still some nodes are showing almost 1000% CPU for cassandra with no
> iostat activity.
> We were running cleanup on one of the nodes that was not showing load
> spikes; however, now that I have attempted to stop cleanup there via
> nodetool stop cleanup, the java task for stopping cleanup is itself at
> 1500% and has not returned after 2 minutes. This is VERY odd behavior.
> Any ideas? Hardware failure? Network? We are not seeing anything there
> but wanted to get ideas.
>
> Thanks
>
> From: Keith Wright <kwri...@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, August 20, 2013 8:32 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Nodes get stuck
>
> Hi all,
>
> We are using C* 1.2.4 with vnodes and SSDs. We have seen behavior recently
> where 3 of our nodes get locked up in high load in what appears to be a GC
> spiral while the rest of the cluster (7 nodes total) appears fine. When I
> run tpstats, I see the following (assuming tpstats returns at all), and top
> shows cassandra pegged at 2000%. Obviously we have a large number of
> blocked reads. In the past I could explain this by unexpectedly wide rows;
> however, we have handled that. When the cluster starts to melt down like
> this, it's hard to get visibility into what's going on and what triggered
> the issue, as everything starts to pile on. OpsCenter becomes unusable,
> and because the affected nodes are under GC pressure, getting any data via
> nodetool or JMX is also difficult. What do people do to handle these
> situations? We are going to start graphing reads/writes/sec/CF to Ganglia
> in the hopes that it helps.
>
> Thanks
>
> Pool Name               Active  Pending   Completed  Blocked  All time blocked
> ReadStage                  256      381  1245117434        0                 0
> RequestResponseStage         0        0  1161495947        0                 0
> MutationStage                8        8   481721887        0                 0
> ReadRepairStage              0        0    85770600        0                 0
> ReplicateOnWriteStage        0        0    21896804        0                 0
> GossipStage                  0        0     1546196        0                 0
> AntiEntropyStage             0        0        5009        0                 0
> MigrationStage               0        0        1082        0                 0
> MemtablePostFlusher          0        0       10178        0                 0
> FlushWriter                  0        0        6081        0              2075
> MiscStage                    0        0          57        0                 0
> commitlog_archiver           0        0           0        0                 0
> AntiEntropySessions          0        0           0        0                 0
> InternalResponseStage        0        0           6        0                 0
> HintedHandoff                1        1         246        0                 0
>
> Message type        Dropped
> RANGE_SLICE             482
> READ_REPAIR               0
> BINARY                    0
> READ                 515762
> MUTATION                 39
> _TRACE                    0
> REQUEST_RESPONSE         29
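On the GC-spiral question in the quoted mail above: when nodetool and JMX stop responding, jstat can usually still read the collector counters, since it goes through the JVM's on-disk perf data rather than JMX. A minimal sketch (run it as the user that owns the Cassandra process; the column names are from the JDK 6/7 -gcutil output):

    jstat -gcutil <cassandra-pid> 1000
    # S0/S1/E/O/P are survivor/eden/old/perm occupancy percentages;
    # YGC/YGCT and FGC/FGCT are young/full collection counts and cumulative times.
    # An FGC count that climbs on every sample while O stays near 100% is the
    # usual signature of a GC death spiral.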
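And for the per-CF reads/writes graphing mentioned above, until the Ganglia plumbing is in place you can get rough rates by sampling the cumulative counters that cfstats already prints. A sketch only -- the column family name comes from the output above, and the awk is tied to the 1.2-era cfstats format:

    CF=global_user                 # column family to watch
    while true; do
        ts=$(date +%H:%M:%S)
        nodetool cfstats | awk -v cf="$CF" -v ts="$ts" '
            $0 ~ "Column Family: " cf { hit = 1 }
            hit && /Read Count:/      { r = $3 }
            hit && /Write Count:/     { print ts, "reads=" r, "writes=" $3; exit }'
        sleep 10
    done
    # The counters are cumulative since node start, so the deltas between
    # samples give reads/writes per 10-second interval.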