Still having issues with node bootstrapping. The new node just died: it went into a full GC, and the nodes it had active streams with noticed it was down. After the full GC finished, the new node printed this log:
ERROR 02:52:36,259 Stream failed because /10.10.20.35 died or was restarted/removed (streams may still be active in background, but further streams won't be started)

Here 10.10.20.35 is an existing node that the new node was streaming from. A similar log line was printed for every other node in the cluster. Why did the new node just exit after the full GC pause?

We have heap dumps enabled on full GCs, and these are the top offenders on the new node. A new entry that I noticed is the CompressionMetadata chunks. Is there anything I can do to optimize that?

 num     #instances         #bytes  class name
----------------------------------------------
   1:      42508421     4818885752  [B
   2:      65860543     3161306064  java.nio.HeapByteBuffer
   3:     124361093     2984666232  org.apache.cassandra.io.compress.CompressionMetadata$Chunk
   4:      29745665     1427791920  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   5:      29810362      953931584  org.apache.cassandra.db.Column
   6:         31623      498012768  [Lorg.apache.cassandra.io.compress.CompressionMetadata$Chunk;
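As a rough back-of-the-envelope check (a sketch only: the 4 TB data size, the default chunk_length_kb of 64, and ~24 bytes of heap per Chunk entry are assumptions for illustration, not measurements from this node), the chunk count scales with data size divided by chunk length:

    #!/bin/bash
    # Sketch: estimate how many CompressionMetadata$Chunk entries a node holds
    # and the heap they need. Assumes ~4 TB of SSTable data, the default
    # chunk_length_kb of 64, and ~24 bytes of heap per Chunk entry (roughly
    # what the histogram above implies: 2984666232 / 124361093 ~= 24).
    data_bytes=$((4 * 1024 * 1024 * 1024 * 1024))   # ~4 TB of SSTables
    chunk_bytes=$((64 * 1024))                      # chunk_length_kb = 64
    bytes_per_chunk=24

    chunks=$((data_bytes / chunk_bytes))
    heap_mb=$((chunks * bytes_per_chunk / 1024 / 1024))
    echo "expected chunks: ${chunks}, ~${heap_mb} MB of heap for chunk metadata"

    # Raising chunk_length_kb on the largest column families cuts the chunk
    # count proportionally (at the cost of larger reads per chunk). The table
    # name below is illustrative only, and the new setting only applies to
    # SSTables written or compacted after the change:
    #   echo "ALTER TABLE my_keyspace.my_table
    #           WITH compression = {'sstable_compression': 'SnappyCompressor',
    #                               'chunk_length_kb': 256};" | cqlsh

Doubling the chunk length roughly halves the number of Chunk objects, which is the knob most directly tied to line 3 of the histogram above.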
On Tue, Aug 5, 2014 at 2:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:

> Also, right now the "top" command shows that we are at 500-700% CPU, and we
> have 23 total processors, which means we have a lot of idle CPU left over,
> so throwing more threads at compaction and flush should alleviate the
> problem?
>
> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>
>> Right now, we have 6 flush writers and compaction_throughput_mb_per_sec
>> is set to 0, which I believe disables throttling.
>>
>> Also, here is the iostat -x 5 5 output:
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      10.00   1450.35   50.79   55.92     9775.97   12030.14   204.34    1.56      14.62   1.05   11.21
>> dm-0     0.00    0.00      3.59    18.82     166.52    150.35     14.14     0.44      19.49   0.54   1.22
>> dm-1     0.00    0.00      2.32    5.37      18.56     42.98      8.00      0.76      98.82   0.43   0.33
>> dm-2     0.00    0.00      162.17  5836.66   32714.46  47040.87   13.30     5.57      0.90    0.06   36.00
>> sdb      0.40    4251.90   106.72  107.35    23123.61  35204.09   272.46    4.43      20.68   1.29   27.64
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           14.64  10.75  1.81     13.50    0.00    59.29
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
>> dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
>> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
>> dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
>> sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           15.51  19.77  1.97     5.02     0.00    57.73
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      16.20   523.40    60.00   285.00    5220.80   5913.60    32.27     0.25      0.72    0.60   20.86
>> dm-0     0.00    0.00      0.80    1.40      32.00     11.20      19.64     0.01      3.18    1.55   0.34
>> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      21.00   2.62   0.42
>> dm-2     0.00    0.00      339.40  5886.80   66219.20  47092.80   18.20     251.66    184.72  0.10   63.48
>> sdb      1.00    5025.40   264.20  209.20    60992.00  50422.40   235.35    5.98      40.92   1.23   58.28
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           16.59  16.34  2.03     9.01     0.00    56.04
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52    0.39   7.76
>> dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68    0.68   0.26
>> dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00    0.00   0.00
>> dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82   0.06   76.96
>> sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75   0.97   75.08
>>
>> On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>>
>>> Hi Ruchir,
>>>
>>> The large number of blocked flushes and the number of pending compactions
>>> would still indicate IO contention. Can you post the output of
>>> 'iostat -x 5 5'?
>>>
>>> If you do in fact have spare IO, there are several configuration options
>>> you can tune, such as increasing the number of flush writers and
>>> compaction_throughput_mb_per_sec.
>>>
>>> Mark
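For reference, the two knobs Mark mentions are plain cassandra.yaml settings, and compaction throughput can also be changed on a running node without a restart. A minimal sketch; the 16 MB/s figure is only an illustrative starting point, not a recommendation from this thread:

    # cassandra.yaml settings under discussion (values shown are illustrative):
    #   memtable_flush_writers: 6             # number of memtable flush threads
    #   compaction_throughput_mb_per_sec: 0   # 0 disables compaction throttling
    #
    # Compaction throughput can also be adjusted live, e.g. to re-enable
    # throttling at 16 MB/s while watching iostat:
    nodetool setcompactionthroughput 16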
>>> On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>
>>>> Also, Mark, to your comment on my tpstats output: below is my iostat
>>>> output, and the iowait is at 4.59%, which means no IO pressure, but we
>>>> are still seeing the bad flush performance. Should we try increasing the
>>>> flush writers?
>>>>
>>>> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)  08/05/2014  _x86_64_  (24 CPU)
>>>>
>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>           5.80   10.25  0.65     4.59     0.00    78.72
>>>>
>>>> Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read     Blk_wrtn
>>>> sda      103.83   9630.62     11982.60    3231174328   4020290310
>>>> dm-0     13.57    160.17      81.12       53739546     27217432
>>>> dm-1     7.59     16.94       43.77       5682200      14686784
>>>> dm-2     5792.76  32242.66    45427.12    10817753530  15241278360
>>>> sdb      206.09   22789.19    33569.27    7646015080   11262843224
>>>>
>>>> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>
>>>>> nodetool status:
>>>>>
>>>>> Datacenter: datacenter1
>>>>> =======================
>>>>> Status=Up/Down
>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>>>>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>>>>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>>>>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>>>>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>>>>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>>>>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>>>>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>>>>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>>>>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>>>>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>>>>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>>>>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>>>>
>>>>> Note: The new node is not part of the above list.
>>>>>
>>>>> nodetool compactionstats:
>>>>>
>>>>> pending tasks: 1649
>>>>> compaction type  keyspace        column family          completed   total         unit   progress
>>>>> Compaction       iprod           customerorder          1682804084  17956558077   bytes  9.37%
>>>>> Compaction       prodgate        customerorder          1664239271  1693502275    bytes  98.27%
>>>>> Compaction       qa_config_bkup  fixsessionconfig_hist  2443        27253         bytes  8.96%
>>>>> Compaction       prodgate        customerorder_hist     1770577280  5026699390    bytes  35.22%
>>>>> Compaction       iprodgate       customerorder_hist     2959560205  312350192622  bytes  0.95%
>>>>>
>>>>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>
>>>>>>> Yes num_tokens is set to 256. initial_token is blank on all nodes
>>>>>>> including the new one.
>>>>>>
>>>>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token
>>>>>> commented out; this means you are using vnodes and the new node will
>>>>>> automatically grab a list of tokens to take over responsibility for.
>>>>>>
>>>>>>> Pool Name     Active  Pending  Completed  Blocked  All time blocked
>>>>>>> FlushWriter   0       0        1136       0        512
>>>>>>>
>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>
>>>>>> This is a problem, as it indicates that the IO system cannot keep up.
>>>>>>
>>>>>>> Just ran this on the new node:
>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>> 10
>>>>>>
>>>>>> This is normal, as the new node will most likely take tokens from all
>>>>>> nodes in the cluster.
>>>>>>
>>>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>>>> other existing nodes have themselves in the seeds list, but the new
>>>>>>> node does not have itself in the seeds list. Can that cause this issue?
>>>>>>
>>>>>> Seeds are only used when a new node is bootstrapping into the cluster
>>>>>> and needs a set of IPs to contact and discover the cluster, so this
>>>>>> would have no impact on data sizes or streaming. In general it would be
>>>>>> considered best practice to have a set of 2-3 seeds from each data
>>>>>> center, with all nodes having the same seed list.
>>>>>>
>>>>>> What is the current output of 'nodetool compactionstats'? Could you also
>>>>>> paste the output of 'nodetool status <keyspace>'?
>>>>>>
>>>>>> Mark
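To put numbers behind the "IO system cannot keep up" diagnosis above, it can help to watch the FlushWriter pool over time rather than as a one-off snapshot. A small sketch (mine, not from the thread); the awk field positions assume the tpstats column layout quoted above:

    # Poll tpstats every minute and log the FlushWriter pool counters.
    # Fields assumed (per the output quoted above):
    #   $1=Pool  $2=Active  $3=Pending  $4=Completed  $5=Blocked  $6=All-time blocked
    while true; do
        ts=$(date +%H:%M:%S)
        nodetool tpstats | awk -v ts="$ts" '$1 == "FlushWriter" {
            printf "%s active=%s pending=%s completed=%s blocked=%s all_time_blocked=%s\n",
                   ts, $2, $3, $4, $5, $6
        }'
        sleep 60
    done

If the all-time blocked counter keeps climbing between samples while iostat shows headroom, more flush writers may help; if it climbs while the data disks sit above 90% util, the disks are the bottleneck.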
>>>>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>>>> other existing nodes have themselves in the seeds list, but the new
>>>>>>> node does not have itself in the seeds list. Can that cause this issue?
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just ran this on the new node:
>>>>>>>>
>>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>>> 10
>>>>>>>>
>>>>>>>> Seems like the new node is receiving data from 10 other nodes. Is
>>>>>>>> that expected in a vnodes-enabled environment?
>>>>>>>>
>>>>>>>> Ruchir.
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Also, not sure if this is relevant, but I just noticed the nodetool
>>>>>>>>> tpstats output:
>>>>>>>>>
>>>>>>>>> Pool Name     Active  Pending  Completed  Blocked  All time blocked
>>>>>>>>> FlushWriter   0       0        1136       0        512
>>>>>>>>>
>>>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes,
>>>>>>>>>> including the new one.
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> My understanding was that if initial_token is left empty on the
>>>>>>>>>>>> new node, it just contacts the heaviest node and bisects its
>>>>>>>>>>>> token range.
>>>>>>>>>>>
>>>>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the
>>>>>>>>>>> new node will take token ranges dynamically. What is the
>>>>>>>>>>> configuration of your other nodes; are you setting num_tokens or
>>>>>>>>>>> initial_token on those?
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Patricia for your response!
>>>>>>>>>>>>
>>>>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>>>>
>>>>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable
>>>>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>>>>
>>>>>>>>>>>> So basically it is just busy flushing and compacting. Would you
>>>>>>>>>>>> have any ideas on why the 2x disk space blow-up? My understanding
>>>>>>>>>>>> was that if initial_token is left empty on the new node, it just
>>>>>>>>>>>> contacts the heaviest node and bisects its token range. And the
>>>>>>>>>>>> heaviest node is around 2.1 TB, and the new node is already at
>>>>>>>>>>>> 4 TB. Could this be because compaction is falling behind?
>>>>>>>>>>>>
>>>>>>>>>>>> Ruchir
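On the 2x question just above: one quick way to see whether the extra space is simply data that compaction has not caught up with yet is to compare SSTable counts and space usage per column family. A sketch; the grep patterns assume the cfstats wording used by 1.2/2.0-era nodetool:

    # Show per-CF SSTable counts and space usage; a very high SSTable count on
    # the big column families during bootstrap usually means compaction is
    # lagging behind the incoming streams.
    nodetool cfstats | grep -E 'Keyspace:|Column Family:|SSTable count:|Space used'
    # Pending compaction backlog for the node as a whole:
    nodetool compactionstats | head -1

During a vnode bootstrap the node receives ranges from many peers at once, so data lands in many small SSTables, and the on-disk size can stay well above the eventual post-compaction size until that backlog drains.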
>>>>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ruchir,
>>>>>>>>>>>>>
>>>>>>>>>>>>> What exactly are you seeing in the logs? Are you running major
>>>>>>>>>>>>> compactions on the new bootstrapping node?
>>>>>>>>>>>>>
>>>>>>>>>>>>> With respect to the seed list, it is generally advisable to use
>>>>>>>>>>>>> 3 seed nodes per AZ / DC.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to bootstrap the thirteenth node in a 12-node
>>>>>>>>>>>>>> cluster where the average data size per node is about 2.1 TB.
>>>>>>>>>>>>>> The bootstrap streaming has been going on for 2 days now, and
>>>>>>>>>>>>>> the disk size on the new node is already above 4 TB and still
>>>>>>>>>>>>>> growing. Is this because the new node is running major
>>>>>>>>>>>>>> compactions while the streaming is going on?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing that I noticed that seemed off was that the seeds
>>>>>>>>>>>>>> property in the yaml of the 13th node comprises nodes 1..12,
>>>>>>>>>>>>>> whereas the seeds property on the existing 12 nodes consists of
>>>>>>>>>>>>>> all the other nodes except the thirteenth node. Is this an
>>>>>>>>>>>>>> issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ruchir.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Patricia Gorla
>>>>>>>>>>>>> @patriciagorla
>>>>>>>>>>>>>
>>>>>>>>>>>>> Consultant
>>>>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>>>>> http://www.thelastpickle.com
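On the seeds question that started the thread, the setting in question is the seed_provider block of cassandra.yaml. A sketch only; the seed IPs are picked arbitrarily from the nodetool status output above for illustration, and the config path and host list are assumptions:

    # cassandra.yaml excerpt (illustrative seed choices); per the advice above,
    # keep the same short seed list (2-3 nodes per DC) on every node, the new
    # node included:
    #   seed_provider:
    #       - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    #         parameters:
    #             - seeds: "10.10.20.27,10.10.20.35,10.10.20.51"
    #
    # Quick consistency check across the cluster (host list file and yaml path
    # are assumptions for this sketch):
    for h in $(cat cassandra_hosts.txt); do
        echo "== $h"
        ssh "$h" "grep -A4 'seed_provider' /etc/cassandra/conf/cassandra.yaml"
    done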