Also, right now the "top" command shows that we are at 500-700% CPU, and we have 24 processors in total (per the iostat header quoted below), which means we have a lot of idle CPU left over. So throwing more threads at compaction and flush should alleviate the problem, right?
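A quick way to sanity-check that there really is per-core headroom before raising thread counts (a minimal sketch; mpstat ships in the same sysstat package as iostat):

    nproc              # logical CPU count; the iostat header quoted below reports 24
    mpstat -P ALL 5 1  # per-core utilization over one 5-second window
    # If a few cores sit near 100% while the rest idle, the node is thread-limited
    # rather than CPU-limited, and extra flush/compaction threads may indeed help.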
On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:

> Right now we have 6 flush writers, and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling.
>
> Also, here is the iostat -x 5 5 output:
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      10.00   1450.35   50.79   55.92     9775.97   12030.14   204.34    1.56      14.62   1.05   11.21
> dm-0     0.00    0.00      3.59    18.82     166.52    150.35     14.14     0.44      19.49   0.54   1.22
> dm-1     0.00    0.00      2.32    5.37      18.56     42.98      8.00      0.76      98.82   0.43   0.33
> dm-2     0.00    0.00      162.17  5836.66   32714.46  47040.87   13.30     5.57      0.90    0.06   36.00
> sdb      0.40    4251.90   106.72  107.35    23123.61  35204.09   272.46    4.43      20.68   1.29   27.64
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           14.64  10.75  1.81     13.50    0.00    59.29
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
> dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
> dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
> sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           15.51  19.77  1.97     5.02     0.00    57.73
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      16.20   523.40    60.00   285.00    5220.80   5913.60    32.27     0.25      0.72    0.60   20.86
> dm-0     0.00    0.00      0.80    1.40      32.00     11.20      19.64     0.01      3.18    1.55   0.34
> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      21.00   2.62   0.42
> dm-2     0.00    0.00      339.40  5886.80   66219.20  47092.80   18.20     251.66    184.72  0.10   63.48
> sdb      1.00    5025.40   264.20  209.20    60992.00  50422.40   235.35    5.98      40.92   1.23   58.28
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           16.59  16.34  2.03     9.01     0.00    56.04
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52    0.39   7.76
> dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68    0.68   0.26
> dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00    0.00   0.00
> dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82   0.06   76.96
> sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75   0.97   75.08
>
> On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>
>> Hi Ruchir,
>>
>> The large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5'?
>>
>> If you do in fact have spare IO, there are several configuration options you can tune, such as increasing the number of flush writers and compaction_throughput_mb_per_sec.
>>
>> Mark
>>
>> On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>
>>> Also, Mark, to your comment on my tpstats output: below is my iostat output, and the iowait is at 4.59%, which means no IO pressure, but we are still seeing the bad flush performance. Should we try increasing the flush writers?
>>>
>>> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)   08/05/2014   _x86_64_   (24 CPU)
>>>
>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>           5.80   10.25  0.65     4.59     0.00    78.72
>>>
>>> Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read     Blk_wrtn
>>> sda      103.83   9630.62     11982.60    3231174328   4020290310
>>> dm-0     13.57    160.17      81.12       53739546     27217432
>>> dm-1     7.59     16.94       43.77       5682200      14686784
>>> dm-2     5792.76  32242.66    45427.12    10817753530  15241278360
>>> sdb      206.09   22789.19    33569.27    7646015080   11262843224
>>>
>>> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>
>>>> nodetool status:
>>>>
>>>> Datacenter: datacenter1
>>>> =======================
>>>> Status=Up/Down
>>>> |/ State=Normal/Leaving/Joining/Moving
>>>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>>>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>>>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>>>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>>>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>>>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>>>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>>>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>>>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>>>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>>>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>>>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>>>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>>>
>>>> Note: the new node is not part of the above list.
>>>>
>>>> nodetool compactionstats:
>>>>
>>>> pending tasks: 1649
>>>>  compaction type  keyspace        column family          completed   total         unit   progress
>>>>       Compaction  iprod           customerorder          1682804084  17956558077   bytes  9.37%
>>>>       Compaction  prodgate        customerorder          1664239271  1693502275    bytes  98.27%
>>>>       Compaction  qa_config_bkup  fixsessionconfig_hist  2443        27253         bytes  8.96%
>>>>       Compaction  prodgate        customerorder_hist     1770577280  5026699390    bytes  35.22%
>>>>       Compaction  iprodgate       customerorder_hist     2959560205  312350192622  bytes  0.95%
>>>>
>>>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>
>>>>>> Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one.
>>>>>
>>>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token commented out. This means you are using vnodes, and the new node will automatically grab a list of tokens to take over responsibility for.
>>>>>
>>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>>> FlushWriter  0       0        1136       0        512
>>>>>>
>>>>>> Looks like about 50% of flushes are blocked.
>>>>>
>>>>> This is a problem, as it indicates that the IO system cannot keep up.
>>>>>
>>>>>> Just ran this on the new node:
>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>> 10
>>>>>
>>>>> This is normal, as the new node will most likely take tokens from all nodes in the cluster.
>>>>>
>>>>>> Sorry for the multiple updates, but another thing I found was that all the other existing nodes have themselves in the seeds list, but the new node does not have itself in its seeds list. Can that cause this issue?
>>>>>
>>>>> Seeds are only used when a new node is bootstrapping into the cluster and needs a set of IPs to contact and discover the cluster, so this would have no impact on data sizes or streaming. In general it would be considered best practice to have a set of 2-3 seeds from each data center, with all nodes having the same seed list.
>>>>>
>>>>> What is the current output of 'nodetool compactionstats'? Could you also paste the output of 'nodetool status <keyspace>'?
>>>>>
>>>>> Mark
>>>>>
>>>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>
>>>>>> Sorry for the multiple updates, but another thing I found was that all the other existing nodes have themselves in the seeds list, but the new node does not have itself in its seeds list. Can that cause this issue?
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Just ran this on the new node:
>>>>>>>
>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>> 10
>>>>>>>
>>>>>>> Seems like the new node is receiving data from 10 other nodes. Is that expected in a vnodes-enabled environment?
>>>>>>>
>>>>>>> Ruchir.
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Also, not sure if this is relevant, but I just noticed the nodetool tpstats output:
>>>>>>>>
>>>>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>>>>> FlushWriter  0       0        1136       0        512
>>>>>>>>
>>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes, including the new one.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>>>
>>>>>>>>>>> My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range.
>>>>>>>>>>
>>>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the new node will take token ranges dynamically. What is the configuration of your other nodes? Are you setting num_tokens or initial_token on those?
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Patricia for your response!
>>>>>>>>>>>
>>>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>>>
>>>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable
>>>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>>>
>>>>>>>>>>> So basically it is just busy flushing and compacting. Would you have any idea why the disk usage has blown up 2x?
>>>>>>>>>>> My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. The heaviest node is around 2.1 TB, and the new node is already at 4 TB. Could this be because compaction is falling behind?
>>>>>>>>>>>
>>>>>>>>>>> Ruchir
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ruchir,
>>>>>>>>>>>>
>>>>>>>>>>>> What exactly are you seeing in the logs? Are you running major compactions on the new bootstrapping node?
>>>>>>>>>>>>
>>>>>>>>>>>> With respect to the seed list, it is generally advisable to use 3 seed nodes per AZ / DC.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am trying to bootstrap the thirteenth node into a 12-node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk usage on the new node is already above 4 TB and still growing. Is this because the new node is running major compactions while the streaming is going on?
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing I noticed that seemed off: the seeds property in the yaml of the 13th node comprises nodes 1 through 12, whereas the seeds property on each of the existing 12 nodes consists of all the other nodes except the thirteenth. Is this an issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ruchir.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Patricia Gorla
>>>>>>>>>>>> @patriciagorla
>>>>>>>>>>>>
>>>>>>>>>>>> Consultant
>>>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>>>> http://www.thelastpickle.com
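A note on the tpstats figures quoted above: the FlushWriter pool shows 512 flushes blocked at some point against 1136 completed, so 512 / (1136 + 512) is roughly 31% of all flush attempts blocked (512 / 1136 is roughly 45% relative to completed flushes, which is where the "about 50%" in the thread comes from). Either reading supports the same conclusion Mark draws: the IO system is not keeping up with flush activity.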
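The settings discussed throughout the thread all live in cassandra.yaml. A minimal sketch for inspecting them on each node (the config path and the sample seed list are assumptions for illustration, not confirmed details of this cluster):

    # Print the knobs discussed above (adjust the path to your install):
    grep -nE '^#? ?(num_tokens|initial_token|memtable_flush_writers|compaction_throughput_mb_per_sec)' /etc/cassandra/conf/cassandra.yaml
    grep -n -e '- seeds:' /etc/cassandra/conf/cassandra.yaml
    # Per the thread, this cluster runs with:
    #   num_tokens: 256                        (vnodes enabled)
    #   # initial_token:                       (left blank on every node)
    #   memtable_flush_writers: 6
    #   compaction_throughput_mb_per_sec: 0    (0 disables compaction throttling)
    # The advice given is a short, identical seed list on every node, e.g.
    #   - seeds: "10.10.20.27,10.10.20.62,10.10.20.47"   (2-3 per DC)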
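And a rough sketch of a watch loop for the question of whether compaction on the bootstrapping node is falling behind while streaming runs (the interval and the data directory are assumptions):

    while true; do
      date
      nodetool compactionstats | grep 'pending tasks'   # backlog; should fall once streaming ends
      nodetool netstats | grep -c 'Streaming from'      # nodes still streaming to this one
      df -h /var/lib/cassandra/data                     # on-disk footprint of the new node
      sleep 300
    done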