Am I the only one who finds it funny that the "ceph problem" was fixed by
an update to the disk controller firmware? :-)

Ian

On Thu, Sep 3, 2015 at 11:13 AM, Vickey Singh <vickey.singh22...@gmail.com>
wrote:

> Hey Mark / Community
>
> This is the sequence of changes that seems to have fixed the Ceph
> problem:
>
> 1# Upgraded the disk controller firmware from 6.34 to 6.64 (latest); a
> rough way to verify it is sketched below
> 2# Rebooted all nodes so the new firmware takes effect
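>
> One way to confirm the running controller firmware on HP Smart Array
> hardware (assuming the hpssacli utility is installed) is something like:
>
>     hpssacli controller all show detail | grep -i 'firmware version'    # firmware per controller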
>
> Read and write performance is now back to normal, as are system load and
> CPU utilization.
>
> - Vickey -
>
>
> On Wed, Sep 2, 2015 at 11:28 PM, Vickey Singh <vickey.singh22...@gmail.com
> > wrote:
>
>> Thank you Mark, please see my responses below.
>>
>> On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson <mnel...@redhat.com> wrote:
>>
>>> On 09/02/2015 08:51 AM, Vickey Singh wrote:
>>>
>>>> Hello Ceph Experts
>>>>
>>>> I have a strange problem: when I read from or write to a Ceph pool, it
>>>> is not performing properly. Please notice the cur MB/s column, which keeps
>>>> going up and down.
>>>>
>>>> -- Ceph Hammer 0.94.2
>>>> -- CentOS 6, kernel 2.6
>>>> -- Ceph cluster is healthy
>>>>
>>>
>>> You might find that CentOS 7 gives you better performance.  In some cases
>>> we were seeing nearly 2X.
>>
>>
>> Wow, 2X! I will definitely plan for an upgrade. Thanks.
>>
>>
>>>
>>>
>>>
>>>>
>>>> One interesting thing is that whenever I start a rados bench command for
>>>> read or write, CPU idle % drops to ~10 and system load increases
>>>> dramatically.
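>>>>
>>>> One way to watch this while the bench runs, assuming the sysstat package
>>>> is installed:
>>>>
>>>>     sar -u 1         # per-second CPU utilization, including %idle and %iowait
>>>>     iostat -x 1      # per-disk utilization and await
>>>>     uptime           # 1 / 5 / 15 minute load averages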
>>>>
>>>> Hardware
>>>>
>>>> HP SL4540
>>>>
>>>
>>> Please make sure the controller is on the newest firmware.  There used
>>> to be a bug that would cause sequential write performance to bottleneck
>>> when writeback cache was enabled on the RAID controller.
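>>>
>>> For example, something like this (the slot number is just a placeholder)
>>> should show both the firmware version and the current cache settings:
>>>
>>>     hpssacli controller slot=0 show detail | egrep -i 'firmware|cache'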
>>
>>
>> Last month I upgraded the firmware on this hardware, so I hope it is up
>> to date.
>>
>>
>>>
>>>
>>>> 32-core CPU
>>>> 196 GB memory
>>>> 10G network
>>>>
>>>
>>> Be sure to check the network too.  We've seen a lot of cases where folks
>>> have been burned by one of the NICs acting funky.
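>>>
>>> A few quick sanity checks (interface name and peer host are placeholders):
>>>
>>>     ip -s link show eth0            # RX/TX errors and drops
>>>     ethtool eth0 | grep -i speed    # confirm the link negotiated 10Gb/s
>>>     iperf -c <other-node>           # raw node-to-node throughput (run iperf -s on the peer first)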
>>>
>>
>> At first glance, the interfaces look good and they are pushing data nicely
>> (whatever they are given).
>>
>>
>>>
>>>
>>>> I don't think the hardware is the problem.
>>>>
>>>> Please give me clues / pointers on how I should troubleshoot this
>>>> problem.
>>>>
>>>>
>>>>
>>>> # rados bench -p glance-test 60 write
>>>>   Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds
>>>> or 0 objects
>>>>   Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
>>>>     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>       0       0         0         0         0         0         -         0
>>>>       1      16        20         4     15.99        16   0.12308   0.10001
>>>>       2      16        37        21   41.9841        68   1.79104  0.827021
>>>>       3      16        68        52   69.3122       124  0.084304  0.854829
>>>>       4      16       114        98   97.9746       184   0.12285  0.614507
>>>>       5      16       188       172   137.568       296  0.210669  0.449784
>>>>       6      16       248       232   154.634       240  0.090418  0.390647
>>>>       7      16       305       289    165.11       228  0.069769  0.347957
>>>>       8      16       331       315   157.471       104  0.026247    0.3345
>>>>       9      16       361       345   153.306       120  0.082861  0.320711
>>>>      10      16       380       364   145.575        76  0.027964  0.310004
>>>>      11      16       393       377   137.067        52   3.73332  0.393318
>>>>      12      16       448       432   143.971       220  0.334664  0.415606
>>>>      13      16       476       460   141.508       112  0.271096  0.406574
>>>>      14      16       497       481   137.399        84  0.257794  0.412006
>>>>      15      16       507       491   130.906        40   1.49351  0.428057
>>>>      16      16       529       513   115.042        88  0.399384   0.48009
>>>>      17      16       533       517   94.6286        16   5.50641  0.507804
>>>>      18      16       537       521    83.405        16   4.42682  0.549951
>>>>      19      16       538       522    80.349         4   11.2052  0.570363
>>>> 2015-09-02 09:26:18.398641 min lat: 0.023851 max lat: 11.2052 avg lat: 0.570363
>>>>     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>      20      16       538       522   77.3611         0         -  0.570363
>>>>      21      16       540       524   74.8825         4   8.88847  0.591767
>>>>      22      16       542       526   72.5748         8   1.41627  0.593555
>>>>      23      16       543       527   70.2873         4    8.0856  0.607771
>>>>      24      16       555       539   69.5674        48  0.145199  0.781685
>>>>      25      16       560       544   68.0177        20    1.4342  0.787017
>>>>      26      16       564       548   66.4241        16  0.451905   0.78765
>>>>      27      16       566       550   64.7055         8  0.611129  0.787898
>>>>      28      16       570       554   63.3138        16   2.51086  0.797067
>>>>      29      16       570       554   61.5549         0         -  0.797067
>>>>      30      16       572       556   60.1071         4   7.71382  0.830697
>>>>      31      16       577       561   59.0515        20   23.3501  0.916368
>>>>      32      16       590       574   58.8705        52  0.336684  0.956958
>>>>      33      16       591       575   57.4986         4   1.92811  0.958647
>>>>      34      16       591       575   56.0961         0         -  0.958647
>>>>      35      16       591       575   54.7603         0         -  0.958647
>>>>      36      16       597       581   54.0447         8  0.187351   1.00313
>>>>      37      16       625       609   52.8394       112   2.12256   1.09256
>>>>      38      16       631       615    52.227        24   1.57413   1.10206
>>>>      39      16       638       622   51.7232        28   4.41663   1.15086
>>>> 2015-09-02 09:26:40.510623 min lat: 0.023851 max lat: 27.6704 avg lat: 1.15657
>>>>     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>      40      16       652       636   51.8102        56  0.113345   1.15657
>>>>      41      16       682       666   53.1443       120  0.041251   1.17813
>>>>      42      16       685       669   52.3395        12  0.501285   1.17421
>>>>      43      15       690       675   51.7955        24   2.26605   1.18357
>>>>      44      16       728       712   53.6062       148  0.589826   1.17478
>>>>      45      16       728       712   52.6158         0         -   1.17478
>>>>      46      16       728       712   51.6613         0         -   1.17478
>>>>      47      16       728       712   50.7407         0         -   1.17478
>>>>      48      16       772       756   52.9332        44  0.234811    1.1946
>>>>      49      16       835       819   56.3577       252   5.67087   1.12063
>>>>      50      16       890       874   59.1252       220  0.230806   1.06778
>>>>      51      16       896       880   58.5409        24  0.382471   1.06121
>>>>      52      16       896       880   57.5832         0         -   1.06121
>>>>      53      16       896       880   56.6562         0         -   1.06121
>>>>      54      16       896       880   55.7587         0         -   1.06121
>>>>      55      16       897       881   54.9515         1   4.88333   1.06554
>>>>      56      16       897       881   54.1077         0         -   1.06554
>>>>      57      16       897       881   53.2894         0         -   1.06554
>>>>      58      16       897       881   51.9335         0         -   1.06554
>>>>      59      16       897       881   51.1792         0         -   1.06554
>>>> 2015-09-02 09:27:01.267301 min lat: 0.01405 max lat: 27.6704 avg lat: 1.06554
>>>>     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>      60      16       897       881   50.4445         0         -   1.06554
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>      cluster 98d89661-f616-49eb-9ccf-84d720e179c0
>>>>       health HEALTH_OK
>>>>       monmap e3: 3 mons at
>>>> {s01=10.100.50.1:6789/0,s02=10.100.50.2:6789/0,s03=10.100.50.3:6789/0},
>>>> election epoch 666, quorum 0,1,2 s01,s02,s03
>>>>       osdmap e121039: 240 osds: 240 up, 240 in
>>>>        pgmap v850698: 7232 pgs, 31 pools, 439 GB data, 43090 kobjects
>>>>              2635 GB used, 867 TB / 870 TB avail
>>>>                  7226 active+clean
>>>>                     6 active+clean+scrubbing+deep
>>>>
>>>
>>> Note the last line there.  You'll likely want to try your test again
>>> when scrubbing is complete.  Also, you may want to try this script:
>>>
>>
>> Yeah, I have tried a few times when the cluster is perfectly healthy (not
>> doing scrubbing / repairs).
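>>
>> One way to keep scrubbing out of the picture during a test run (remembering
>> to unset the flags afterwards):
>>
>>     ceph osd set noscrub
>>     ceph osd set nodeep-scrub
>>     # ... run the benchmark ...
>>     ceph osd unset noscrub
>>     ceph osd unset nodeep-scrub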
>>
>>
>>>
>>> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
>>>
>>> You can invoke it like:
>>>
>>> ceph pg dump | ./readpgdump.py
>>>
>>> That will give you a bunch of information about the pools on your
>>> system.  I'm a little concerned about how many PGs your glance-test pool
>>> may have, given your totals above.
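>>>
>>> If you just want the raw numbers, something like this also works (pool
>>> name taken from your bench command):
>>>
>>>     ceph osd pool get glance-test pg_num
>>>     ceph osd dump | grep '^pool'     # pg_num / pgp_num for every pool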
>>>
>>
>> Thanks for the link. I will do that and also run rados bench on other
>> pools (where the PG count is higher).
>>
>>
>> Now here are my some observations
>>
>> 1#  When the cluster is not doing anything (HEALTH_OK, no background
>> scrubbing / repairing) and all system resources (CPU/MEM/NET) are mostly
>> idle, and I then start rados bench (write / rand / seq), after a few
>> seconds:
>>
>>       --- rados bench output drops from ~500 MB/s to a few tens of MB/s
>>       --- At the same time CPU busy reaches ~90% and system load jumps up
>>
>> Once rados bench completes:
>>
>>      --- After a few minutes the system resources become idle again
>>
>> 2#  Sometimes some PGs become unclean for a few minutes while rados bench
>> runs, and then they quickly go back to active+clean.
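>>
>> While the drop is happening, it may also be worth capturing some OSD-side
>> numbers (osd.0 below is just an example id; run the daemon command on the
>> host that OSD lives on):
>>
>>     ceph osd perf                          # commit / apply latency per OSD
>>     ceph health detail                     # lists any slow / blocked requests
>>     ceph daemon osd.0 dump_historic_ops    # recent slow ops with per-step timings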
>>
>>
>> I am out of clues, so any help from the community that points me in the
>> right direction would be appreciated.
>>
>>
>> - Vickey -
>>
>>
>>>
>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
>
>


-- 
Ian R. Colle
Global Director of Software Engineering
Red Hat, Inc.
ico...@redhat.com
+1-303-601-7713
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
