Am I the only one who finds it funny that the "ceph problem" was fixed by an update to the disk controller firmware? :-)
Ian

On Thu, Sep 3, 2015 at 11:13 AM, Vickey Singh <vickey.singh22...@gmail.com> wrote:
> Hey Mark / Community
>
> This is the sequence of changes that seems to have fixed the ceph problem:
>
> 1# Upgrading the disk controller firmware from 6.34 to 6.64 (latest)
> 2# Rebooting all nodes so that the new firmware takes effect
>
> Read and write operations are now normal, as are system load and CPU utilization.
>
> - Vickey -
>
> On Wed, Sep 2, 2015 at 11:28 PM, Vickey Singh <vickey.singh22...@gmail.com> wrote:
>> Thank you Mark, please see my responses below.
>>
>> On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson <mnel...@redhat.com> wrote:
>>> On 09/02/2015 08:51 AM, Vickey Singh wrote:
>>>> Hello Ceph Experts
>>>>
>>>> I have a strange problem: when I am reading from or writing to a Ceph pool, it does not perform properly. Please notice cur MB/s, which keeps going up and down.
>>>>
>>>> -- Ceph Hammer 0.94.2
>>>> -- CentOS 6 (kernel 2.6)
>>>> -- Ceph cluster is healthy
>>>
>>> You might find that CentOS 7 gives you better performance. In some cases we were seeing nearly 2X.
>>
>> Wooo, 2X! I would definitely plan for an upgrade. Thanks.
>>
>>>> One interesting thing is that whenever I start a rados bench command for read or write, CPU idle % drops to ~10 and system load increases like anything.
>>>>
>>>> Hardware:
>>>>
>>>> HP SL4540
>>>
>>> Please make sure the controller is on the newest firmware. There used to be a bug that would cause sequential write performance to bottleneck when writeback cache was enabled on the RAID controller.
>>
>> Last month I upgraded the firmware for this hardware, so I hope it is up to date.
>>
>>>> 32-core CPU
>>>> 196G memory
>>>> 10G network
>>>
>>> Be sure to check the network too. We've seen a lot of cases where folks have been burned by one of the NICs acting funky.
>>
>> At first view the interfaces look good and they are pushing data nicely (whatever they are getting).
>>
>>>> I don't think hardware is the problem.
>>>>
>>>> Please give me clues / pointers on how I should troubleshoot this problem.
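A note on reproducing the read side of the problem: the seq / rand tests mentioned in this thread read back objects left behind by an earlier write run, so a typical benchmarking sequence looks roughly like the one below (same pool name and runtime as in the output that follows; adjust as needed):

    rados bench -p glance-test 60 write --no-cleanup   # leave the benchmark objects in place
    rados bench -p glance-test 60 seq                  # sequential reads of those objects
    rados bench -p glance-test 60 rand                 # random reads of those objects
    rados -p glance-test cleanup                       # remove the benchmark objects afterwards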
>>>> # rados bench -p glance-test 60 write
>>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
>>>> Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>     0       0         0         0         0         0         -         0
>>>>     1      16        20         4     15.99        16   0.12308   0.10001
>>>>     2      16        37        21   41.9841        68   1.79104  0.827021
>>>>     3      16        68        52   69.3122       124  0.084304  0.854829
>>>>     4      16       114        98   97.9746       184   0.12285  0.614507
>>>>     5      16       188       172   137.568       296  0.210669  0.449784
>>>>     6      16       248       232   154.634       240  0.090418  0.390647
>>>>     7      16       305       289    165.11       228  0.069769  0.347957
>>>>     8      16       331       315   157.471       104  0.026247    0.3345
>>>>     9      16       361       345   153.306       120  0.082861  0.320711
>>>>    10      16       380       364   145.575        76  0.027964  0.310004
>>>>    11      16       393       377   137.067        52   3.73332  0.393318
>>>>    12      16       448       432   143.971       220  0.334664  0.415606
>>>>    13      16       476       460   141.508       112  0.271096  0.406574
>>>>    14      16       497       481   137.399        84  0.257794  0.412006
>>>>    15      16       507       491   130.906        40   1.49351  0.428057
>>>>    16      16       529       513   115.042        88  0.399384   0.48009
>>>>    17      16       533       517   94.6286        16   5.50641  0.507804
>>>>    18      16       537       521    83.405        16   4.42682  0.549951
>>>>    19      16       538       522    80.349         4   11.2052  0.570363
>>>> 2015-09-02 09:26:18.398641 min lat: 0.023851 max lat: 11.2052 avg lat: 0.570363
>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>    20      16       538       522   77.3611         0         -  0.570363
>>>>    21      16       540       524   74.8825         4   8.88847  0.591767
>>>>    22      16       542       526   72.5748         8   1.41627  0.593555
>>>>    23      16       543       527   70.2873         4    8.0856  0.607771
>>>>    24      16       555       539   69.5674        48  0.145199  0.781685
>>>>    25      16       560       544   68.0177        20    1.4342  0.787017
>>>>    26      16       564       548   66.4241        16  0.451905   0.78765
>>>>    27      16       566       550   64.7055         8  0.611129  0.787898
>>>>    28      16       570       554   63.3138        16   2.51086  0.797067
>>>>    29      16       570       554   61.5549         0         -  0.797067
>>>>    30      16       572       556   60.1071         4   7.71382  0.830697
>>>>    31      16       577       561   59.0515        20   23.3501  0.916368
>>>>    32      16       590       574   58.8705        52  0.336684  0.956958
>>>>    33      16       591       575   57.4986         4   1.92811  0.958647
>>>>    34      16       591       575   56.0961         0         -  0.958647
>>>>    35      16       591       575   54.7603         0         -  0.958647
>>>>    36      16       597       581   54.0447         8  0.187351   1.00313
>>>>    37      16       625       609   52.8394       112   2.12256   1.09256
>>>>    38      16       631       615    52.227        24   1.57413   1.10206
>>>>    39      16       638       622   51.7232        28   4.41663   1.15086
>>>> 2015-09-02 09:26:40.510623 min lat: 0.023851 max lat: 27.6704 avg lat: 1.15657
>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>    40      16       652       636   51.8102        56  0.113345   1.15657
>>>>    41      16       682       666   53.1443       120  0.041251   1.17813
>>>>    42      16       685       669   52.3395        12  0.501285   1.17421
>>>>    43      15       690       675   51.7955        24   2.26605   1.18357
>>>>    44      16       728       712   53.6062       148  0.589826   1.17478
>>>>    45      16       728       712   52.6158         0         -   1.17478
>>>>    46      16       728       712   51.6613         0         -   1.17478
>>>>    47      16       728       712   50.7407         0         -   1.17478
>>>>    48      16       772       756   52.9332        44  0.234811    1.1946
>>>>    49      16       835       819   56.3577       252   5.67087   1.12063
>>>>    50      16       890       874   59.1252       220  0.230806   1.06778
>>>>    51      16       896       880   58.5409        24  0.382471   1.06121
>>>>    52      16       896       880   57.5832         0         -   1.06121
>>>>    53      16       896       880   56.6562         0         -   1.06121
>>>>    54      16       896       880   55.7587         0         -   1.06121
>>>>    55      16       897       881   54.9515         1   4.88333   1.06554
>>>>    56      16       897       881   54.1077         0         -   1.06554
>>>>    57      16       897       881   53.2894         0         -   1.06554
>>>>    58      16       897       881   51.9335         0         -   1.06554
>>>>    59      16       897       881   51.1792         0         -   1.06554
>>>> 2015-09-02 09:27:01.267301 min lat: 0.01405 max lat: 27.6704 avg lat: 1.06554
>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>    60      16       897       881   50.4445         0         -   1.06554
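A side note on reading that output: stretches where cur MB/s sits at 0 usually mean the in-flight writes are stuck behind a handful of slow OSDs rather than the whole cluster slowing down uniformly. While the benchmark is running, commands along these lines can help point at the culprits (the osd id below is only a placeholder; run dump_historic_ops on the host carrying that OSD):

    ceph health detail                     # lists any blocked/slow requests and the OSDs involved
    ceph osd perf                          # per-OSD commit/apply latency; large outliers suggest a bad disk or controller
    ceph daemon osd.42 dump_historic_ops   # shows where recent slow ops on that OSD spent their time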
>>>>       cluster 98d89661-f616-49eb-9ccf-84d720e179c0
>>>>        health HEALTH_OK
>>>>        monmap e3: 3 mons at {s01=10.100.50.1:6789/0,s02=10.100.50.2:6789/0,s03=10.100.50.3:6789/0}, election epoch 666, quorum 0,1,2 s01,s02,s03
>>>>        osdmap e121039: 240 osds: 240 up, 240 in
>>>>         pgmap v850698: 7232 pgs, 31 pools, 439 GB data, 43090 kobjects
>>>>               2635 GB used, 867 TB / 870 TB avail
>>>>                   7226 active+clean
>>>>                      6 active+clean+scrubbing+deep
>>>
>>> Note the last line there. You'll likely want to try your test again when scrubbing is complete. Also, you may want to try this script:
>>
>> Yeah, I have tried a few times when the cluster is perfectly healthy (not doing scrubbing / repairs).
>>
>>> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
>>>
>>> You can invoke it like:
>>>
>>> ceph pg dump | ./readpgdump.py
>>>
>>> That will give you a bunch of information about the pools on your system. I'm a little concerned about how many PGs your glance-test pool may have, given your totals above.
>>
>> Thanks for the link, I will do that and also run rados bench for other pools (where the PG count is higher).
>>
>> Now here are some of my observations:
>>
>> 1# When the cluster is not doing anything (HEALTH_OK, no background scrubbing / repairing) and all system resources (CPU/MEM/NET) are mostly idle, then after I start rados bench (write / rand / seq), within a few seconds:
>>
>> --- rados bench throughput drops from ~500 MB/s to a few tens of MB/s
>> --- At the same time CPU busy hits 90% and system load jumps up
>>
>> Once rados bench completes:
>>
>> --- After a few minutes, system resources become idle again
>>
>> 2# Sometimes some PGs become unclean for a few minutes while rados bench runs, and then quickly become active+clean again.
>>
>> I am out of clues, so any help from the community that leads me to think in the right direction would be helpful.
>>
>> - Vickey -

--
Ian R. Colle
Global Director of Software Engineering
Red Hat, Inc.
ico...@redhat.com
+1-303-601-7713
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com