First and foremost, have you checked your disk controller? The most important thing to look at is the cache battery (BBU). Any time I have a single node acting up, the controller is suspect #1.
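As a starting point, here is a minimal sketch that reads the `megacli -LDGetProp -cache` output you pasted and flags any virtual disk whose cache policy differs from the majority. The function names (`parse_cache_policies`, `odd_ones_out`) are hypothetical, and the sample text is copied from your message. It only compares policies; it does not talk to the controller. The battery state itself should be available from `megacli -AdpBbuCmd -GetBbuStatus -a0`, assuming your MegaCli build supports that command.

```python
import re
from collections import Counter

# Sample copied from the megacli output quoted in this thread.
SAMPLE = """\
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
"""

# Assumed line shape, based only on the sample above.
LINE_RE = re.compile(r"Adapter (\d+)-VD (\d+)\(target id: \d+\): Cache Policy:(.*)")


def parse_cache_policies(text):
    """Map (adapter, virtual disk) to a tuple of cache policy fields."""
    policies = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            fields = tuple(f.strip() for f in match.group(3).split(","))
            policies[key] = fields
    return policies


def odd_ones_out(policies):
    """Return the virtual disks whose policy differs from the most common one."""
    if not policies:
        return {}
    common, _ = Counter(policies.values()).most_common(1)[0]
    return {key: value for key, value in policies.items() if value != common}


if __name__ == "__main__":
    for (adapter, vd), policy in odd_ones_out(parse_cache_policies(SAMPLE)).items():
        print(f"adapter {adapter} VD {vd}: {', '.join(policy)}")
```

Run the same check against the output from each of the three servers. A virtual disk that differs only on the slow node is worth a closer look.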
On Thu, Apr 5, 2018 at 11:23 AM Steven Vacaroaia <ste...@gmail.com> wrote:

> Hi,
>
> I have a strange issue - OSDs from a specific server are introducing huge
> performance issue
>
> This is a brand new installation on 3 identical servers -
> DELL R620 with PERC H710, bluestore DB and WAL on SSD, 10GB dedicated
> private/public networks
>
> When I add the OSD I see gaps like below and huge latency
>
> atop provides no clear culprit EXCEPT very low network and specific disk
> utilization BUT 100% DSK for ceph-osd process which stay like that (100%)
> for the duration of the test
> (see below)
>
> Not sure why ceph-osd process DSK stays at 100% while all the specific
> DSK (for sdb, sde ..etc) are 1% busy ?
>
> Any help/ instructions for how to troubleshooting this will be appreciated
>
> (apologies if the format is not being kept)
>
> CPU | sys 4% | user 1% | irq 1% | idle 794% | wait 0% | steal 0% | guest 0% | curf 2.20GHz | curscal ?% |
> CPL | avg1 0.00 | avg5 0.00 | avg15 0.00 | csw 547/s | intr 832/s | numcpu 8 |
> MEM | tot 62.9G | free 61.4G | cache 520.6M | dirty 0.0M | buff 7.5M | slab 98.9M | slrec 64.8M | shmem 8.8M | shrss 0.0M | shswp 0.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
> SWP | tot 6.0G | free 6.0G | vmcom 1.5G | vmlim 37.4G |
> LVM | dm-0 | busy 1% | read 0/s | write 54/s | KiB/r 0 | KiB/w 455 | MBr/s 0.0 | MBw/s 24.0 | avq 3.69 | avio 0.14 ms |
> DSK | sdb | busy 1% | read 0/s | write 102/s | KiB/r 0 | KiB/w 240 | MBr/s 0.0 | MBw/s 24.0 | avq 6.69 | avio 0.08 ms |
> DSK | sda | busy 0% | read 0/s | write 12/s | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | MBw/s 0.1 | avq 1.00 | avio 0.05 ms |
> DSK | sde | busy 0% | read 0/s | write 0/s | KiB/r 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avq 1.00 | avio 2.50 ms |
> NET | transport | tcpi 718/s | tcpo 972/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 21/s | tcpie 0/s | tcpor 0/s | udpnp 0/s | udpie 0/s |
> NET | network | ipi 719/s | ipo 399/s | ipfrw 0/s | deliv 719/s | icmpi 0/s | icmpo 0/s |
> NET | eth5 1% | pcki 2214/s | pcko 939/s | sp 10 Gbps | si 154 Mbps | so 52 Mbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
> NET | eth4 0% | pcki 712/s | pcko 54/s | sp 10 Gbps | si 50 Mbps | so 90 Kbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
>
>   PID  TID  RDDSK  WRDSK   WCANCL  DSK   CMD       1/21
>  2067  -    0K/s   0.0G/s  0K/s    100%  ceph-osd
>
> 2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>    40      16      1096      1080   107.988         0           -    0.496822
>    41      16      1096      1080   105.354         0           -    0.496822
>    42      16      1096      1080   102.846         0           -    0.496822
>    43      16      1096      1080   100.454         0           -    0.496822
>    44      16      1205      1189   108.079   48.4444   0.0430396    0.588127
>    45      16      1234      1218   108.255       116   0.0318717    0.575485
>    46      16      1234      1218   105.901         0           -    0.575485
>    47      16      1234      1218   103.648         0           -    0.575485
>    48      16      1234      1218   101.489         0           -    0.575485
>    49      16      1261      1245   101.622        27    0.157469    0.604268
>    50      16      1335      1319   105.508       296    0.191907    0.604862
>    51      16      1418      1402   109.949       332   0.0367004    0.573429
>    52      16      1437      1421   109.296        76    0.031818    0.566289
>    53      16      1481      1465   110.554       176   0.0405567    0.564885
>    54      16      1516      1500   111.099       140   0.0272873    0.552698
>    55      16      1516      1500   109.079         0           -    0.552698
>    56      16      1516      1500   107.131         0           -    0.552698
>    57      16      1516      1500   105.252         0           -    0.552698
>    58      16      1555      1539   106.127        39     0.15675    0.601747
>
> Total time run:       58.971664
> Total reads made:     1565
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   106.153
> Average IOPS:         26
> Stddev IOPS:          33
> Max IOPS:             121
> Min IOPS:             0
> Average Latency(s):   0.600788
> Max latency(s):       10.7501
> Min latency(s):       0.019135
>
> megacli -LDGetProp -cache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
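
To put a number on the gaps in the bench output quoted above, here is a rough sketch that finds runs of consecutive seconds where `cur MB/s` is 0. The helper names (`parse_rows`, `stall_runs`) are hypothetical, and the parser assumes the eight-column per-second layout shown in the quoted output; other Ceph releases may print different columns.

```python
# Per-second rows copied from the bench output quoted in this thread.
SAMPLE = """\
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16      1096      1080   107.988         0           -    0.496822
   41      16      1096      1080   105.354         0           -    0.496822
   42      16      1096      1080   102.846         0           -    0.496822
   43      16      1096      1080   100.454         0           -    0.496822
   44      16      1205      1189   108.079   48.4444   0.0430396    0.588127
   45      16      1234      1218   108.255       116   0.0318717    0.575485
   46      16      1234      1218   105.901         0           -    0.575485
   47      16      1234      1218   103.648         0           -    0.575485
   48      16      1234      1218   101.489         0           -    0.575485
   49      16      1261      1245   101.622        27    0.157469    0.604268
   50      16      1335      1319   105.508       296    0.191907    0.604862
   51      16      1418      1402   109.949       332   0.0367004    0.573429
   52      16      1437      1421   109.296        76    0.031818    0.566289
   53      16      1481      1465   110.554       176   0.0405567    0.564885
   54      16      1516      1500   111.099       140   0.0272873    0.552698
   55      16      1516      1500   109.079         0           -    0.552698
   56      16      1516      1500   107.131         0           -    0.552698
   57      16      1516      1500   105.252         0           -    0.552698
   58      16      1555      1539   106.127        39     0.15675    0.601747
"""


def parse_rows(text):
    """Return (second, cur MB/s) pairs from the per-second bench lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 8 whitespace-separated fields and start with the second.
        if len(parts) == 8 and parts[0].isdigit():
            rows.append((int(parts[0]), float(parts[5])))
    return rows


def stall_runs(rows):
    """Return (first_sec, last_sec) for each run of seconds with zero throughput."""
    runs = []
    start = prev = None
    for sec, cur in rows:
        if cur == 0:
            if start is None:
                start = sec
            prev = sec
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return runs


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    runs = stall_runs(rows)
    stalled = sum(last - first + 1 for first, last in runs)
    print(f"stall runs: {runs}")
    print(f"{stalled} of {len(rows)} sampled seconds had no completed I/O")
```

Comparing the stall timestamps with controller or kernel logs from the same window may show whether the pauses line up with something on the suspect node.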