The disk controller seems fine. Any other suggestions would be really appreciated.
megacli -AdpBbuCmd -aAll

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3925 mV
Current: 0 mA
Temperature: 17 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status                             : None
  Voltage                                     : OK
  Temperature                                 : OK
  Learn Cycle Requested                       : No
  Learn Cycle Active                          : No
  Learn Cycle Status                          : OK
  Learn Cycle Timeout                         : No
  I2c Errors Detected                         : No
  Battery Pack Missing                        : No
  Battery Replacement required                : No
  Remaining Capacity Low                      : No
  Periodic Learn Required                     : No
  Transparent Learn                           : No
  No space to cache offload                   : No
  Pack is about to fail & should be replaced  : No
  Cache Offload premium feature required      : No
  Module microcode update required            : No

megacli -AdpDiag -a0

Performing Diagnostic on Controller 0.
It will take 20 seconds to complete. Please wait...
Diagnostic Completed on Controller 0.

Exit Code: 0x00

On Fri, 6 Apr 2018 at 15:11, David Turner <drakonst...@gmail.com> wrote:

> First and foremost, have you checked your disk controller. Of most import
> would be your cache battery. Any time I have a single node acting up, the
> controller is Suspect #1.
>
> On Thu, Apr 5, 2018 at 11:23 AM Steven Vacaroaia <ste...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a strange issue - OSDs from a specific server are introducing huge
>> performance issue
>>
>> This is a brand new installation on 3 identical servers -
>> DELL R620 with PERC H710 , bluestore DB and WAL on SSD, 10GB dedicated
>> private/public networks
>>
>>
>> When I add the OSD I see gaps like below and huge latency
>>
>> atop provides no clear culprit EXCEPT very low network and specific disk
>> utilization BUT 100% DSK for ceph-osd process which stay like that ( 100%)
>> for the duration of the test
>> ( see below)
>>
>> Not sure why ceph-osd process DSK stays at 100% while all the specific
>> DSK ( for sdb, sde ..etc) are 1% busy ?
>>
>> Any help/ instructions for how to troubleshooting this will be
>> appreciated
>>
>> (apologies if the format is not being kept)
>>
>>
>> CPU | sys 4% | user 1% | irq 1% | idle 794% | wait 0% | steal 0% | guest 0% | curf 2.20GHz | curscal ?% |
>> CPL | avg1 0.00 | avg5 0.00 | avg15 0.00 | csw 547/s | intr 832/s | numcpu 8 |
>> MEM | tot 62.9G | free 61.4G | cache 520.6M | dirty 0.0M | buff 7.5M | slab 98.9M | slrec 64.8M | shmem 8.8M | shrss 0.0M | shswp 0.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
>> SWP | tot 6.0G | free 6.0G | vmcom 1.5G | vmlim 37.4G |
>> LVM | dm-0 | busy 1% | read 0/s | write 54/s  | KiB/r 0 | KiB/w 455 | MBr/s 0.0 | MBw/s 24.0 | avq 3.69 | avio 0.14 ms |
>> DSK | sdb  | busy 1% | read 0/s | write 102/s | KiB/r 0 | KiB/w 240 | MBr/s 0.0 | MBw/s 24.0 | avq 6.69 | avio 0.08 ms |
>> DSK | sda  | busy 0% | read 0/s | write 12/s  | KiB/r 0 | KiB/w 4   | MBr/s 0.0 | MBw/s 0.1  | avq 1.00 | avio 0.05 ms |
>> DSK | sde  | busy 0% | read 0/s | write 0/s   | KiB/r 0 | KiB/w 0   | MBr/s 0.0 | MBw/s 0.0  | avq 1.00 | avio 2.50 ms |
>> NET | transport | tcpi 718/s | tcpo 972/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 21/s | tcpie 0/s | tcpor 0/s | udpnp 0/s | udpie 0/s |
>> NET | network | ipi 719/s | ipo 399/s | ipfrw 0/s | deliv 719/s | icmpi 0/s | icmpo 0/s |
>> NET | eth5 1% | pcki 2214/s | pcko 939/s | sp 10 Gbps | si 154 Mbps | so 52 Mbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
>> NET | eth4 0% | pcki 712/s  | pcko 54/s  | sp 10 Gbps | si 50 Mbps  | so 90 Kbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
>>
>>  PID   TID   RDDSK   WRDSK   WCANCL   DSK   CMD        1/21
>> 2067     -    0K/s  0.0G/s     0K/s  100%   ceph-osd
>>
>>
>> 2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>    40      16      1096      1080   107.988         0            -    0.496822
>>    41      16      1096      1080   105.354         0            -    0.496822
>>    42      16      1096      1080   102.846         0            -    0.496822
>>    43      16      1096      1080   100.454         0            -    0.496822
>>    44      16      1205      1189   108.079   48.4444    0.0430396    0.588127
>>    45      16      1234      1218   108.255       116    0.0318717    0.575485
>>    46      16      1234      1218   105.901         0            -    0.575485
>>    47      16      1234      1218   103.648         0            -    0.575485
>>    48      16      1234      1218   101.489         0            -    0.575485
>>    49      16      1261      1245   101.622        27     0.157469    0.604268
>>    50      16      1335      1319   105.508       296     0.191907    0.604862
>>    51      16      1418      1402   109.949       332    0.0367004    0.573429
>>    52      16      1437      1421   109.296        76     0.031818    0.566289
>>    53      16      1481      1465   110.554       176    0.0405567    0.564885
>>    54      16      1516      1500   111.099       140    0.0272873    0.552698
>>    55      16      1516      1500   109.079         0            -    0.552698
>>    56      16      1516      1500   107.131         0            -    0.552698
>>    57      16      1516      1500   105.252         0            -    0.552698
>>    58      16      1555      1539   106.127        39      0.15675    0.601747
>>
>> Total time run:       58.971664
>> Total reads made:     1565
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   106.153
>> Average IOPS:         26
>> Stddev IOPS:          33
>> Max IOPS:             121
>> Min IOPS:             0
>> Average Latency(s):   0.600788
>> Max latency(s):       10.7501
>> Min latency(s):       0.019135
>>
>>
>> megacli -LDGetProp -cache -Lall -a0
>>
>> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
>> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
>> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
>> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
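
The "gaps" in the quoted rados bench run can be counted instead of eyeballed, which makes it easier to compare runs before and after a change. Below is a minimal Python sketch, assuming the per-second rows keep the eight-column layout quoted in this thread (sec, cur ops, started, finished, avg MB/s, cur MB/s, last lat, avg lat). The names find_stalls and SAMPLE are hypothetical, made up for illustration; they are not part of any Ceph tool.

```python
# Hypothetical helper: count seconds in which rados bench moved no data.
# Assumes the eight-column per-second layout quoted in this thread.

SAMPLE = """\
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
40 16 1096 1080 107.988 0 - 0.496822
41 16 1096 1080 105.354 0 - 0.496822
42 16 1096 1080 102.846 0 - 0.496822
43 16 1096 1080 100.454 0 - 0.496822
44 16 1205 1189 108.079 48.4444 0.0430396 0.588127
45 16 1234 1218 108.255 116 0.0318717 0.575485
46 16 1234 1218 105.901 0 - 0.575485
47 16 1234 1218 103.648 0 - 0.575485
48 16 1234 1218 101.489 0 - 0.575485
49 16 1261 1245 101.622 27 0.157469 0.604268
50 16 1335 1319 105.508 296 0.191907 0.604862
51 16 1418 1402 109.949 332 0.0367004 0.573429
52 16 1437 1421 109.296 76 0.031818 0.566289
53 16 1481 1465 110.554 176 0.0405567 0.564885
54 16 1516 1500 111.099 140 0.0272873 0.552698
55 16 1516 1500 109.079 0 - 0.552698
56 16 1516 1500 107.131 0 - 0.552698
57 16 1516 1500 105.252 0 - 0.552698
58 16 1555 1539 106.127 39 0.15675 0.601747
"""


def find_stalls(bench_lines):
    """Return (stalled_seconds, longest_run) for per-second bench rows."""
    stalled = []
    for line in bench_lines:
        fields = line.split()
        # Skip headers, timestamps and summary lines: a data row has
        # exactly eight fields and starts with the integer second.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        # Column 6 is "cur MB/s"; zero means nothing completed that second.
        if float(fields[5]) == 0:
            stalled.append(int(fields[0]))

    # Length of the longest run of consecutive stalled seconds.
    longest = run = 0
    prev = None
    for sec in stalled:
        run = run + 1 if prev is not None and sec == prev + 1 else 1
        longest = max(longest, run)
        prev = sec
    return stalled, longest


if __name__ == "__main__":
    stalled, longest = find_stalls(SAMPLE.splitlines())
    print("stalled seconds:", stalled)
    print("longest stall (s):", longest)
```

On the rows quoted above this reports 10 stalled seconds out of the 19 shown, with the longest stall lasting 4 seconds (40 through 43). It only summarises the symptom; it says nothing about the cause.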