First and foremost, have you checked your disk controller? Of most import
would be your cache battery. Any time I have a single node acting up, the
controller is Suspect #1.
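
Here is a rough sketch of the kind of check I mean. `megacli -AdpBbuCmd -GetBbuStatus -aALL` prints a "Battery State" line you can look for; the sample text and field layout below are illustrative, not taken from your controller, and exact field names can vary by firmware:

```python
# Minimal sketch: flag a PERC/LSI cache battery that is not healthy.
# SAMPLE_BBU_STATUS is illustrative output in the shape of
# `megacli -AdpBbuCmd -GetBbuStatus -aALL`; treat the parsing as an
# approximation, since field names differ between firmware revisions.

SAMPLE_BBU_STATUS = """\
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4061 mV
Current: 0 mA
Temperature: 28 C
Battery State: Optimal
"""

def battery_ok(status_text: str) -> bool:
    """Return True only if every 'Battery State' line reports Optimal."""
    states = [line.split(":", 1)[1].strip()
              for line in status_text.splitlines()
              if line.strip().startswith("Battery State")]
    return bool(states) and all(s == "Optimal" for s in states)

if __name__ == "__main__":
    print("BBU healthy:", battery_ok(SAMPLE_BBU_STATUS))
```

Anything other than Optimal (Failed, Degraded, a relearn cycle in progress) is worth ruling out before chasing Ceph itself.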

On Thu, Apr 5, 2018 at 11:23 AM Steven Vacaroaia <ste...@gmail.com> wrote:

> Hi,
>
> I have a strange issue - OSDs from a specific server are introducing huge
> performance issues
>
> This is a brand new installation on 3 identical servers -
>  DELL R620 with PERC H710, bluestore DB and WAL on SSD, 10 Gb dedicated
> private/public networks
>
>
> When I add the OSDs I see gaps like the ones below, and huge latency
>
> atop provides no clear culprit EXCEPT very low network and per-disk
> utilization BUT 100% DSK for the ceph-osd process, which stays like that
> (100%) for the duration of the test
> ( see below)
>
> Not sure why the ceph-osd process DSK stays at 100% while all the specific
> DSKs ( for sdb, sde, etc.) are only 1% busy?
>
> Any help / instructions on how to troubleshoot this would be appreciated
>
> (apologies if the format is not being kept)
>
>
> CPU | sys      4% | user     1% | irq      1% | idle   794% | wait     0% | steal    0% | guest    0% | curf 2.20GHz | curscal   ?% |
> CPL | avg1   0.00 | avg5   0.00 | avg15  0.00 | csw   547/s | intr  832/s | numcpu    8 |
> MEM | tot   62.9G | free  61.4G | cache 520.6M | dirty  0.0M | buff   7.5M | slab  98.9M | slrec 64.8M | shmem  8.8M | shrss  0.0M | shswp  0.0M | vmbal  0.0M | hptot  0.0M | hpuse  0.0M |
> SWP | tot    6.0G | free   6.0G | vmcom  1.5G | vmlim 37.4G |
> LVM | dm-0        | busy     1% | read    0/s | write  54/s | KiB/r     0 | KiB/w   455 | MBr/s   0.0 | MBw/s  24.0 | avq    3.69 | avio 0.14 ms |
> DSK | sdb         | busy     1% | read    0/s | write 102/s | KiB/r     0 | KiB/w   240 | MBr/s   0.0 | MBw/s  24.0 | avq    6.69 | avio 0.08 ms |
> DSK | sda         | busy     0% | read    0/s | write  12/s | KiB/r     0 | KiB/w     4 | MBr/s   0.0 | MBw/s   0.1 | avq    1.00 | avio 0.05 ms |
> DSK | sde         | busy     0% | read    0/s | write   0/s | KiB/r     0 | KiB/w     0 | MBr/s   0.0 | MBw/s   0.0 | avq    1.00 | avio 2.50 ms |
> NET | transport   | tcpi  718/s | tcpo  972/s | udpi    0/s | udpo    0/s | tcpao   0/s | tcppo   0/s | tcprs  21/s | tcpie   0/s | tcpor   0/s | udpnp   0/s | udpie   0/s |
> NET | network     | ipi   719/s | ipo   399/s | ipfrw   0/s | deliv 719/s | icmpi   0/s | icmpo   0/s |
> NET | eth5     1% | pcki 2214/s | pcko  939/s | sp  10 Gbps | si 154 Mbps | so  52 Mbps | coll    0/s | mlti    0/s | erri    0/s | erro    0/s | drpi    0/s | drpo    0/s |
> NET | eth4     0% | pcki  712/s | pcko   54/s | sp  10 Gbps | si  50 Mbps | so  90 Kbps | coll    0/s | mlti    0/s | erri    0/s | erro    0/s | drpi    0/s | drpo    0/s |
>
>     PID     TID    RDDSK     WRDSK   WCANCL    DSK   CMD        1/21
>    2067       -     0K/s    0.0G/s     0K/s   100%   ceph-osd
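
One note on reading that output: as far as I understand atop, the per-process DSK column is the process's share of all disk transfer during the sample interval, not a device busy percentage. So ceph-osd at 100% alongside sdb at only 1% busy just says ceph-osd is responsible for essentially all of a small amount of I/O. A toy illustration of that interpretation (the byte counts below are made up to mirror the ~24 MB/s write rate shown for sdb, and my reading of atop's semantics is an assumption, not gospel):

```python
# Sketch of how (I believe) atop derives the per-process DSK column:
# each process's share of *total* disk transfer in the interval,
# not the busy% of any one device. The numbers here are hypothetical.

def dsk_shares(io_bytes_per_proc: dict[str, int]) -> dict[str, float]:
    """Map process name -> percentage share of total disk bytes moved."""
    total = sum(io_bytes_per_proc.values())
    if total == 0:
        return {name: 0.0 for name in io_bytes_per_proc}
    return {name: 100.0 * b / total for name, b in io_bytes_per_proc.items()}

# ceph-osd doing nearly all of a *small* amount of I/O still rounds to 100%:
shares = dsk_shares({"ceph-osd": 24_000_000, "jbd2": 100_000})
print(shares)
```

In other words, the 100% by itself isn't alarming; the 10.75 s max latency spikes are the real symptom.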
>
>
>
>
>
> 2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>    40      16      1096      1080   107.988         0           -   0.496822
>    41      16      1096      1080   105.354         0           -   0.496822
>    42      16      1096      1080   102.846         0           -   0.496822
>    43      16      1096      1080   100.454         0           -   0.496822
>    44      16      1205      1189   108.079   48.4444   0.0430396   0.588127
>    45      16      1234      1218   108.255       116   0.0318717   0.575485
>    46      16      1234      1218   105.901         0           -   0.575485
>    47      16      1234      1218   103.648         0           -   0.575485
>    48      16      1234      1218   101.489         0           -   0.575485
>    49      16      1261      1245   101.622        27    0.157469   0.604268
>    50      16      1335      1319   105.508       296    0.191907   0.604862
>    51      16      1418      1402   109.949       332   0.0367004   0.573429
>    52      16      1437      1421   109.296        76    0.031818   0.566289
>    53      16      1481      1465   110.554       176   0.0405567   0.564885
>    54      16      1516      1500   111.099       140   0.0272873   0.552698
>    55      16      1516      1500   109.079         0           -   0.552698
>    56      16      1516      1500   107.131         0           -   0.552698
>    57      16      1516      1500   105.252         0           -   0.552698
>    58      16      1555      1539   106.127        39     0.15675   0.601747
>
> Total time run:       58.971664
> Total reads made:     1565
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   106.153
> Average IOPS:         26
> Stddev IOPS:          33
> Max IOPS:             121
> Min IOPS:             0
> Average Latency(s):   0.600788
> Max latency(s):       10.7501
> Min latency(s):       0.019135
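
For what it's worth, the summary numbers are internally consistent, and a stddev IOPS (33) larger than the average (26) is exactly what those stalled 0 MB/s seconds look like statistically. Quick arithmetic check of the totals quoted above:

```python
# Cross-check of the rados bench summary: bandwidth and IOPS follow
# directly from total reads, object size, and elapsed time.

total_reads = 1565
object_size_bytes = 4194304          # 4 MiB objects, as reported
elapsed_s = 58.971664

bandwidth_mb_s = total_reads * (object_size_bytes / 1024 / 1024) / elapsed_s
avg_iops = total_reads / elapsed_s

print(f"Bandwidth (MB/sec): {bandwidth_mb_s:.3f}")  # 106.153, matching the report
print(f"Average IOPS: {int(avg_iops)}")             # 26, matching the report
```

So the averages are fine; it's the variance (multi-second windows at 0 MB/s, 10.75 s worst case) that points at something periodically stalling I/O.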
>
>
> megacli -LDGetProp -cache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone,
> Direct, Write Cache OK if bad BBU
> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive,
> Cached, No Write Cache if bad BBU
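
Also worth noting in that output: VD 0 is WriteThrough with "Write Cache OK if bad BBU", while VDs 1-3 are WriteBack with "No Write Cache if bad BBU", meaning a degraded battery would silently drop VDs 1-3 to write-through. A small sketch that parses exactly those quoted lines and flags the odd policy out (the parsing regex is my own, keyed to the output format shown above):

```python
import re

# Parse `megacli -LDGetProp -cache -Lall -a0` output (the lines quoted
# above) and flag virtual disks whose cache policy differs from the rest.

LDGETPROP = """\
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
"""

def odd_ones_out(text: str) -> list[str]:
    """Return VD ids whose write policy differs from the majority."""
    policies = {}
    for m in re.finditer(r"VD (\d+)\(.*?\): Cache Policy:(\w+)", text):
        policies[m.group(1)] = m.group(2)
    if not policies:
        return []
    majority = max(set(policies.values()), key=list(policies.values()).count)
    return [vd for vd, p in policies.items() if p != majority]

print(odd_ones_out(LDGETPROP))  # ['0'] - VD 0 is the outlier
```

Either way, mismatched cache policies across otherwise identical VDs on one server would be my first thing to reconcile, right after confirming the BBU is healthy.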
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>