Hi Guys

We have the following config:

2 x storage servers, 128GB RAM, dual E5-2609, LSI MegaRAID SAS 9271-4i; each
server has 24 x 3TB disks. These were originally set up as 8 groups of 3-disk
RAID0 (we are slowly moving to one OSD per disk). We initially had the
journals stored on an SSD, but after a disk failure this led to terrible
performance (await on the SSD was huge), so we added some PCIe SSDs. This
didn't change much and we still had the big await on the SSDs, so we removed
them. That made recovery pretty good (we were able to carry on working while
recovery was taking place; previously recovery = VMs down).
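
For reference, the one-OSD-per-disk conversion is just the stock ceph-deploy
flow; the hostname and device below are placeholders for one of our
freed-up disks, not our actual names:

    # placeholders: storage1 = one of the two servers, sdx = a disk freed from a RAID0 group
    ceph-deploy disk zap storage1:sdx
    ceph-deploy osd create storage1:sdx   # journal defaults to a second partition on the same disk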

Now two mornings in a row I have been paged due to really slow I/O.

It appears that the background deep scrubbing was using all the I/O the disks
had. This is the output of iostat -xk 3 during scrubbing:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.25    0.00    6.17   10.79    0.00   51.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00    12.33    0.00   26.67     0.00   130.67     9.80     0.07    2.55    0.00    2.55   2.55   6.80
sdi               0.00     8.00  269.33  319.33 31660.00  5124.33   124.98     4.51    7.47    5.68    8.98   1.67  98.13
sdk               0.00     0.00    9.00   88.00   117.33  2208.17    47.95     0.10    1.03   10.67    0.05   0.95   9.20
sdj               0.00     0.00    5.33  114.67    26.67   928.50    15.92     0.09    0.77   12.75    0.21   0.59   7.07
sdg               0.33     0.00  124.33   39.00 15537.33   488.83   196.24     0.46    2.69    3.53    0.00   2.05  33.47
sdf               0.00    10.67   12.33  277.33    96.00  2481.83    17.80     0.88    3.05   14.27    2.55   0.63  18.27
sde               0.00     0.00    3.00   32.67   381.33   280.33    37.10     0.10    2.84    7.11    2.45   1.68   6.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00   28.00     0.00   179.00    12.79     0.05    1.90    0.00    1.90   0.76   2.13
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00   32.33     0.00   130.67     8.08     0.07    2.14    0.00    2.14   2.10   6.80
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdp               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.10    0.00    6.08   23.03    0.00   63.79

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.33     0.00     1.33     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00   13.00     0.00   168.00    25.85     0.01    0.92    0.00    0.92   0.92   1.20
sdd               0.00     0.00   21.67  142.33   181.33 24873.33   305.54    29.24  175.61   76.62  190.68   6.09  99.87
sde               0.33     0.00  273.33   10.00 32257.33   242.67   229.41     1.45    5.11    5.25    1.20   3.13  88.67
sdf               0.00     0.00    0.67   44.00     2.67   520.00    23.40     0.03    0.75   20.00    0.45   0.75   3.33
sdg               0.00     0.00    2.00   13.33    68.00   170.67    31.13     0.24   13.30   70.00    4.80  12.61  19.33
sdi               0.00     0.00    1.00   47.67     5.33   767.83    31.77     0.08    1.62    8.00    1.48   0.30   1.47
sdh               0.00     0.00    1.00   21.00    54.67   264.00    28.97     0.06    1.94   10.67    1.52   0.61   1.33
sdj               1.33     0.00  344.33   16.67 42037.33   212.00   234.07     1.50    4.18    4.32    1.28   2.66  95.87
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.33     0.00     1.33     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

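For anyone reproducing this: I confirmed it was deep scrubbing, rather than
recovery or client load, with the standard status commands:

    ceph -s                        # PG states show "scrubbing+deep" while a deep scrub runs
    ceph pg dump | grep -i scrub   # per-PG scrub state plus last scrub / deep-scrub stamps
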

I have disabled scrubbing to return the systems to a usable state (ceph osd
set noscrub and ceph osd set nodeep-scrub).
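
For completeness, the exact flags and how to clear them again later (these
are cluster-wide flags, so they stop scrubbing everywhere):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # re-enable once the scrub settings are tuned:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub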

And here is how the iostat output looks now:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.33    0.00    6.73    0.67    0.00   59.27

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     6.67    0.00   20.00     0.00    88.00     8.80     0.05    2.53    0.00    2.53   2.53   5.07
sdi               0.00     0.00    1.67   89.33     9.33   520.00    11.63     0.03    0.34   15.20    0.06   0.34   3.07
sdk               0.00     0.00    1.33  139.33     8.00  1203.67    17.23     0.08    0.56   13.00    0.44   0.15   2.13
sdj               0.00     0.67    2.67  175.00   126.67  4583.33    53.02     0.02    0.12    4.00    0.06   0.11   1.87
sdg               0.00     0.00    1.00  151.00     5.33  1288.17    17.02     0.03    0.23   13.33    0.14   0.17   2.53
sdf               0.00     0.00    3.33   93.00    20.00   870.33    18.48     0.03    0.36    9.20    0.04   0.33   3.20
sde               0.00     0.00    0.00   13.33     0.00    72.00    10.80     0.00    0.20    0.00    0.20   0.20   0.27
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    1.33   26.00     8.00   182.67    13.95     0.02    0.68    9.00    0.26   0.63   1.73
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00   21.67     0.00    88.00     8.12     0.05    2.34    0.00    2.34   2.34   5.07
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdp               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.40    0.00    3.65    4.41    0.00   83.55

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00   62.67     0.00   778.00    24.83     0.18    2.89    0.00    2.89   0.28   1.73
sdd               0.00    10.33    1.67  160.00    13.33  1163.33    14.56    23.88   41.49   22.40   41.69   2.41  38.93
sde               0.00     0.00    0.00   50.33     0.00   601.17    23.89     0.11    2.09    0.00    2.09   0.21   1.07
sdf               0.00     0.00    0.00  106.67     0.00  1260.67    23.64     0.41    3.81    0.00    3.81   0.27   2.93
sdg               0.00     0.00    0.00   71.33     0.00   863.50    24.21     0.15    2.15    0.00    2.15   0.37   2.67
sdi               0.00     0.00    0.00   23.67     0.00   366.67    30.99     0.01    0.45    0.00    0.45   0.45   1.07
sdh               0.00     0.00    0.00   73.00     0.00   910.67    24.95     0.20    2.74    0.00    2.74   0.24   1.73
sdj               0.00     0.00    0.00   82.67     0.00   899.67    21.77     0.28    3.39    0.00    3.39   0.23   1.87
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Now clearly we can't live with scrubbing off for very long, so what can I do
to stop scrubbing from blocking I/O? Or, if this were you, how would you
reconfigure/rearchitect things?
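
For what it's worth, these are the scrub-throttling options I'm aware of and
am considering experimenting with; the values below are guesses on my part,
not tested settings:

    [osd]
    # at most this many concurrent scrubs per OSD (1 is the default)
    osd max scrubs = 1
    # don't start new scrubs while the host load average is above this
    osd scrub load threshold = 0.5
    # stretch deep scrubs out (seconds); the default is one week
    osd deep scrub interval = 1209600

These can apparently also be changed at runtime without restarting the OSDs:

    ceph tell osd.* injectargs '--osd-deep-scrub-interval 1209600'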

As an aside, has anyone used one of these to hold the journals, and would one
be enough for 24 journals?
http://www.amazon.com/FUSiON-iO-420GB-Solid-State-Drive/dp/B00DVMPXV0/ref=sr_1_1?ie=UTF8&qid=1391039407&sr=8-1&keywords=fusion-io
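
If we did go down that route, I'm assuming the journal wiring would look
roughly like this (the /dev/fioa* device names, the partition layout, and the
journal size are all hypothetical):

    [osd]
    # size of each journal partition, in MB (hypothetical value)
    osd journal size = 10240

    [osd.0]
    osd journal = /dev/fioa1   # hypothetical partition on the PCIe card
    [osd.1]
    osd journal = /dev/fioa2
    # ...one partition per OSD, 24 in total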

Thanks!
-- 
Geraint Jones
Director of Systems & Infrastructure
Koding 
(We are hiring!)
https://koding.com
gera...@koding.com
Phone (415) 653-0083


