On 07/02/15 13:49, German Anders wrote:
> output from iostat:
>
> CEPHOSD01:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc(ceph-0)       0.00     0.00    1.00  389.00     0.00    35.98   188.96    60.32  120.12   16.00  120.39   1.26  49.20
> sdd(ceph-1)       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdf(ceph-2)       0.00     1.00    6.00  521.00     0.02    60.72   236.05   143.10  309.75  484.00  307.74   1.90 100.00
> sdg(ceph-3)       0.00     0.00   11.00  535.00     0.04    42.41   159.22   139.25  279.72  394.18  277.37   1.83 100.00
> sdi(ceph-4)       0.00     1.00    4.00  560.00     0.02    54.87   199.32   125.96  187.07  562.00  184.39   1.65  93.20
> sdj(ceph-5)       0.00     0.00    0.00  566.00     0.00    61.41   222.19   109.13  169.62    0.00  169.62   1.53  86.40
> sdl(ceph-6)       0.00     0.00    8.00    0.00     0.09     0.00    23.00     0.12   12.00   12.00    0.00   2.50   2.00
> sdm(ceph-7)       0.00     0.00    2.00  481.00     0.01    44.59   189.12   116.64  241.41  268.00  241.30   2.05  99.20
> sdn(ceph-8)       0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.01    8.00    8.00    0.00   8.00   0.80
> fioa              0.00     0.00    0.00 1016.00     0.00    19.09    38.47     0.00    0.06    0.00    0.06   0.00   0.00
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc(ceph-0)       0.00     1.00   10.00  278.00     0.04    26.07   185.69    60.82  257.97  309.60  256.12   2.83  81.60
> sdd(ceph-1)       0.00     0.00    2.00    0.00     0.02     0.00    20.00     0.02   10.00   10.00    0.00  10.00   2.00
> sdf(ceph-2)       0.00     1.00    6.00  579.00     0.02    54.16   189.68   142.78  246.55  328.67  245.70   1.71 100.00
> sdg(ceph-3)       0.00     0.00   10.00   75.00     0.05     5.32   129.41     4.94  185.08   11.20  208.27   4.05  34.40
> sdi(ceph-4)       0.00     0.00   19.00  147.00     0.09    12.61   156.63    17.88  230.89  114.32  245.96   3.37  56.00
> sdj(ceph-5)       0.00     1.00    2.00  629.00     0.01    43.66   141.72   143.00  223.35  426.00  222.71   1.58 100.00
> sdl(ceph-6)       0.00     0.00   10.00    0.00     0.04     0.00     8.00     0.16   18.40   18.40    0.00   5.60   5.60
> sdm(ceph-7)       0.00     0.00   11.00    4.00     0.05     0.01     8.00     0.48   35.20   25.82   61.00  14.13  21.20
> sdn(ceph-8)       0.00     0.00    9.00    0.00     0.07     0.00    15.11     0.07    8.00    8.00    0.00   4.89   4.40
> fioa              0.00     0.00    0.00 6415.00     0.00   125.81    40.16     0.00    0.14    0.00    0.14   0.00   0.00
>
> CEPHOSD02:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc1(ceph-9)      0.00     0.00   13.00    0.00     0.11     0.00    16.62     0.17   13.23   13.23    0.00   4.92   6.40
> sdd1(ceph-10)     0.00     0.00   15.00    0.00     0.13     0.00    18.13     0.26   17.33   17.33    0.00   1.87   2.80
> sdf1(ceph-11)     0.00     0.00   22.00  650.00     0.11    51.75   158.04   143.27  212.07  308.55  208.81   1.49 100.00
> sdg1(ceph-12)     0.00     0.00   12.00  282.00     0.05    54.60   380.68    13.16  120.52  352.00  110.67   2.91  85.60
> sdi1(ceph-13)     0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.01    8.00    8.00    0.00   8.00   0.80
> sdj1(ceph-14)     0.00     0.00   20.00    0.00     0.08     0.00     8.00     0.26   12.80   12.80    0.00   3.60   7.20
> sdl1(ceph-15)     0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdm1(ceph-16)     0.00     0.00   20.00  424.00     0.11    32.20   149.05    89.69  235.30  243.00  234.93   2.14  95.20
> sdn1(ceph-17)     0.00     0.00    5.00  411.00     0.02    45.47   223.94    98.32  182.28 1057.60  171.63   2.40 100.00
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc1(ceph-9)      0.00     0.00   26.00  383.00     0.11    34.32   172.44    86.92  258.64  297.08  256.03   2.29  93.60
> sdd1(ceph-10)     0.00     0.00    8.00   31.00     0.09     1.86   101.95     0.84  178.15   94.00  199.87   6.46  25.20
> sdf1(ceph-11)     0.00     1.00    5.00  409.00     0.05    48.34   239.34    90.94  219.43  383.20  217.43   2.34  96.80
> sdg1(ceph-12)     0.00     0.00    0.00  238.00     0.00     1.64    14.12    58.34  143.60    0.00  143.60   1.83  43.60
> sdi1(ceph-13)     0.00     0.00   11.00    0.00     0.05     0.00    10.18     0.16   14.18   14.18    0.00   5.09   5.60
> sdj1(ceph-14)     0.00     0.00    1.00    0.00     0.00     0.00     8.00     0.02   16.00   16.00    0.00  16.00   1.60
> sdl1(ceph-15)     0.00     0.00    1.00    0.00     0.03     0.00    64.00     0.01   12.00   12.00    0.00  12.00   1.20
> sdm1(ceph-16)     0.00     1.00    4.00  587.00     0.03    50.09   173.69   143.32  244.97  296.00  244.62   1.69 100.00
> sdn1(ceph-17)     0.00     0.00    0.00  375.00     0.00    23.68   129.34    69.76  182.51    0.00  182.51   2.47  92.80

If this iostat output is typical, it seems you are limited by random
writes on a subset of your OSDs: you have 9 on each server, but only
between 4 and 6 of them are handling writes, and the wMB/s vs w/s ratio
points to a moderately random access pattern.
You should find out why. You may have a configuration problem, or
access to your rbds may be concentrated on a few 4MB sections (the
default object size) of the devices.
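
To make that ratio concrete, here is a quick sketch computing the
average write size per device from the w/s and wMB/s columns quoted
above (values taken from the first sample):

    #!/usr/bin/env python
    # Average write size per device, first iostat sample above.
    # Values far below the 4MB rbd object size mean the drives are
    # absorbing many small, scattered writes.
    samples = {  # device: (w/s, wMB/s)
        "sdc(ceph-0)": (389.0, 35.98),
        "sdf(ceph-2)": (521.0, 60.72),
        "sdj(ceph-5)": (566.0, 61.41),
    }
    for dev, (wps, wmbps) in sorted(samples.items()):
        avg_kb = wmbps * 1024 / wps if wps else 0.0
        print("%-12s %7.1f KB/write" % (dev, avg_kb))

This prints roughly 95-120 KB per write, which matches the avgrq-sz
column (for sdc, 188.96 sectors x 512B is about 94.5 KB) and is nowhere
near the 4MB object size.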

>
> The other OSD server had pretty much the same load.
>
> The config of the OSD servers is the following:
>
> - 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
> - 128G RAM
> - 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
> - 2x 10GbE ADPT DP
> - Journals are configured to run on a RAM disk (tmpfs), but on the
> first OSD server we have the journals going to a battery-backed
> FusionIO adapter (/dev/fioa).

I suppose this is not yet production (tmpfs journals). You only have
128G of RAM for 9 OSDs: what is the size of your journals when you use
tmpfs, and more importantly, what is the value of filestore max sync
interval? I'm not sure how the OSDs will react to a journal with
multi-GB/s write bandwidth: the default filestore max sync interval
might be too high (its purpose is to prevent the journal from filling
up). On the other hand, a low max sync interval will prevent the OS
from reordering writes to the hard drives, and that reordering is what
avoids too much random IO.
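
For reference, the two knobs involved would look like this in ceph.conf
(values purely illustrative, see the sizing discussion below):

    [osd]
    # journal size in MB (10 GB here)
    osd journal size = 10240
    # maximum seconds between filestore syncs; must be low enough
    # that the journal cannot fill up between two syncs
    filestore max sync interval = 30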

So there are two causes I can see that might lead to performance
problems:
- the IO load might not be distributed across all your OSDs, limiting
your total bandwidth,
- you might have IO freezes when your tmpfs journals fill up during
very high bursts (probably unlikely, but the consequences could be
dire).

Another problem I see is cost: FusionIO speed and cost (and tmpfs
speed) are probably overkill for the journals in your case. With your
setup, 2x Intel DC S3500 would probably be enough (unless you need more
write endurance).
With what you save by not putting a FusionIO card in each server, you
could probably add servers and get far better performance overall.

If you do, use a 10GB journal size and a filestore max sync interval
that allows only half of the journal to be written between syncs. With
2x 500MB/s of write bandwidth divided between 9 balanced OSDs, that is
about 110MB/s per OSD, so a 30s interval fits with room to spare.
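
The arithmetic behind that 30s figure, as a quick sketch (assuming the
2x 500MB/s journal write bandwidth above):

    # Journal budget check: at most half of a 10 GB journal
    # may fill between two filestore syncs.
    per_osd_mbps = 2 * 500.0 / 9      # ~111 MB/s per OSD journal
    written_mb = per_osd_mbps * 30    # ~3333 MB in a 30s interval
    budget_mb = 10 * 1024 / 2         # 5120 MB (half the journal)
    print(written_mb <= budget_mb)    # True, with room to spare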

This assumes you can distribute IOs to all OSDs; if you have atypical
access patterns, you might have to convert your rbds to a lower order
(smaller objects) or use striping to achieve this (see the example
below).
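
For example, a new image with 1MB objects (order 20), striped 64KB at a
time across 16 objects, could be created like this (pool and image
names are hypothetical, and data would have to be migrated from the
existing image):

    rbd create --size 102400 --image-format 2 --order 20 \
        --stripe-unit 65536 --stripe-count 16 rbd/newimage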

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
