From: w...@globe.de [mailto:w...@globe.de] 
Sent: 31 August 2016 08:56
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>; 'Horace Ng' 
<hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Nick,

What do you think about InfiniBand?

I have read that with InfiniBand the latency is around 1.2us.

It’s great, but I don’t believe the Ceph support for RDMA is finished yet, so 
you are stuck using IPoIB, which has similar performance to 10G Ethernet.

For now, concentrate on removing latency where you easily can (3.5+ GHz CPUs, 
NVMe journals), and then when stuff like RDMA comes along, you will be in a 
better place to take advantage of it.

 

Kind Regards!

 

On 31.08.16 at 09:51, Nick Fisk wrote:

 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600 GB SAS HP drives.

Performance is not good, so I won't paste the results here...

 

On another note: I've built Samsung SM863 enterprise SSDs into the Ceph cluster.

If I run a 4k test on the SSD directly, without a filesystem, I get the following

(see Sebastien Han's tests):

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77,700 KB/s / 4 KB ≈ 20,000 IOPS
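For cross-checking the dd number, the linked Sebastien Han post also uses fio for the same sync-write test; a minimal sketch, assuming fio is installed and /dev/sdd is the SSD under test (destructive, only run it on an empty device):

fio --filename=/dev/sdd --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test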

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19,300 KB/s / 4 KB ≈ 5,000 IOPS
I know once you have a FS on the device it will slow down due to the extra 
journal writes; maybe this is a little more than expected here, but it is still 
reasonably fast. Can you see in iostat how many IOs the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0,00     0,00    0,00 9625,00     0,00    25,85     5,50     0,60    0,06    0,00    0,06   0,06  59,60




So there seems to be an extra delay somewhere when writing via a FS instead of 
the raw device. You are still getting around 10,000 IOPS though, so not too bad.







If I use the SSD in the Ceph cluster and run the test again with rados bench, 
bs=4K and -t 1 (one thread), I only get 2-3 MByte/s.

2,500 KB/s / 4 KB ≈ 600 IOPS

My question is: how can the raw device performance be so much higher than the 
XFS and the Ceph RBD performance?

Ceph will be a lot slower, as you are replacing a 30cm SAS/SATA cable with 
networking and software, and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test, set replication to 1x.
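For reference, a rough sketch of how to flip the test pool's replication (assuming the pool is called rbd and holds only benchmark data):

ceph osd pool set rbd size 1
ceph osd pool set rbd min_size 1
# run the benchmark, then put it back:
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2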


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       1       402       401    1.5661   1.56641  0.00226091  0.00248929
    2       1       775       774   1.51142   1.45703   0.0021945  0.00258187
    3       1      1110      1109   1.44374   1.30859  0.00278291  0.00270182
    4       1      1421      1420   1.38647   1.21484  0.00199578  0.00281537
    5       1      1731      1730   1.35132   1.21094  0.00219136  0.00288843
    6       1      2044      2043   1.32985   1.22266   0.0023981  0.00293468
    7       1      2351      2350   1.31116   1.19922  0.00258856  0.00296963
    8       1      2703      2702   1.31911     1.375   0.0224678  0.00295862
    9       1      2955      2954   1.28191  0.984375  0.00841621  0.00304526
   10       1      3228      3227   1.26034   1.06641  0.00261023  0.00309665
   11       1      3501      3500    1.2427   1.06641  0.00659853  0.00313985
   12       1      3791      3790   1.23353   1.13281   0.0027244  0.00316168
   13       1      4150      4149   1.24649   1.40234  0.00262242  0.00313177
   14       1      4460      4459   1.24394   1.21094  0.00262075  0.00313735
   15       1      4721      4720   1.22897   1.01953  0.00239961  0.00317357
   16       1      4983      4982   1.21611   1.02344  0.00290526  0.00321005
   17       1      5279      5278   1.21258   1.15625  0.00252002   0.0032196
   18       1      5605      5604   1.21595   1.27344  0.00281887  0.00320714

Replication 1x:
rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30475
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       1       687       686   2.67919   2.67969  0.00107786  0.00145603
    2       1      1431      1430   2.79249   2.90625  0.00115586  0.00139696
    3       0      1900      1900   2.47359   1.83594  0.00155059  0.00157747
    4       1      2460      2459   2.40102   2.18359  0.00146014  0.00162519
    5       1      3021      3020   2.35906   2.19141  0.00169354   0.0016538
    6       1      3466      3465   2.25554   1.73828  0.00161923  0.00173006
    7       1      3977      3976   2.21843   1.99609  0.00168673  0.00175753
    8       1      4540      4539   2.21599   2.19922   0.0014385  0.00175701
    9       1      5008      5007   2.17287   1.82812  0.00133871  0.00179582
   10       1      5583      5582   2.18016   2.24609 0.000924145  0.00179001
   11       1      6072      6071   2.15559   1.91016  0.00269532  0.00180909
   12       1      6647      6646    2.1631   2.24609  0.00161325  0.00180369
   13       1      7238      7237   2.17427   2.30859  0.00140135  0.00179482
   14       1      7780      7779   2.17017   2.11719  0.00126382  0.00179823
   15       1      8328      8327   2.16819   2.14062  0.00133369  0.00179995
   16       1      8837      8836   2.15693   1.98828  0.00752524  0.00180926
   17       1      9441      9440   2.16882   2.35938  0.00079956  0.00179942
   18       1     10049     10048   2.18025     2.375  0.00129267  0.00179002
   19       1     10550     10549   2.16849   1.95703  0.00167502  0.00179901

So yes, as expected, turning off replication doubles your speed. This perfectly 
demonstrates the effects of latency serialisation.






I think there must be something wrong, no?

No, not really, that is an average number of IOPS for a Ceph benchmark. With 
very fast CPUs, NVMe and tuning, you can maybe double it though.

It would be a dream to get 30-40 MByte/s at 4K with this SSD in the rados 
bench test.

Sorry, not going to happen, there is too much latency involved to even get 
anywhere near that figure. Maybe when persistent RBD caching is available you 
might be able to hit this.

And even getting the pure XFS performance of 19.3 MByte/s with rados would be 
very good.

Still way off. You would need write latency of around 200us to be able to reach 
this figure; you currently have something near 2000us. You are talking about a 
10x improvement.
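To put rough numbers on that: at QD=1 the achievable IOPS is roughly 1 / (per-IO 
latency), so 19.3 MB/s / 4 KB ≈ 4,800 IOPS ≈ 210us per write, while 2-3 MB/s / 
4 KB ≈ 600-750 IOPS ≈ 1,300-1,600us per write. That is where the ~10x gap comes from.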

2000us = 2ms

But my ping latency (from mon-1 to mon-2 and from mon-1 to mon-3; in this SSD 
test the mon servers are also the Ceph data servers) is:
root@ceph-mon-1:~# ping ceph-mon-2
PING ceph-mon-2 (10.250.250.6) 56(84) bytes of data.
64 bytes from ceph-mon-2 (10.250.250.6): icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from ceph-mon-2 (10.250.250.6): icmp_seq=2 ttl=64 time=0.205 ms
64 bytes from ceph-mon-2 (10.250.250.6): icmp_seq=3 ttl=64 time=0.214 ms
64 bytes from ceph-mon-2 (10.250.250.6): icmp_seq=4 ttl=64 time=0.212 ms
^C
--- ceph-mon-2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.151/0.195/0.214/0.029 ms
root@ceph-mon-1:~# ping ceph-mon-3
PING ceph-mon-3 (10.250.250.7) 56(84) bytes of data.
64 bytes from ceph-mon-3 (10.250.250.7): icmp_seq=1 ttl=64 time=0.198 ms
64 bytes from ceph-mon-3 (10.250.250.7): icmp_seq=2 ttl=64 time=0.203 ms
64 bytes from ceph-mon-3 (10.250.250.7): icmp_seq=3 ttl=64 time=0.220 ms





I use replication = 3, so even if the performance were only half of 19.3 MByte/s 
that would be OK. But why is it so slow?

Think about it, every IO has to:

1.      Travel across the network to the 1st Ceph node

2.      Be processed by Ceph 

3.      Then travel to the 2nd and 3rd nodes

4.      Be processed by them

5.      Then an Ack travels back to the 1st node

6.      Before the ack finally travels back to the client

Your 10G latency is probably somewhere around 100us, so a minimum of 200us 
before you even take into account the processing Ceph has to do.


Okay, so we see around 200us before the IO is even processed by Ceph, that's 
correct. But why is Ceph so slow at processing?
In the rados bench you see an avg latency of 0.00162s = 1620us; minus the 200us 
of 10G latency, that leaves ~1420us of processing time inside Ceph??

 

 

Ok, so your ping hops are actually nearer 200us, so ~400us of just network 
latency when doing replication. 
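(If you want a firmer number than a handful of default pings, something along 
the lines of "ping -c 1000 -i 0.01 -q ceph-mon-2", run as root since sub-200ms 
intervals need it, gives a steadier min/avg/max for small packets.)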

 

Yes, Ceph does add some latency, and what you are seeing is about right for an 
untuned cluster, but as I mentioned, with fast CPUs and some tuning you can 
get this down to about 500us. But either way you will struggle to get much 
above 1500 IOPS at QD=1 with Ceph. That, at the moment, is the ceiling.




Now compare this to a write IO going across a SAS bus; the whole operation will 
probably take less than ~10-20us.

I cannot understand where the problem is. Do I need to set additional parameters 
in my ceph.conf? Everything is nearly default on my test cluster.

Try:

1.      Setting your max CPU cstate to 1

2.      Setting your CPU frequency to Max

3.      Setting all debug logging to 0 in Ceph

This might give you a 25-50% boost if you are lucky.
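A rough sketch of what those three steps can look like on a typical Linux OSD node; treat it as an illustration rather than a recipe, since the exact kernel parameters, sysfs paths and Ceph debug options vary by distro, CPU and Ceph version:

# 1. Cap C-states at C1, e.g. via kernel boot parameters (then reboot):
#    intel_idle.max_cstate=1 processor.max_cstate=1
# 2. Pin the CPU frequency governor to performance:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# 3. Turn off a subset of Ceph debug logging at runtime (repeat for the mons):
ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0 --debug_filestore 0/0 --debug_journal 0/0'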

Full 10 GBit/s network.

 

[global]
fsid = 38b29077-cac1-4c2d-9edc-78c3a4384ea3
mon_initial_members = ceph-mon-1, ceph-mon-2, ceph-mon-3
mon_host = 10.124.123.1,10.124.123.2,10.124.123.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.250.250.0/24
cluster network = 10.250.249.0/24
osd pool default size = 3
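If the debug-logging change from the list above helps, the persistent form would be extra lines under [global] in ceph.conf; an illustrative subset of the debug options:

debug_lockdep = 0/0
debug_ms = 0/0
debug_auth = 0/0
debug_monc = 0/0
debug_osd = 0/0
debug_filestore = 0/0
debug_journal = 0/0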

 

Best Regards!

 

 

On 22.08.16 at 21:38, Nick Fisk wrote:

 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 22 August 2016 20:30
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 


On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
I understand all of your technical improvements.
But: why do you not simply use, for example, an Areca RAID controller with 8 GB 
cache and BBU on top in every Ceph node?
Configure n times RAID 0 on the controller and enable write-back cache.
That must be a latency "killer", like in all the proprietary storage arrays, or not??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal + data), so once the caches on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. A Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which sends their latency sky high. Overall we are functioning 
fine, but I sure would like storage vMotion and other large operations to be faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.
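For what it's worth, both of those knobs are settable: the XFS extent size hint can be applied per directory with xfs_io (new files inherit it), and RBD striping is chosen at image-creation time. A rough sketch, with the names and sizes purely illustrative:

# extent size hint on the NFS-exported directory
xfs_io -c "extsize 16m" /mnt/nfs-datastore

# RBD image with fancy striping (needs image format 2; values are examples)
rbd create datastore-img --size 2048000 --image-format 2 --stripe-unit 65536 --stripe-count 16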

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Any chance thin vs. thick difference could be related to discards?  I saw 
zillions of them in recent testing.

 

 

I was using FILEIO and so discards weren't working for me. I know fragmentation 
was definitely the cause of the small reads. The VMFS metadata I'm less sure 
of, but it seemed the most likely cause as it only affected write performance 
the first time round.

 

 

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk <n...@fisk.me.uk> wrote:

>> -----Original Message-----
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>> -----Original Message-----
>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay, and what is your plan now to speed things up?
>>>
>>> Now that I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>>> implemented, as that moves the SSD/NVMe closer to the client. But
>>> I'm happy with what I can achieve at the moment. You
>>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware does everything with lots of unbuffered small io's. Eg when you 
> migrate a VM or as thin vmdk's grow.
>
> Even with storage vmotions, which might kick off 32 threads, as they all roughly 
> fall on the same PG there still appears to be a bottleneck with contention 
> on the PG itself.
>
> These were the sort of things I was trying to optimise for, to make the time 
> spent in Ceph as minimal as possible for each IO.
>
> So onto the hardware. Through reading various threads and experiments on my 
> own I came to the following conclusions.
>
> -You need the highest possible frequency on the CPU cores, which normally also 
> means fewer of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVMe's for journals to minimise latency
>
> The end result was OSD nodes based off a 3.5GHz Xeon E3v5 with an Intel 
> P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T 
> onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. Actually 
> this design, as well as being very performant for Ceph, also works out very 
> cheap as you are using low-end server parts. The whole lot + 12x 7.2k disks 
> all goes into a 1U case.
>
> During testing I noticed that by default c-states and p-states slaughter 
> performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> around 1600IOPs, this is at QD=1.
>
> Few other observations:
> 1. Power usage is around 150-200W for this config with 12x7.2k disks
> 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> for more disks.
> 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> 4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
> you could maybe expect ~10000iops per node, before maxing out CPU's
> 5. Single NVME seems to be able to journal 12 disks with no problem during 
> normal operation, no doubt a specific benchmark could max it out though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing 
> returns
>
> Hope that answers all your questions.
> Nick
>
>>
>> Thank you,
>> Alex
>>
>>>
>>>>
>>>> Would it help to put in multiple P3700 per OSD Node to improve performance 
>>>> for a single Thread (example Storage VMotion) ?
>>>
>>> Most likely not, it's all the other parts of the puzzle which are causing 
>>> the latency. ESXi was designed for storage arrays that service
>>> IO's in the 100us-1ms range; Ceph is probably about 10x slower than this, hence 
>>> the problem. Disable the BBWC on a RAID controller or
>>> SAN and you will see the same behaviour.
>>>
>>>>
>>>> Regards
>>>>
>>>>
>>>> On 21.07.16 at 14:17, Nick Fisk wrote:
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>> Behalf Of w...@globe.de
>>>>>> Sent: 21 July 2016 13:04
>>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
>>>>>> Performance
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Hmm, I think 200 MByte/s is really bad. Is your cluster in production 
>>>>>> right now?
>>>>> It's just been built, not running yet.
>>>>>
>>>>>> So if you start a storage migration you get only 200 MByte/s right?
>>>>> I wish. My current cluster (not this new one) would storage migrate
>>>>> at ~10-15MB/s. Serial latency is the problem: without being able to
>>>>> buffer, ESXi waits on an ack for each IO before sending the next.
>>>>> Also it submits the migrations in 64kb chunks, unless you get VAAI
>>>>> working. I think ESXi will try and do them in parallel, which will help as
>>>>> well.
>>>>>
>>>>>> I think it would be awesome if you get 1000 MByte/s
>>>>>>
>>>>>> Where is the Bottleneck?
>>>>> Latency serialisation: without a buffer, you can't drive the
>>>>> devices to 100%. With buffered IO (or high queue depths) I can max out 
>>>>> the journals.
>>>>>
>>>>>> A fio test from Sebastien Han gives us 400 MByte/s raw performance from 
>>>>>> the P3700.
>>>>>>
>>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>>
>>>>>> How could it be that the rbd client performance is 50% slower?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>>> On 21.07.16 at 12:15, Nick Fisk wrote:
>>>>>>> I've had a lot of pain with this, smaller block sizes are even worse.
>>>>>>> You want to try and minimize latency at every point as there is
>>>>>>> no buffering happening in the iSCSI stack. This means:-
>>>>>>>
>>>>>>> 1. Fast journals (NVME or NVRAM)
>>>>>>> 2. 10GB or better networking
>>>>>>> 3. Fast CPU's (Ghz)
>>>>>>> 4. Fix CPU c-state's to C1
>>>>>>> 5. Fix CPU's Freq to max
>>>>>>>
>>>>>>> Also I can't be sure, but I think there is a metadata update
>>>>>>> happening with VMFS, particularly if you are using thin VMDK's,
>>>>>>> this can also be a major bottleneck. For my use case, I've
>>>>>>> switched over to NFS as it has given much more performance at
>>>>>>> scale and less headache.
>>>>>>>
>>>>>>> For the RADOS Run, here you go (400GB P3700):
>>>>>>>
>>>>>>> Total time run:         60.026491
>>>>>>> Total writes made:      3104
>>>>>>> Write size:             4194304
>>>>>>> Object size:            4194304
>>>>>>> Bandwidth (MB/sec):     206.842
>>>>>>> Stddev Bandwidth:       8.10412
>>>>>>> Max bandwidth (MB/sec): 224
>>>>>>> Min bandwidth (MB/sec): 180
>>>>>>> Average IOPS:           51
>>>>>>> Stddev IOPS:            2
>>>>>>> Max IOPS:               56
>>>>>>> Min IOPS:               45
>>>>>>> Average Latency(s):     0.0193366
>>>>>>> Stddev Latency(s):      0.00148039
>>>>>>> Max latency(s):         0.0377946
>>>>>>> Min latency(s):         0.015909
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>>>> Behalf Of Horace
>>>>>>>> Sent: 21 July 2016 10:26
>>>>>>>> To: w...@globe.de
>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
>>>>>>>> Performance
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Same here, I've read some blog saying that vmware will
>>>>>>>> frequently verify the locking on VMFS over iSCSI, hence it will have 
>>>>>>>> much slower performance than NFS (which uses a different
>>>>>>>> locking mechanism).
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Horace Ng
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: w...@globe.de
>>>>>>>> To: ceph-users@lists.ceph.com
>>>>>>>> Sent: Thursday, July 21, 2016 5:11:21 PM
>>>>>>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> we are seeing relatively slow single-thread performance on the
>>>>>>>> iSCSI nodes of our cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> Our setup:
>>>>>>>>
>>>>>>>> 3 Racks:
>>>>>>>>
>>>>>>>> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache 
>>>>>>>> off).
>>>>>>>>
>>>>>>>> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and
>>>>>>>> 6x WD Red 1TB per Data Node as OSD.
>>>>>>>>
>>>>>>>> Replication = 3
>>>>>>>>
>>>>>>>> chooseleaf = 3 type Rack in the crush map
>>>>>>>>
>>>>>>>>
>>>>>>>> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
>>>>>>>>
>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>
>>>>>>>>
>>>>>>>> If we test with:
>>>>>>>>
>>>>>>>> rados bench -p rbd 60 write -b 4M -t 32
>>>>>>>>
>>>>>>>> we get ca. 600 - 700 MByte/s
>>>>>>>>
>>>>>>>>
>>>>>>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe
>>>>>>>> NVMe for the journal to get better single-thread performance.
>>>>>>>>
>>>>>>>> Is there anyone out there who has an Intel P3700 as a journal and
>>>>>>>> can share test results for:
>>>>>>>>
>>>>>>>>
>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you very much !!
>>>>>>>>
>>>>>>>> Kind Regards !!
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Alex Gorbachev

Storcium

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
