Hi Nick

Interested in this comment - "-Dual sockets are probably bad and will
impact performance."

Have you got real-world experience of this being the case?

Thanks - B

On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -----Original Message-----
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users 
>> <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> >> -----Original Message-----
>> >> From: w...@globe.de [mailto:w...@globe.de]
>> >> Sent: 21 July 2016 13:23
>> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Okay and what is your plan now to speed up ?
>> >
>> > Now that I have come up with a lower-latency hardware design, there is not
>> > much further improvement to be had until persistent RBD caching is
>> > implemented, as that will move the SSD/NVMe closer to the client. But I'm
>> > happy with what I can achieve at the moment. You could also experiment with
>> > bcache on top of the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible. As you 
> know, VMware does everything with lots of unbuffered small IOs, e.g. when you 
> migrate a VM or as thin VMDKs grow.
>
> Even with storage vMotions, which might kick off 32 threads, there still 
> appears to be a bottleneck: as the IOs all roughly fall on the same PG, there 
> is contention on the PG itself.
>
> These were the sort of things I was trying to optimise for, to make the time 
> spent in Ceph as minimal as possible for each IO.
>
> So, on to the hardware. Through reading various threads and experimenting on 
> my own, I came to the following conclusions:
>
> -You need the highest possible frequency on the CPU cores, which normally 
> also means fewer of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVMe's for journals to minimise latency.
>
> The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an Intel 
> P3700 for the journal. I used the SuperMicro X11SSH-CTF board, which has 
> 10G-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are 
> required. As well as being very performant for Ceph, this design also works 
> out very cheap, as you are using low-end server parts. The whole lot + 
> 12x 7.2k disks all goes into a 1U case.
>
> During testing I noticed that, by default, c-states and p-states slaughter 
> performance. After forcing the max c-state to 1 and forcing the CPU frequency 
> up to max, I was seeing 600us latency for a 4kb write to a 3x replica pool, 
> or around 1600 IOPS, at QD=1.
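A quick sanity check on those figures. The exact tuning commands are an assumption (an Intel CPU with `cpupower` available), shown as comments only; the arithmetic just confirms the latency and IOPS quoted above are consistent:

```shell
# Tuning along these lines (assumption: Intel CPU, cpupower installed):
#   kernel cmdline: intel_idle.max_cstate=1 processor.max_cstate=1
#   cpupower frequency-set -g performance
# Sanity check: at QD=1, IOPS is roughly 1 / per-IO latency.
lat_us=600
iops=$(( 1000000 / lat_us ))
echo "$iops"   # ~1666, matching the ~1600 IOPS figure above
```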
>
> A few other observations:
> 1. Power usage is around 150-200W for this config with 12x 7.2k disks.
> 2. CPU usage when maxing out the disks is only around 10-15%, so plenty of 
> headroom for more disks.
> 3. NOTE FOR ABOVE: don't include iowait when looking at CPU usage.
> 4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
> you could maybe expect ~10,000 IOPS per node before maxing out the CPUs.
> 5. A single NVMe seems to be able to journal 12 disks with no problem during 
> normal operation; no doubt a specific benchmark could max it out, though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing 
> returns.
>
> Hope that answers all your questions.
> Nick
>
>>
>> Thank you,
>> Alex
>>
>> >
>> >>
>> >> Would it help to put in multiple P3700 per OSD Node to improve 
>> >> performance for a single Thread (example Storage VMotion) ?
>> >
>> > Most likely not; it's all the other parts of the puzzle which are causing 
>> > the latency. ESXi was designed for storage arrays that service IOs in the 
>> > 100us-1ms range; Ceph is probably about 10x slower than this, hence the 
>> > problem. Disable the BBWC on a RAID controller or SAN and you will see 
>> > the same behaviour.
>> >
>> >>
>> >> Regards
>> >>
>> >>
>> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
>> >> >> -----Original Message-----
>> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >> >> Behalf Of w...@globe.de
>> >> >> Sent: 21 July 2016 13:04
>> >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> >> Cc: ceph-users@lists.ceph.com
>> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
>> >> >> Performance
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production 
>> >> >> right now?
>> >> > It's just been built, not running yet.
>> >> >
>> >> >> So if you start a storage migration, you only get 200 MByte/s, right?
>> >> > I wish. My current cluster (not this new one) would storage-migrate
>> >> > at ~10-15MB/s. Serial latency is the problem: without being able to
>> >> > buffer, ESXi waits on an ack for each IO before sending the next.
>> >> > Also, it submits the migrations in 64kb chunks unless you get VAAI
>> >> > working. I think ESXi will try and do them in parallel, which will
>> >> > help as well.
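The 64kb serial behaviour accounts for that throughput. A back-of-envelope check (the ~4ms round-trip latency is an illustrative assumption, picked to match the observed range):

```shell
# One outstanding 64 KB IO at a time, ~4 ms round trip (illustrative figure):
chunk_kb=64
lat_ms=4
kbps=$(( chunk_kb * 1000 / lat_ms ))   # KB transferred per second
mbps=$(( kbps / 1024 ))
echo "$mbps"   # ~15 MB/s, in line with the 10-15MB/s quoted above
```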
>> >> >
>> >> >> I think it would be awesome if you get 1000 MByte/s
>> >> >>
>> >> >> Where is the Bottleneck?
>> >> > Latency serialisation: without a buffer, you can't drive the devices
>> >> > to 100%. With buffered IO (or high queue depths) I can max out the
>> >> > journals.
>> >> >
>> >> >> A FIO test from Sebastien Han gives us 400 MByte/s raw performance 
>> >> >> from the P3700.
>> >> >>
>> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-y
>> >> >> our -ssd-is-suitable-as-a-journal-device/
>> >> >>
>> >> >> How could it be that the RBD client performance is 50% slower?
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >>
>> >> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
>> >> >>> I've had a lot of pain with this, smaller block sizes are even worse.
>> >> >>> You want to try and minimize latency at every point as there is
>> >> >>> no buffering happening in the iSCSI stack. This means:-
>> >> >>>
>> >> >>> 1. Fast journals (NVMe or NVRAM)
>> >> >>> 2. 10GbE or better networking
>> >> >>> 3. Fast CPUs (GHz)
>> >> >>> 4. Pin CPU c-states to C1
>> >> >>> 5. Fix the CPU frequency to max
>> >> >>>
>> >> >>> Also, I can't be sure, but I think there is a metadata update
>> >> >>> happening with VMFS, particularly if you are using thin VMDKs, and
>> >> >>> this can also be a major bottleneck. For my use case, I've
>> >> >>> switched over to NFS, as it has given much more performance at
>> >> >>> scale and less headache.
>> >> >>>
>> >> >>> For the RADOS Run, here you go (400GB P3700):
>> >> >>>
>> >> >>> Total time run:         60.026491
>> >> >>> Total writes made:      3104
>> >> >>> Write size:             4194304
>> >> >>> Object size:            4194304
>> >> >>> Bandwidth (MB/sec):     206.842
>> >> >>> Stddev Bandwidth:       8.10412
>> >> >>> Max bandwidth (MB/sec): 224
>> >> >>> Min bandwidth (MB/sec): 180
>> >> >>> Average IOPS:           51
>> >> >>> Stddev IOPS:            2
>> >> >>> Max IOPS:               56
>> >> >>> Min IOPS:               45
>> >> >>> Average Latency(s):     0.0193366
>> >> >>> Stddev Latency(s):      0.00148039
>> >> >>> Max latency(s):         0.0377946
>> >> >>> Min latency(s):         0.015909
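Those bench figures hang together; a quick consistency check of the totals against the reported averages:

```shell
# From the rados bench output above: 3104 writes of 4 MB objects in 60 s.
writes=3104; secs=60; obj_mb=4
iops=$(( writes / secs ))           # average IOPS
bw=$(( writes * obj_mb / secs ))    # bandwidth in MB/s
echo "$iops $bw"   # 51 IOPS and 206 MB/s, matching the reported averages
```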
>> >> >>>
>> >> >>> Nick
>> >> >>>
>> >> >>>> -----Original Message-----
>> >> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >> >>>> Behalf Of Horace
>> >> >>>> Sent: 21 July 2016 10:26
>> >> >>>> To: w...@globe.de
>> >> >>>> Cc: ceph-users@lists.ceph.com
>> >> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
>> >> >>>> Performance
>> >> >>>>
>> >> >>>> Hi,
>> >> >>>>
>> >> >>>> Same here. I've read a blog saying that VMware will frequently
>> >> >>>> verify the locking on VMFS over iSCSI, hence it will have much
>> >> >>>> slower performance than NFS (which uses a different locking
>> >> >>>> mechanism).
>> >> >>>>
>> >> >>>> Regards,
>> >> >>>> Horace Ng
>> >> >>>>
>> >> >>>> ----- Original Message -----
>> >> >>>> From: w...@globe.de
>> >> >>>> To: ceph-users@lists.ceph.com
>> >> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
>> >> >>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>> >> >>>>
>> >> >>>> Hi everyone,
>> >> >>>>
>> >> >>>> we are seeing relatively slow single-thread performance on the 
>> >> >>>> iSCSI nodes of our cluster.
>> >> >>>>
>> >> >>>>
>> >> >>>> Our setup:
>> >> >>>>
>> >> >>>> 3 Racks:
>> >> >>>>
>> >> >>>> 18x data nodes, 3x mon nodes, 3x iSCSI gateway nodes with tgt (rbd 
>> >> >>>> cache off).
>> >> >>>>
>> >> >>>> 2x Samsung SM863 enterprise SSDs for journals (3 OSDs per SSD) and
>> >> >>>> 6x WD Red 1TB per data node as OSDs.
>> >> >>>>
>> >> >>>> Replication = 3
>> >> >>>>
>> >> >>>> chooseleaf type rack in the CRUSH map (one replica per rack)
>> >> >>>>
>> >> >>>>
>> >> >>>> We get only ca. 90 MByte/s on the iSCSI gateway servers with:
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
>> >> >>>>
>> >> >>>>
>> >> >>>> If we test with:
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 32
>> >> >>>>
>> >> >>>> we get ca. 600 - 700 MByte/s
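The gap between the two runs can be reasoned about from the single-thread number (the derived per-op latency is an inference from these figures, not a measurement):

```shell
# QD=1: 90 MB/s with 4 MB objects implies a per-op service time:
bw=90; obj=4
ops=$(( bw / obj ))          # ops/sec at a queue depth of 1
lat_ms=$(( 1000 / ops ))     # roughly 45 ms per 4 MB write
echo "$ops $lat_ms"
# At -t 32 the same cluster reaches 600-700 MB/s, i.e. only ~7x the
# single-thread rate, so scaling is sub-linear as the backend saturates.
```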
>> >> >>>>
>> >> >>>>
>> >> >>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe
>> >> >>>> NVMe's for the journal to get better single-thread performance.
>> >> >>>>
>> >> >>>> Is there anyone out there who has an Intel P3700 for the journal
>> >> >>>> and can share test results with:
>> >> >>>>
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
>> >> >>>>
>> >> >>>>
>> >> >>>> Thank you very much !!
>> >> >>>>
>> >> >>>> Kind Regards !!
>> >> >>>>
>> >> >>>> _______________________________________________
>> >> >>>> ceph-users mailing list
>> >> >>>> ceph-users@lists.ceph.com
>> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>
