Hi Nick,

Interested in this comment - "-Dual sockets are probably bad and will impact performance."
Have you got real-world experience of this being the case?

Thanks - B

On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -----Original Message-----
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> >> -----Original Message-----
>> >> From: w...@globe.de [mailto:w...@globe.de]
>> >> Sent: 21 July 2016 13:23
>> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Okay, and what is your plan now to speed things up?
>> >
>> > Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
>>
>> Reviving this thread: would you be willing to share the details of the low-latency hardware design? Are you optimizing for NFS or iSCSI?
>
> Both, really; I am just trying to get the write latency as low as possible. As you know, VMware does everything with lots of unbuffered small IOs, e.g. when you migrate a VM or as thin VMDKs grow.
>
> Even with storage vMotions, which might kick off 32 threads, there still appears to be a bottleneck from contention on the PG itself, since the IOs all roughly fall on the same PG.
>
> These were the sort of things I was trying to optimise for, to make the time each IO spends in Ceph as short as possible.
>
> So, on to the hardware. Through reading various threads and experiments of my own, I came to the following conclusions:
>
> -You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVMe's for journals to minimise latency.
>
> The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an Intel P3700 for a journal. I used the Supermicro X11SSH-CTF board, which has 10G-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well as being very performant for Ceph, this design also works out very cheap, as it uses low-end server parts. The whole lot, plus 12x 7.2k disks, goes into a 1U case.
>
> During testing I noticed that by default, c-states and p-states slaughter performance. After forcing the max c-state to 1 and forcing the CPU frequency up to max, I was seeing 600us latency for a 4kB write to a 3x-replica pool, or around 1600 IOPS, at QD=1.
>
> A few other observations:
> 1. Power usage is around 150-200W for this config with 12x 7.2k disks.
> 2. CPU usage when maxing out the disks is only around 10-15%, so there is plenty of headroom for more disks.
> 3. Note for the above: don't include iowait when looking at CPU usage.
> 4. No idea about CPU load for pure-SSD nodes, but based on the current disks you could maybe expect ~10000 IOPS per node before maxing out the CPUs.
> 5. A single NVMe seems to be able to journal 12 disks with no problem during normal operation; no doubt a specific benchmark could max it out, though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns.
>
> Hope that answers all your questions.
> Nick
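For reference, the c-state and frequency pinning described above can be done along these lines on Linux. This is a sketch, not Nick's exact configuration; it assumes an Intel CPU, root privileges, and the kernel's cpupower tool, and c-state numbering varies by machine:

    # Pin the scaling governor to "performance" so cores run at max frequency
    cpupower frequency-set -g performance

    # Disable idle states deeper than state 1 at runtime
    # (check /sys/devices/system/cpu/cpu0/cpuidle/state*/name first;
    # which sysfs state maps to which C-state differs between CPUs)
    for s in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
        echo 1 > "$s"
    done

    # Or make the C-state cap persistent via kernel boot parameters:
    #   intel_idle.max_cstate=1 processor.max_cstate=1

The 600us figure also explains the IOPS number: at one outstanding IO, 1 s / 0.0006 s per write gives roughly the 1600 IOPS quoted above.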
>>
>> Thank you,
>> Alex
>>
>> >
>> >>
>> >> Would it help to put multiple P3700s per OSD node to improve performance for a single thread (for example, Storage vMotion)?
>> >
>> > Most likely not; it's all the other parts of the puzzle that are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>> >
>> >> Regards
>> >>
>> >> On 21.07.16 at 14:17, Nick Fisk wrote:
>> >> >> -----Original Message-----
>> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
>> >> >> Sent: 21 July 2016 13:04
>> >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> >> Cc: ceph-users@lists.ceph.com
>> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
>> >> >
>> >> > It's just been built, not running yet.
>> >> >
>> >> >> So if you start a storage migration, you get only 200 MByte/s, right?
>> >> >
>> >> > I wish. My current cluster (not this new one) would storage-migrate at ~10-15MB/s. Serial latency is the problem: without being able to buffer, ESXi waits on an ack for each IO before sending the next. It also submits the migrations in 64kB chunks unless you get VAAI working; I think ESXi will then try to do them in parallel, which will help as well.
>> >> >
>> >> >> I think it would be awesome if you got 1000 MByte/s.
>> >> >>
>> >> >> Where is the bottleneck?
>> >> >
>> >> > Latency serialisation: without a buffer, you can't drive the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
>> >> >
>> >> >> A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the P3700:
>> >> >>
>> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>> >> >>
>> >> >> How could it be that the RBD client performance is 50% slower?
>> >> >>
>> >> >> Regards
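The test in the linked article measures what matters for a Ceph journal: small synchronous writes at queue depth 1. A sketch of that style of run is below; the device path is a placeholder, and fio writes raw data to the device, so it is destructive to anything stored on it:

    # QD=1 sync-write test, similar in spirit to the article's journal test
    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test

The --sync=1 flag (O_SYNC writes) is the important part: journal traffic is synchronous, which is why drives that look fast under buffered benchmarks can still be unsuitable as journals.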
>> >> >>
>> >> >> On 21.07.16 at 12:15, Nick Fisk wrote:
>> >> >>> I've had a lot of pain with this; smaller block sizes are even worse. You want to try to minimise latency at every point, as there is no buffering happening in the iSCSI stack. This means:
>> >> >>>
>> >> >>> 1. Fast journals (NVMe or NVRAM)
>> >> >>> 2. 10Gb or better networking
>> >> >>> 3. Fast CPUs (GHz)
>> >> >>> 4. Fix CPU c-states to C1
>> >> >>> 5. Fix CPU frequency to max
>> >> >>>
>> >> >>> Also, I can't be sure, but I think there is a metadata update happening with VMFS, particularly if you are using thin VMDKs; this can also be a major bottleneck. For my use case, I've switched over to NFS, as it has given much more performance at scale and less headache.
>> >> >>>
>> >> >>> For the RADOS run, here you go (400GB P3700):
>> >> >>>
>> >> >>> Total time run:         60.026491
>> >> >>> Total writes made:      3104
>> >> >>> Write size:             4194304
>> >> >>> Object size:            4194304
>> >> >>> Bandwidth (MB/sec):     206.842
>> >> >>> Stddev Bandwidth:       8.10412
>> >> >>> Max bandwidth (MB/sec): 224
>> >> >>> Min bandwidth (MB/sec): 180
>> >> >>> Average IOPS:           51
>> >> >>> Stddev IOPS:            2
>> >> >>> Max IOPS:               56
>> >> >>> Min IOPS:               45
>> >> >>> Average Latency(s):     0.0193366
>> >> >>> Stddev Latency(s):      0.00148039
>> >> >>> Max latency(s):         0.0377946
>> >> >>> Min latency(s):         0.015909
>> >> >>>
>> >> >>> Nick
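A quick sanity check on those numbers: average IOPS x average latency = 51 x 0.0193366 ≈ 1 outstanding IO, so this is indeed a queue-depth-1 run, and throughput is bound entirely by per-IO latency:

    bandwidth ≈ object size / average latency
              = 4 MiB / 0.0193366 s
              ≈ 207 MB/s

which matches the reported 206.842 MB/s. In other words, the P3700 itself is not the limit here; any round-trip latency shaved off the path shows up directly as single-thread bandwidth.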
>> >> >>>
>> >> >>>> -----Original Message-----
>> >> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
>> >> >>>> Sent: 21 July 2016 10:26
>> >> >>>> To: w...@globe.de
>> >> >>>> Cc: ceph-users@lists.ceph.com
>> >> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >> >>>>
>> >> >>>> Hi,
>> >> >>>>
>> >> >>>> Same here. I've read a blog saying that VMware will frequently verify the locking on VMFS over iSCSI, hence it has much slower performance than NFS (which uses a different locking mechanism).
>> >> >>>>
>> >> >>>> Regards,
>> >> >>>> Horace Ng
>> >> >>>>
>> >> >>>> ----- Original Message -----
>> >> >>>> From: w...@globe.de
>> >> >>>> To: ceph-users@lists.ceph.com
>> >> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
>> >> >>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>> >> >>>>
>> >> >>>> Hi everyone,
>> >> >>>>
>> >> >>>> We see relatively slow single-thread performance on the iSCSI nodes of our cluster.
>> >> >>>>
>> >> >>>> Our setup:
>> >> >>>>
>> >> >>>> 3 racks: 18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off).
>> >> >>>>
>> >> >>>> 2x Samsung SM863 enterprise SSDs for the journal (3 OSDs per SSD) and 6x WD Red 1TB per data node as OSDs.
>> >> >>>>
>> >> >>>> Replication = 3
>> >> >>>>
>> >> >>>> chooseleaf = 3 type rack in the CRUSH map
>> >> >>>>
>> >> >>>> We get only ca. 90 MByte/s on the iSCSI gateway servers with:
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
>> >> >>>>
>> >> >>>> If we test with:
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 32
>> >> >>>>
>> >> >>>> we get ca. 600-700 MByte/s.
>> >> >>>>
>> >> >>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe's for the journal, to get better single-thread performance.
>> >> >>>>
>> >> >>>> Is there anyone out there who has an Intel P3700 as a journal and can share test results for:
>> >> >>>>
>> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
>> >> >>>>
>> >> >>>> Thank you very much!!
>> >> >>>>
>> >> >>>> Kind regards!!

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com