Hello,

On Mon, 22 Aug 2016 20:34:54 +0100 Nick Fisk wrote:

> > -----Original Message-----
> > From: Christian Balzer [mailto:ch...@gol.com]
> > Sent: 22 August 2016 03:00
> > To: 'ceph-users' <ceph-users@lists.ceph.com>
> > Cc: Nick Fisk <n...@fisk.me.uk>
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> > 
> > >
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 21 August 2016 09:32
> > > > To: ceph-users <ceph-users@lists.ceph.com>
> > > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > > >
> > > > > Hi Nick
> > > > >
> > > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > > will impact performance."
> > > > >
> > > > > Have you got real world experience of this being the case?
> > > > >
> > > > Well, Nick wrote "probably".
> > > >
> > > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > > and share information certainly can impact things that are very
> > > > time critical.
> > > > How much though is a question of design, both HW and SW.
> > >
> > > There was a guy from Redhat (sorry his name escapes me now) a few
> > > months ago on the performance weekly meeting. He was analysing the CPU
> > > cache miss effects with Ceph and it looked like a NUMA setup was
> > > having quite a severe impact on some things. To be honest a lot of it
> > > went over my head, but I came away from it with a general feeling that
> > > if you can get the required performance from 1 socket, then that is 
> > > probably a better bet. This includes only populating a single
> > > socket in a dual socket system. There was also a Ceph tech talk at the
> > > start of the year (High perf databases on Ceph) where the guy
> > > presenting was also recommending only populating 1 socket for latency
> > > reasons.
> > >
> > I wonder how complete their testing was and how much manual tuning they 
> > tried.
> > As in:
> > 
> > 1. Was irqbalance running?
> > Because it and the normal kernel strategies clash beautifully.
> > Irqbalance moves stuff around, the kernel tries to move things close to 
> > where the IRQs are, cat and mouse.
> > 
> > 2. Did they try with manual IRQ pinning?
> > I do, not that it's critical with my Ceph nodes, but on other machines it 
> > can make a LOT of difference.
> > Like keeping the cores reserved for KVM vhost processes near (or at least 
> > on the same NUMA node as) the network IRQs.
> > 
> > 3. Did they try pinning Ceph OSD processes?
> > While this may certainly help (and make things more predictable when the 
> > load gets high), as I said above the kernel normally does a
> > pretty good job of NOT moving things around and keeping processes close to 
> > the resources they need.
> > 
> 
> From what I remember, I think they went to pretty great lengths to tune 
> things. I think one point was that if you have a 40Gb NIC on one socket and 
> an NVMe on another, then no matter where the process runs, you are going to 
> have a lot of traffic crossing between the sockets.

Traffic yes, complete process migrations hopefully not.
But anyway, yes, that's to be expected.

And it is also unavoidable if you want/need to utilize the full capabilities
and PCIe lanes of a dual socket motherboard.
And in some cases (usually not with Ceph/OSDs), the IRQ load really will
benefit from more cores to play with.
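
As an aside, a minimal sketch of how one might check and pin things by hand
(device names, IRQ numbers and core ranges below are purely illustrative and
will differ per system):
---
# Which NUMA node a NIC or NVMe controller hangs off (-1 = no NUMA info)
cat /sys/class/net/eth0/device/numa_node
cat /sys/class/nvme/nvme0/device/numa_node

# Stop irqbalance so manual pinning sticks, then pin a NIC IRQ to local cores
systemctl stop irqbalance
grep eth0 /proc/interrupts                  # find the IRQ numbers first
echo 0-5 > /proc/irq/123/smp_affinity_list  # 123 is a made-up IRQ number

# Keep an OSD (and all its threads) on the same NUMA node, after the fact
taskset -a -p -c 0-5 $(pidof -s ceph-osd)
# or at start time (illustrative; normally done via the init system)
numactl --cpunodebind=0 --membind=0 ceph-osd -i 12
---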

> 
> Here is the DB on Ceph one
> 
> http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

Thanks!
Yeah, basically confirms what I know/said.

> 
> I don't think the recordings are available for the performance meeting one, 
> but it was something to do with certain C++ string functions causing issues 
> with the CPU cache. Honestly, I can't remember much else.
> 
> > > Both of those, coupled with the fact that Xeon E3's are the cheapest way 
> > > to get high clock speeds, sort of made my decision.
> > >
> > Totally agreed, my current HDD node design is based on the single CPU 
> > Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3
> > (3.50GHz) CPU.
> 
> Nice. Any ideas how they compare to the E3's?
> 
Not really, as in direct comparison.
They look good enough on paper and surely perform as advertised.

> > 
> > > >
> > > > We're looking here at a case where he's trying to reduce latency by
> > > > all means and where the actual CPU needs for the HDDs are negligible.
> > > > The idea being that a "Ceph IOPS" stays on one core which is hopefully 
> > > > also not being shared at that time.
> > > >
> > > > If you're looking at full SSD nodes OTOH, a single CPU may very well
> > > > not be able to saturate a sensible number of SSDs per node, so a
> > > > slight penalty but better utilization and overall IOPS with 2 CPUs may
> > > > be the way forward.
> > >
> > > Definitely, as always work out what your requirements are and design 
> > > around them.
> > >
> > On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4x 800GB
> > DC S3610 SSDs I can already saturate all but 2 "cores", with the "right"
> > extreme test cases.
> > Normal load is of course just around 4 (out of 16) "cores".
> 
> Any idea what sort of IOPS that is? Wondering how it lines up against my 
> estimate of 10000 on my single quad core.

For background and HW, see my recent thread titled:
"Better late than never, some XFS versus EXT4 test results"

Basically something like this, on a KRBD mounted EXT4 image:
--- 
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
---

This will give us about 8500 IOPS, using about 1100% (of 1200%) for
Ceph/OS, with only 20% (so next to nothing) in WAIT; the SSDs are only 35%
busy.
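
For anyone wanting to reproduce the utilization numbers above, this is
roughly what I'd watch on the OSD nodes while the fio job runs (sysstat
tools; the 5 second interval is arbitrary):
---
# Per-core CPU usage; compare %usr/%sys against %iowait
mpstat -P ALL 5

# Per-device utilization and latency (%util, await) for journals and OSDs
iostat -x 5
---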

So with this setup and its 15.6GHz of total CPU capacity versus your four
3.5GHz cores at 14GHz, 10000 might be a bit optimistic (though Jewel may
help), but it is definitely in the right ballpark.

Christian
> 
> > 
> > And for the people who like it fast(er) but don't have to deal with VMware 
> > or the likes: instead of forcing the C-state to 1, just setting the
> > governor to "performance" was enough in my case to halve latency (from
> > about 2ms to 1ms).
> 
> Is that also changing the C-state? I'm pretty sure that only affects the 
> frequency.
> 
> > 
> > This still does save some power at times and (as Nick speculated) indeed 
> > allows some cores to use their turbo speeds.
> 
> I did a test on my new boxes and the difference between max power savings and 
> full frequency + cstate=1 was less than 10W.
> 
> > 
> > So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz, 
> > instead of the 3.0GHz baseline for their CPUs.
> > And the less loaded cores don't tend to go below 2.6GHz, as opposed to the 
> > 1.2GHz that the "powersave" governor would default to.
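
For reference, a minimal sketch of the governor tweak described above
(assuming the cpufreq sysfs interface or the cpupower tool is available;
paths and core counts vary per system):
---
# Check the current governor and frequency of core 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# Switch every core to the "performance" governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# or, equivalently, with the cpupower tool
cpupower frequency-set -g performance
---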
> > 
> > Christian
> > 
> > > >
> > > > Christian
> > > >
> > > > > Thanks - B
> > > > >
> > > > > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > > > > >> -----Original Message-----
> > > > > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > > > > >> Sent: 21 August 2016 04:15
> > > > > >> To: Nick Fisk <n...@fisk.me.uk>
> > > > > >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users
> > > > > >> <ceph-users@lists.ceph.com>
> > > > > >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > > > > >> Performance
> > > > > >>
> > > > > >> Hi Nick,
> > > > > >>
> > > > > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > > > > >> >> -----Original Message-----
> > > > > >> >> From: w...@globe.de [mailto:w...@globe.de]
> > > > > >> >> Sent: 21 July 2016 13:23
> > > > > >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> > > > > >> >> Cc: ceph-users@lists.ceph.com
> > > > > >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > > > > >> >> Performance
> > > > > >> >>
> > > > > >> >> Okay, and what is your plan now to speed things up?
> > > > > >> >
> > > > > >> > Now that I have come up with a lower latency hardware design,
> > > > > >> > there is not much further improvement to be had until persistent
> > > > > >> > RBD caching is implemented, as that will be moving the SSD/NVMe
> > > > > >> > closer to the client. But I'm happy with what I can achieve at
> > > > > >> > the moment. You could also experiment with bcache on the RBD.
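
In case anyone wants to try the bcache idea, a rough sketch with bcache-tools
(device names are placeholders; bcache on top of krbd is experimental
territory, so test on scratch data first):
---
# Format an NVMe partition as a cache device and the (mapped) RBD as backing
make-bcache -C /dev/nvme0n1p1
make-bcache -B /dev/rbd0

# Attach the backing device to the cache set; the UUID comes from the -C step
# (or from bcache-super-show /dev/nvme0n1p1)
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Writeback mode so small synchronous writes are absorbed by the NVMe
echo writeback > /sys/block/bcache0/bcache/cache_mode

# The cached device then shows up as /dev/bcache0
mkfs.xfs /dev/bcache0
---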
> > > > > >>
> > > > > >> Reviving this thread, would you be willing to share the details
> > > > > >> of the low latency hardware design?  Are you optimizing for NFS or 
> > > > > >> iSCSI?
> > > > > >
> > > > > > Both really, just trying to get the write latency as low as
> > > > > > possible. As you know, VMware does everything with lots of
> > > > > > unbuffered small IOs, e.g. when you migrate a VM or as thin
> > > > > > VMDKs grow.
> > > > > >
> > > > > > Even with storage vMotions, which might kick off 32 threads, since
> > > > > > they all roughly fall on the same PG there still appears to be a
> > > > > > bottleneck from contention on the PG itself.
> > > > > >
> > > > > > These were the sort of things I was trying to optimise for, to make 
> > > > > > the time spent in Ceph as minimal as possible for each IO.
> > > > > >
> > > > > > So onto the hardware. Through reading various threads and 
> > > > > > experiments on my own I came to the following conclusions.
> > > > > >
> > > > > > -You need the highest possible frequency on the CPU cores, which
> > > > > > normally also means fewer of them.
> > > > > > -Dual sockets are probably bad and will impact performance.
> > > > > > -Use NVMe's for journals to minimise latency.
> > > > > >
> > > > > > The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an
> > > > > > Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board,
> > > > > > which has 10GBase-T onboard as well as 8x SATA and 8x SAS, so no
> > > > > > expansion cards are required. Actually this design, as well as
> > > > > > being very performant for Ceph, also works out very cheap as you
> > > > > > are using low end server parts. The whole lot + 12x 7.2k disks all
> > > > > > goes into a 1U case.
> > > > > >
> > > > > > During testing I noticed that by default C-states and P-states
> > > > > > slaughter performance. After forcing the max C-state to 1 and
> > > > > > forcing the CPU frequency up to max, I was seeing 600us latency
> > > > > > for a 4KB write to a 3x replica pool, or around 1600 IOPS, at QD=1.
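
A minimal sketch of how one might force the C-state floor Nick describes
(the parameter names are for the Intel idle / ACPI drivers and are an
assumption about his setup; adjust for your platform):
---
# Kernel command line, takes effect after a reboot:
#   intel_idle.max_cstate=1 processor.max_cstate=1

# Verify which idle states exist and what the cores actually do under load:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
turbostat sleep 5        # per-core C-state residency and actual MHz
---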
> > > > > >
> > > > > > A few other observations:
> > > > > > 1. Power usage is around 150-200W for this config with 12x 7.2k
> > > > > > disks.
> > > > > > 2. CPU usage when maxing out the disks is only around 10-15%, so
> > > > > > plenty of headroom for more disks.
> > > > > > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage.
> > > > > > 4. No idea about CPU load for pure SSD nodes, but based on the
> > > > > > current disks, you could maybe expect ~10000 IOPS per node before
> > > > > > maxing out the CPUs.
> > > > > > 5. A single NVMe seems to be able to journal 12 disks with no
> > > > > > problem during normal operation; no doubt a specific benchmark
> > > > > > could max it out though.
> > > > > > 6. There are slightly faster Xeon E3's, but price/performance =
> > > > > > diminishing returns.
> > > > > >
> > > > > > Hope that answers all your questions.
> > > > > > Nick
> > > > > >
> > > > > >>
> > > > > >> Thank you,
> > > > > >> Alex
> > > > > >>
> > > > > >> >
> > > > > >> >>
> > > > > >> >> Would it help to put in multiple P3700s per OSD node to
> > > > > >> >> improve performance for a single thread (for example a
> > > > > >> >> Storage vMotion)?
> > > > > >> >
> > > > > >> > Most likely not; it's all the other parts of the puzzle which
> > > > > >> > are causing the latency. ESXi was designed for storage arrays
> > > > > >> > that service IOs in the 100us-1ms range. Ceph is probably about
> > > > > >> > 10x slower than this, hence the problem. Disable the BBWC on a
> > > > > >> > RAID controller or SAN and you will see the same behaviour.
> > > > > >> >
> > > > > >> >>
> > > > > >> >> Regards
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> > > > > >> >> >> -----Original Message-----
> > > > > >> >> >> From: ceph-users
> > > > > >> >> >> [mailto:ceph-users-boun...@lists.ceph.com]
> > > > > >> >> >> On Behalf Of w...@globe.de
> > > > > >> >> >> Sent: 21 July 2016 13:04
> > > > > >> >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> > > > > >> >> >> Cc: ceph-users@lists.ceph.com
> > > > > >> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > > > > >> >> >> Performance
> > > > > >> >> >>
> > > > > >> >> >> Hi,
> > > > > >> >> >>
> > > > > >> >> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in 
> > > > > >> >> >> production right now?
> > > > > >> >> > It's just been built, not running yet.
> > > > > >> >> >
> > > > > >> >> >> So if you start a storage migration you get only 200 MByte/s 
> > > > > >> >> >> right?
> > > > > >> >> > I wish. My current cluster (not this new one) would
> > > > > >> >> > storage migrate at ~10-15MB/s. Serial latency is the problem:
> > > > > >> >> > without being able to buffer, ESXi waits on an ack for each
> > > > > >> >> > IO before sending the next.
> > > > > >> >> > Also it submits the migrations in 64KB chunks, unless you
> > > > > >> >> > get VAAI working. I think ESXi will try and do them in
> > > > > >> >> > parallel, which will help as well.
> > > > > >> >> >
> > > > > >> >> >> I think it would be awesome if you get 1000 MByte/s
> > > > > >> >> >>
> > > > > >> >> >> Where is the Bottleneck?
> > > > > >> >> > Latency serialisation: without a buffer, you can't drive
> > > > > >> >> > the devices to 100%. With buffered IO (or high queue depths)
> > > > > >> >> > I can max out the journals.
> > > > > >> >> >
> > > > > >> >> >> A fio test from Sebastien Han gives us 400 MByte/s raw
> > > > > >> >> >> performance from the P3700.
> > > > > >> >> >>
> > > > > >> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
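
For anyone who doesn't want to dig through the post: from memory, the journal
suitability test described there is essentially a single-threaded O_DSYNC
write, roughly of this form (the device path is a placeholder; this writes
directly to the device, so only point it at a disk you can wipe):
---
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
---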
> > > > > >> >> >>
> > > > > >> >> >> How could it be that the rbd client performance is 50% 
> > > > > >> >> >> slower?
> > > > > >> >> >>
> > > > > >> >> >> Regards
> > > > > >> >> >>
> > > > > >> >> >>
> > > > > >> >> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> > > > > >> >> >>> I've had a lot of pain with this, smaller block sizes are 
> > > > > >> >> >>> even worse.
> > > > > >> >> >>> You want to try and minimize latency at every point as
> > > > > >> >> >>> there is no buffering happening in the iSCSI stack. This
> > > > > >> >> >>> means:
> > > > > >> >> >>>
> > > > > >> >> >>> 1. Fast journals (NVMe or NVRAM)
> > > > > >> >> >>> 2. 10Gb or better networking
> > > > > >> >> >>> 3. Fast CPUs (GHz)
> > > > > >> >> >>> 4. Fix CPU C-states to C1
> > > > > >> >> >>> 5. Fix CPU frequency to max
> > > > > >> >> >>>
> > > > > >> >> >>> Also I can't be sure, but I think there is a metadata
> > > > > >> >> >>> update happening with VMFS, particularly if you are using
> > > > > >> >> >>> thin VMDKs; this can also be a major bottleneck.
> > > > > >> >> >>> For my use case, I've switched over to NFS as it has given
> > > > > >> >> >>> much more performance at scale and less headache.
> > > > > >> >> >>>
> > > > > >> >> >>> For the RADOS Run, here you go (400GB P3700):
> > > > > >> >> >>>
> > > > > >> >> >>> Total time run:         60.026491
> > > > > >> >> >>> Total writes made:      3104
> > > > > >> >> >>> Write size:             4194304
> > > > > >> >> >>> Object size:            4194304
> > > > > >> >> >>> Bandwidth (MB/sec):     206.842
> > > > > >> >> >>> Stddev Bandwidth:       8.10412
> > > > > >> >> >>> Max bandwidth (MB/sec): 224
> > > > > >> >> >>> Min bandwidth (MB/sec): 180
> > > > > >> >> >>> Average IOPS:           51
> > > > > >> >> >>> Stddev IOPS:            2
> > > > > >> >> >>> Max IOPS:               56
> > > > > >> >> >>> Min IOPS:               45
> > > > > >> >> >>> Average Latency(s):     0.0193366
> > > > > >> >> >>> Stddev Latency(s):      0.00148039
> > > > > >> >> >>> Max latency(s):         0.0377946
> > > > > >> >> >>> Min latency(s):         0.015909
> > > > > >> >> >>>
> > > > > >> >> >>> Nick
> > > > > >> >> >>>
> > > > > >> >> >>>> -----Original Message-----
> > > > > >> >> >>>> From: ceph-users
> > > > > >> >> >>>> [mailto:ceph-users-boun...@lists.ceph.com]
> > > > > >> >> >>>> On Behalf Of Horace
> > > > > >> >> >>>> Sent: 21 July 2016 10:26
> > > > > >> >> >>>> To: w...@globe.de
> > > > > >> >> >>>> Cc: ceph-users@lists.ceph.com
> > > > > >> >> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > > > > >> >> >>>> Performance
> > > > > >> >> >>>>
> > > > > >> >> >>>> Hi,
> > > > > >> >> >>>>
> > > > > >> >> >>>> Same here, I've read some blogs saying that VMware will
> > > > > >> >> >>>> frequently verify the locking on VMFS over iSCSI, hence
> > > > > >> >> >>>> it will have much slower performance than NFS (which has
> > > > > >> >> >>>> a different locking mechanism).
> > > > > >> >> >>>>
> > > > > >> >> >>>> Regards,
> > > > > >> >> >>>> Horace Ng
> > > > > >> >> >>>>
> > > > > >> >> >>>> ----- Original Message -----
> > > > > >> >> >>>> From: w...@globe.de
> > > > > >> >> >>>> To: ceph-users@lists.ceph.com
> > > > > >> >> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> > > > > >> >> >>>> Subject: [ceph-users] Ceph + VMware + Single Thread
> > > > > >> >> >>>> Performance
> > > > > >> >> >>>>
> > > > > >> >> >>>> Hi everyone,
> > > > > >> >> >>>>
> > > > > >> >> >>>> We see relatively slow single-thread performance on the
> > > > > >> >> >>>> iSCSI nodes of our cluster.
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> Our setup:
> > > > > >> >> >>>>
> > > > > >> >> >>>> 3 Racks:
> > > > > >> >> >>>>
> > > > > >> >> >>>> 18x Data Nodes, 3 Mon Nodes, 3 iSCSI Gateway Nodes with 
> > > > > >> >> >>>> tgt (rbd cache off).
> > > > > >> >> >>>>
> > > > > >> >> >>>> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per
> > > > > >> >> >>>> SSD) and 6x WD Red 1TB per Data Node as OSD.
> > > > > >> >> >>>>
> > > > > >> >> >>>> Replication = 3
> > > > > >> >> >>>>
> > > > > >> >> >>>> chooseleaf = 3 type Rack in the crush map
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> We only get ca. 90 MByte/s on the iSCSI Gateway Servers 
> > > > > >> >> >>>> with:
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> If we test with:
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 32
> > > > > >> >> >>>>
> > > > > >> >> >>>> we get ca. 600 - 700 MByte/s
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> We plan to replace the Samsung SSDs with Intel DC P3700
> > > > > >> >> >>>> PCIe NVMe's for the journal to get better single-thread
> > > > > >> >> >>>> performance.
> > > > > >> >> >>>>
> > > > > >> >> >>>> Is there anyone out there who has an Intel P3700 for the
> > > > > >> >> >>>> journal and can give me test results with:
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> Thank you very much !!
> > > > > >> >> >>>>
> > > > > >> >> >>>> Kind Regards !!
> > > > > >> >> >>>>
> > > > > >> >
> > > > > >> >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > >
> > >
> > 
> > 
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com       Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
