On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk <n...@fisk.me.uk>
> *Cc:* Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:
>
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake <w...@globe.de>
> *Cc:* n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> I understand all of your technical improvements. But why do you not simply use, for example, an Areca RAID controller with 8 GB cache and BBU on top in every Ceph node? Configure n times RAID 0 on the controller and enable write-back cache. That must be a latency "killer" like in all the proprietary storage arrays, or not?
>
> Best Regards!
>
> What we saw specifically with Areca cards is that performance is excellent in benchmarking and for bursty loads. However, once we started loading with more constant workloads (we replicate databases and files to our Ceph cluster), this looks to have saturated the relatively small Areca NVDIMM caches and we went back to pure drive-based performance.
>
> Yes, I think that is a valid point. Although low latency, you are still having to write to the disks twice (journal+data), so once the caches on the cards start filling up, you are going to hit problems.
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 HDDs), in hopes that it would help reduce the noisy neighbor impact. That worked, but now the overall latency is really high at times, not always. A Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with too many IOPS, which sends their latency sky high. Overall we are functioning fine, but I sure would like storage vMotion and other large operations to be faster.
>
> Yeah, this is the biggest pain point I think. Normal VM ops are fine, but if you ever have to move a multi-TB VM, it's just too slow.
>
> If you use iSCSI with VAAI and are migrating a thick provisioned vmdk, then performance is actually quite good, as the block sizes used for the copy are a lot bigger.
>
> However, my use case required thin provisioned VMs + snapshots, and I found that using iSCSI you have no control over the fragmentation of the vmdks, so it is the read performance that then suffers (certainly with 7.2k disks).
>
> Also with thin provisioned vmdks I think I was seeing PG contention with the updating of the VMFS metadata, although I can't be sure.
>
> I am thinking I will test a few different schedulers and readahead settings to see if we can improve this by parallelizing reads. Also will test NFS, but need to determine whether to do krbd/knfsd or something more interesting like CephFS/Ganesha.
>
> As you know, I'm on NFS now. I've found it a lot easier to get going and a lot less sensitive to making config adjustments without everything suddenly dropping offline.
> The fact that you can specify the extent size on XFS helps massively with using thin vmdks/snapshots to avoid fragmentation. Storage vMotions are a bit faster than iSCSI, but I think I am hitting PG contention when ESXi tries to write 32 copy threads to the same object. There is probably some tuning that could be done here (RBD striping???), but this is the best it's been for a long time and I'm reluctant to fiddle any further.
>
> We have moved ahead and added NFS support to Storcium, and are now able to run NFS servers with Pacemaker in HA mode (all agents are public at https://github.com/akurz/resource-agents/tree/master/heartbeat). I can confirm that VM performance is definitely better and benchmarks are smoother (in Windows we can see a lot of choppiness with iSCSI; NFS is choppy on writes but smooth on reads, likely due to the bursty nature of OSD filesystems when dealing with that small IO size).
>
> Were you using extsz=16384 at creation time for the filesystem? I saw kernel memory deadlock messages during vMotion, such as:
>
> XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc (mode:0x2400240)
>
> And analyzing fragmentation:
>
> root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
> xfs_db> frag -d
> actual 0, ideal 0, fragmentation factor 0.00%
> xfs_db> frag -f
> actual 1863960, ideal 74, fragmentation factor 100.00%
>
> Just from two vMotions.
>
> Are you seeing anything similar?
>
> Found your post on setting the XFS extent size hint for sparse files:
>
> xfs_io -c extsize 16M /mountpoint
>
> Will test - fragmentation is definitely present without this.
>
> Yeah, I got bit by that when I first set it up; I then created another datastore with that extent hint and moved everything across. Haven't seen any kmem alloc errors since, and sequential read performance is a lot better than thin provisioned iSCSI.
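Good to know, thank you. For anyone else following this thread, below is roughly the sequence I am using to apply the hint on the new datastore and re-check fragmentation afterwards. The device and mount point names are just examples, and note the hint only affects files created after it is set:

  xfs_io -c "extsize 16m" /srv/nfs/datastore1   # set a 16M extent size hint on the mount point; new files inherit it
  xfs_io -c "extsize" /srv/nfs/datastore1       # confirm the hint
  xfs_db -r -c "frag -f" /dev/rbd21             # re-check file fragmentation later (read-only)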
Nick, what mount options do you use? Defaults (with the extsize hint above) work well, but we are getting occasional load spikes due to IO wait. Is logbsize=256k beneficial in sync mode? Here is what I am using:

root@roc-5r-scd218:/var/log/atop# xfs_info /dev/rbd18
meta-data=/dev/rbd18             isize=256    agcount=33, agsize=16776192 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=536870912, imaxpct=5
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=262144, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Thank you,
Alex

> > Thank you,
> > Alex
>
> > But as mentioned above, thick vmdks with VAAI might be a really good fit.
>
> > Thanks for your very valuable info on analysis and hw build.
>
> > Alex
>
> > On 21.08.2016 at 09:31, Nick Fisk <n...@fisk.me.uk> wrote:
>
> >> -----Original Message-----
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >>>> -----Original Message-----
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay, and what is your plan now to speed things up?
> >>>
> >>> Now that I have come up with a lower latency hardware design, there is not much further improvement possible until persistent RBD caching is implemented, as you will be moving the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the low latency hardware design? Are you optimizing for NFS or iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible. As you know, VMware does everything with lots of unbuffered small IOs, e.g. when you migrate a VM or as thin vmdks grow.
> >
> > Even with storage vMotions, which might kick off 32 threads, as they all roughly fall on the same PG there still appears to be a bottleneck with contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments of my own I came to the following conclusions:
> >
> > - You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
> > - Dual sockets are probably bad and will impact performance.
> > - Use NVMe's for journals to minimise latency.
> >
> > The end result was OSD nodes based off a 3.5GHz Xeon E3v5 with an Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well as being very performant for Ceph, this design also works out very cheap, as you are using low-end server parts. The whole lot + 12x 7.2k disks all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter performance.
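> >
> > (For anyone who has not done this before: the exact flags vary by distro and kernel, but it usually comes down to a couple of kernel boot parameters to cap the C-state plus the performance governor, roughly:
> >
> >   # kernel command line: cap C-states at C1 for both intel_idle and acpi_idle
> >   intel_idle.max_cstate=1 processor.max_cstate=1
> >
> >   # keep the cores at max frequency
> >   cpupower frequency-set -g performance
> >
> > Treat that as a sketch rather than exact commands for your setup.)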
> > After forcing the max c-state to 1 and forcing the CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 3x replica pool, or around 1600 IOPS, and this is at QD=1.
> >
> > A few other observations:
> >
> > 1. Power usage is around 150-200W for this config with 12x 7.2k disks.
> > 2. CPU usage when maxing out the disks is only around 10-15%, so plenty of headroom for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage.
> > 4. No idea about CPU load for pure SSD nodes, but based on the current disks, you could maybe expect ~10000 IOPS per node before maxing out the CPUs.
> > 5. A single NVMe seems to be able to journal 12 disks with no problem during normal operation; no doubt a specific benchmark could max it out though.
> > 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns.
> >
> > Hope that answers all your questions.
> > Nick
> >
> >> Thank you,
> >> Alex
> >>
> >>>> Would it help to put multiple P3700s in each OSD node to improve performance for a single thread (example: Storage vMotion)?
> >>>
> >>> Most likely not, it's all the other parts of the puzzle which are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
> >>>
> >>>> Regards
> >>>>
> >>>> On 21.07.16 at 14:17, Nick Fisk wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
> >>>>>> Sent: 21 July 2016 13:04
> >>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>>>> Cc: ceph-users@lists.ceph.com
> >>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
> >>>>> It's just been built, not running yet.
> >>>>>
> >>>>>> So if you start a storage migration you get only 200 MByte/s, right?
> >>>>> I wish. My current cluster (not this new one) would storage migrate at ~10-15MB/s. Serial latency is the problem: without being able to buffer, ESXi waits on an ack for each IO before sending the next. Also it submits the migrations in 64kb chunks, unless you get VAAI working. I think ESXi will try and do them in parallel, which will help as well.
> >>>>>
> >>>>>> I think it would be awesome if you got 1000 MByte/s.
> >>>>>>
> >>>>>> Where is the bottleneck?
> >>>>> Latency serialisation: without a buffer, you can't drive the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
> >>>>>
> >>>>>> A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the P3700.
> >>>>>>
> >>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >>>>>>
> >>>>>> How could it be that the rbd client performance is 50% slower?
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>>> On 21.07.16 at 12:15, Nick Fisk wrote:
> >>>>>>> I've had a lot of pain with this; smaller block sizes are even worse.
> >>>>>>> You want to try and minimize latency at every point, as there is
> >>>>>>> no buffering happening in the iSCSI stack. This means:
> >>>>>>>
> >>>>>>> 1. Fast journals (NVMe or NVRAM)
> >>>>>>> 2. 10Gb or better networking
> >>>>>>> 3. Fast CPUs (GHz)
> >>>>>>> 4. Fix CPU c-states to C1
> >>>>>>> 5. Fix CPU frequency to max
> >>>>>>>
> >>>>>>> Also, I can't be sure, but I think there is a metadata update
> >>>>>>> happening with VMFS, particularly if you are using thin VMDKs;
> >>>>>>> this can also be a major bottleneck. For my use case, I've
> >>>>>>> switched over to NFS as it has given much more performance at
> >>>>>>> scale and less headache.
> >>>>>>>
> >>>>>>> For the RADOS run, here you go (400GB P3700):
> >>>>>>>
> >>>>>>> Total time run:         60.026491
> >>>>>>> Total writes made:      3104
> >>>>>>> Write size:             4194304
> >>>>>>> Object size:            4194304
> >>>>>>> Bandwidth (MB/sec):     206.842
> >>>>>>> Stddev Bandwidth:       8.10412
> >>>>>>> Max bandwidth (MB/sec): 224
> >>>>>>> Min bandwidth (MB/sec): 180
> >>>>>>> Average IOPS:           51
> >>>>>>> Stddev IOPS:            2
> >>>>>>> Max IOPS:               56
> >>>>>>> Min IOPS:               45
> >>>>>>> Average Latency(s):     0.0193366
> >>>>>>> Stddev Latency(s):      0.00148039
> >>>>>>> Max latency(s):         0.0377946
> >>>>>>> Min latency(s):         0.015909
> >>>>>>>
> >>>>>>> Nick
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
> >>>>>>>> Sent: 21 July 2016 10:26
> >>>>>>>> To: w...@globe.de
> >>>>>>>> Cc: ceph-users@lists.ceph.com
> >>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Same here. I've read some blogs saying that VMware will frequently verify the locking on VMFS over iSCSI, hence it will have much slower performance than NFS (which uses a different locking mechanism).
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Horace Ng
> >>>>>>>>
> >>>>>>>> ----- Original Message -----
> >>>>>>>> From: w...@globe.de
> >>>>>>>> To: ceph-users@lists.ceph.com
> >>>>>>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> >>>>>>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> We see relatively slow single-thread performance on the iSCSI nodes of our cluster.
> >>>>>>>>
> >>>>>>>> Our setup:
> >>>>>>>>
> >>>>>>>> 3 racks:
> >>>>>>>>
> >>>>>>>> 18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off).
> >>>>>>>>
> >>>>>>>> 2x Samsung SM863 Enterprise SSDs for journals (3 OSDs per SSD) and 6x WD Red 1TB per data node as OSDs.
> >>>>>>>>
> >>>>>>>> Replication = 3
> >>>>>>>>
> >>>>>>>> chooseleaf = 3 type rack in the crush map
> >>>>>>>>
> >>>>>>>> We get only ca. 90 MByte/s on the iSCSI gateway servers with:
> >>>>>>>>
> >>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
> >>>>>>>>
> >>>>>>>> If we test with:
> >>>>>>>>
> >>>>>>>> rados bench -p rbd 60 write -b 4M -t 32
> >>>>>>>>
> >>>>>>>> we get ca. 600-700 MByte/s.
> >>>>>>>>
> >>>>>>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe for the journal to get better single-thread performance.
> >>>>>>>>
> >>>>>>>> Is there anyone out there who has an Intel P3700 for the journal and can give me test results with:
> >>>>>>>>
> >>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
> >>>>>>>>
> >>>>>>>> Thank you very much!!
> >>>>>>>>
> >>>>>>>> Kind Regards!!
> >
> > --
> > Alex Gorbachev
> > Storcium
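P.S. On the logbsize question above: the plan here is simply to remount one datastore with larger log buffers and compare storage vMotion behaviour side by side, something along these lines (device and path are just examples, and I have not yet verified that this helps with sync NFS exports):

  umount /srv/nfs/ds1
  mount -o logbufs=8,logbsize=256k /dev/rbd18 /srv/nfs/ds1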
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com