Hi Nick,

On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:
<snip>
> However, there are a number of pain points with iSCSI + ESXi + RBD, and
> they all mainly centre on write latency. It seems VMFS was designed around
> the fact that enterprise storage arrays service writes in 10-100us,
> whereas Ceph will service them in 2-10ms.
>
> 1. Thin provisioning makes things slow. I believe the main cause is that
> when growing and zeroing the new blocks, metadata needs to be updated and
> the blocks zeroed. Both issue small IOs, which would normally not be a
> problem, but with Ceph they become a bottleneck to overall IO on the
> datastore.
>
> 2. Snapshots effectively turn all IO into 64KB IOs. Again, a traditional
> SAN will coalesce these back into a stream of larger IOs before committing
> to disk. With Ceph, however, each IO takes 2-10ms, and so everything seems
> slow. The future persistent RBD cache feature may go a long way towards
> helping with this.

Are you referring to ESXi snapshots? Specifically, if a VM is running off a
snapshot
(https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
its IO will drop to 64KB "grains"?

> 3. >2TB VMDKs with snapshots use a different allocation mode, which works
> in 4KB chunks instead of 64KB ones. This makes the problem 16 times worse
> than the above.
>
> 4. Any of the above also applies when migrating machines around, so VMs
> can take hours/days to move.
>
> 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO,
> you get thin provisioning, but no page cache or readahead, so performance
> can nosedive if those are needed.

Would FILEIO not also leverage the Linux scheduler to do IO coalescing and
help with (2)? After all, FILEIO also uses the dirty-flush mechanism of the
page cache (and makes IO somewhat crash-unsafe at the same time).

> 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to
> seeing APD/PDL even when you think you have finally got everything working
> great.

We were used to seeing APD/PDL all the time with LIO, but have seen pretty
much none with SCST > 3.1. Most of the ESXi problems are just with periods
of high latency, which are not a problem for the hypervisor itself, but
rather for the databases or applications inside the VMs.

Thanks,
Alex

> Normal IO from eager-zeroed VMs with no snapshots should, however, perform
> OK. So it depends what your workload is.
>
> And then comes NFS. It's very easy to set up, very easy to configure for
> HA, and works pretty well overall. You don't seem to get any of the IO
> size penalties when using snapshots. If you mount with discard, thin
> provisioning is done by Ceph. You can defragment the FS on the proxy node
> and do several other things that you can't do with VMFS. Just make sure
> you run the server in sync mode to avoid data loss.
>
> The only downside is that every IO causes an IO to the FS and one to the
> FS journal, so you effectively double your IO. But if your Ceph backend
> can support it, then it shouldn't be too much of a problem.
>
> Now to the original poster: assuming the iSCSI node is just kernel
> mounting the RBD, I would run iostat on it to try and see what sort of
> latency you are seeing at that point. Also do the same with esxtop +u, and
> look at the write latency there, both whilst running the fio in the VM.
> This should hopefully let you see if there is just a gradual increase as
> you go from hop to hop, or if there is an obvious culprit.
>
> Can you also confirm your kernel version?
>
> With 1GB networking I think you will struggle to get your write latency
> much below 10-15ms, but from your example ~30ms is still a bit high. I
> wonder if the default queue depths on your iSCSI target are too low as
> well?
>
> Nick
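To make the NFS suggestion above concrete, here is a minimal sketch; the
image name, mount point, subnet, and device name are hypothetical, so adjust
them to your environment:

  # On the proxy node: map the RBD, make a filesystem, and mount it with
  # discard so space freed in the FS is released back to Ceph.
  rbd map rbd/datastore1
  mkfs.xfs /dev/rbd0
  mount -o discard /dev/rbd0 /mnt/datastore1

  # /etc/exports -- 'sync' commits writes to stable storage before
  # acknowledging them, which is the "run the server in sync mode" part.
  /mnt/datastore1 192.168.50.0/24(rw,sync,no_root_squash)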
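And for the iostat/esxtop check, a minimal version, assuming the RBD is
kernel-mapped as /dev/rbd0 on the proxy node:

  # On the proxy node, while fio runs inside the guest: w_await is the
  # average write latency in ms seen at the RBD device.
  iostat -xmt 1 /dev/rbd0

  # On the ESXi host, run esxtop and press 'u' for the disk device view:
  # DAVG/cmd is latency at the device, KAVG/cmd is time spent in the
  # VMkernel, and GAVG/cmd is the total latency the guest sees.
  esxtop

Comparing w_await on the proxy with GAVG/cmd on the host shows roughly how
much latency each hop is adding.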
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Oliver Dzombic
>> Sent: 01 July 2016 09:27
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users]
>> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
>>
>> Hi,
>>
>> my experience:
>>
>> ceph + iscsi (multipath) + vmware == worst
>>
>> Better to search for another solution.
>>
>> ceph + nfs + vmware might have much better performance.
>>
>> --------
>>
>> If you are able to get vmware running with iscsi and ceph, I would be
>> >>very<< interested in what/how you did that.
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Address:
>> IP Interactive UG (haftungsbeschraenkt)
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402, district court of Hanau
>> Management: Oliver Dzombic
>>
>> Tax no.: 35 236 3622 1
>> VAT ID: DE274086107
>>
>>
>> On 01.07.2016 at 07:04, mq wrote:
>> > Hi list,
>> > I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways attached
>> > to VMware. The performance is bad. I have turned off VAAI following
>> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665.
>> >
>> > My cluster:
>> > 3 ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps (3*10K SAS, 1*480G SSD) per
>> > node, SSD as journal
>> > 1 vmware node: 2*E5-2620, 64G mem, 2*1Gbps
>> >
>> > # ceph -s
>> >     cluster 0199f68d-a745-4da3-9670-15f2981e7a15
>> >      health HEALTH_OK
>> >      monmap e1: 3 mons at
>> > {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.50.93:6789/0}
>> >             election epoch 22, quorum 0,1,2 node1,node2,node3
>> >      osdmap e200: 9 osds: 9 up, 9 in
>> >             flags sortbitwise
>> >       pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
>> >             18339 MB used, 5005 GB / 5023 GB avail
>> >                  448 active+clean
>> >   client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr
>> >
>> > # sudo ceph osd tree
>> > ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > -1 4.90581 root default
>> > -2 1.63527     host node1
>> >  0 0.54509         osd.0       up  1.00000          1.00000
>> >  1 0.54509         osd.1       up  1.00000          1.00000
>> >  2 0.54509         osd.2       up  1.00000          1.00000
>> > -3 1.63527     host node2
>> >  3 0.54509         osd.3       up  1.00000          1.00000
>> >  4 0.54509         osd.4       up  1.00000          1.00000
>> >  5 0.54509         osd.5       up  1.00000          1.00000
>> > -4 1.63527     host node3
>> >  6 0.54509         osd.6       up  1.00000          1.00000
>> >  7 0.54509         osd.7       up  1.00000          1.00000
>> >  8 0.54509         osd.8       up  1.00000          1.00000
>> >
>> > A Linux VM in VMware running fio gets just 64 IOPS on 4k randwrite and
>> > latency is high; a dd test gives just 11MB/s.
>> >
>> > fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=100G
>> > -filename=/dev/sdb -name="EBS 4KB randwrite test" -iodepth=32 -runtime=60
>> >
>> > EBS 4KB randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
>> > ioengine=libaio, iodepth=32
>> > fio-2.0.13
>> > Starting 1 thread
>> > Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0 iops] [eta 00m:00s]
>> > EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0: pid=6766: Wed Jun 29 21:28:06 2016
>> >   write: io=15696KB, bw=264627 B/s, iops=64, runt= 60737msec
>> >     slat (usec): min=10, max=213, avg=35.54, stdev=16.41
>> >     clat (msec): min=1, max=31368, avg=495.01, stdev=1862.52
>> >      lat (msec): min=2, max=31368, avg=495.04, stdev=1862.52
>> >     clat percentiles (msec):
>> >      |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
>> >      | 30.00th=[    9], 40.00th=[   10], 50.00th=[  198], 60.00th=[  204],
>> >      | 70.00th=[  208], 80.00th=[  217], 90.00th=[  799], 95.00th=[ 1795],
>> >      | 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
>> >      | 99.99th=[16712]
>> >     bw (KB/s)  : min=   36, max=11960, per=100.00%, avg=264.77, stdev=1110.81
>> >     lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
>> >     lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
>> >     lat (msec) : 2000=4.03%, >=2000=4.77%
>> >   cpu          : usr=0.02%, sys=0.22%, ctx=2973, majf=0, minf=18446744073709538907
>> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
>> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >      issued    : total=r=0/w=3924/d=0, short=r=0/w=0/d=0
>> >
>> > Run status group 0 (all jobs):
>> >   WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s,
>> > mint=60737msec, maxt=60737msec
>> >
>> > Disk stats (read/write):
>> >   sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694, util=100.00%
>> >
>> > Can anyone give me some suggestions to improve the performance?
>> >
>> > Regards,
>> >
>> > MQ
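One more data point worth collecting here: a 4k write benchmark run directly
against the cluster from the proxy node takes iSCSI, VMFS, and the guest out
of the picture entirely. A minimal sketch, assuming the image lives in the
default 'rbd' pool:

  # 60 seconds of 4KB writes, 32 in flight -- roughly matching the fio
  # job above; --no-cleanup keeps the objects so a read pass can follow.
  rados bench -p rbd 60 write -b 4096 -t 32 --no-cleanup

  # Remove the benchmark objects afterwards.
  rados -p rbd cleanup

If the average latency reported here sits in the 2-10ms range Nick
describes, while the guest sees ~500ms average clat as in the fio output
above, the bottleneck is in the iSCSI/ESXi path rather than in Ceph itself.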