Hi Nick,

On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:

<snip>

> However, there are a number of pain points with iSCSI + ESXi + RBD and they 
> all mainly centre on write latency. It seems VMFS was designed around the 
> fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph 
> will service them in 2-10ms.
>
> 1. Thin Provisioning makes things slow. I believe the main cause is that when 
> growing and zeroing the new blocks, metadata needs to be updated and the 
> block zero'd. Both issue small IO which would normally not be a problem, but 
> with Ceph it becomes a bottleneck to overall IO on the datastore.
>
> 2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN 
> will coalesce these back into a stream of larger IO's before committing to 
> disk. However with Ceph each IO takes 2-10ms and so everything seems slow. 
> The future feature of persistent RBD cache may go a long way to helping with 
> this.
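
The arithmetic behind that is worth spelling out. A rough sketch, assuming
a mid-range 4 ms Ceph write latency (the actual figure will vary):

```shell
# Serialized 64KB IOs at ~4 ms each cap a single stream well below
# what the disks could sustain with larger, coalesced writes.
lat_ms=4                 # assumed per-write round trip to Ceph
io_kb=64                 # snapshot "grain" size
iops=$((1000 / lat_ms))  # IOs per second at queue depth 1
echo "$((iops * io_kb / 1024)) MB/s"   # ~15 MB/s for a single stream
```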

Are you referring to ESXi snapshots?  Specifically, if a VM is running
off a snapshot 
(https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
its IO will drop to 64KB "grains"?

> 3. >2TB VMDK's with snapshots use a different allocation mode, which happens 
> in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse 
> than above.
>
> 4. Any of the above will also apply when migrating machines around, so VMs 
> can take hours/days to move.
>
> 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, 
> you get thin provisioning, but no pagecache or readahead, so performance can 
> nose dive if this is needed.

Wouldn't FILEIO also leverage the Linux IO scheduler to do IO coalescing
and help with (2)?  FILEIO goes through the page cache and its dirty-flush
mechanism (which also makes IO somewhat crash-unsafe at the same time).
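
For anyone comparing the two modes, the SCST side looks roughly like this
(device names and paths are made up, and attribute spellings should be
checked against your SCST version -- a sketch, not a tested config):

```
# /etc/scst.conf -- illustrative fragment only
HANDLER vdisk_fileio {
    DEVICE ds_file {
        filename /mnt/rbd/datastore.img   # served through the page cache
    }
}
HANDLER vdisk_blockio {
    DEVICE ds_blk {
        filename /dev/rbd0                # direct; no page cache/readahead
    }
}
```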

> 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to 
> seeing APD/PDL even when you think you have finally got everything working 
> great.

We used to see APD/PDL all the time with LIO, but have seen pretty much
none with SCST > 3.1.  Most of the ESXi problems are just with high
latency periods, which are not a problem for the hypervisor itself, but
rather for the databases or applications inside the VMs.

Thanks,
Alex

>
>
> Normal IO from eager-zeroed VMs with no snapshots, however, should perform 
> OK. So it depends what your workload is.
>
>
> And then comes NFS. It's very easy to setup, very easy to configure for HA, 
> and works pretty well overall. You don't seem to get any of the IO size 
> penalties when using snapshots. If you mount with discard, thin provisioning 
> is done by Ceph. You can defragment the FS on the proxy node and several 
> other things that you can't do with VMFS. Just make sure you run the server 
> in sync mode to avoid data loss.
>
> The only downside is that every IO causes an IO to the FS and one to the FS 
> journal, so you effectively double your IO. But if your Ceph backend can 
> support it, then it shouldn't be too much of a problem.
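
A minimal sketch of that NFS setup on the proxy node (paths, pool/image
names and the export network are placeholders; `sync` is the important
part):

```
# /etc/exports -- with 'sync' the server must not acknowledge writes
# it has not committed
/mnt/rbd-ds  192.168.50.0/24(rw,sync,no_root_squash)

# map the RBD and mount with discard so thin provisioning is handled
# by Ceph, then (re-)export:
rbd map rbd/datastore
mount -o discard /dev/rbd0 /mnt/rbd-ds
exportfs -ra
```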
>
> Now to the original poster, assuming the iSCSI node is just kernel-mounting 
> the RBD, I would run iostat on it to try and see what sort of latency you 
> are seeing at that point. Also do the same with esxtop + u, and look at the 
> write latency there, both whilst running fio in the VM. This should 
> hopefully let you see if there is just a gradual increase as you go from hop 
> to hop or if there is an obvious culprit.
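
On the proxy node that would look something like the following (rbd0 is a
placeholder for whatever device `rbd map` gave you):

```
# w_await (ms) = average time a write spends queued + serviced on this
# hop; run it while fio is going in the VM
iostat -x rbd0 1 5

# then on the ESXi host: run esxtop, press 'u' for the device view, and
# compare the DAVG/GAVG latency columns against w_await above
```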
>
> Can you also confirm your kernel version?
>
> With 1GB networking I think you will struggle to get your write latency much 
> below 10-15ms, but from your example ~30ms is still a bit high. I wonder if 
> the default queue depths on your iSCSI target are too low as well?
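
One quick thing to check on the initiator in that regard (sdb is a
placeholder for the iSCSI LUN; this is the per-device SCSI queue depth,
separate from any target-side limit):

```
cat /sys/block/sdb/device/queue_depth
# if it is very low, try raising it and re-run the fio test:
echo 128 > /sys/block/sdb/device/queue_depth
```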
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Oliver Dzombic
>> Sent: 01 July 2016 09:27
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users]
>> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
>>
>> Hi,
>>
>> My experience:
>>
>> ceph + iscsi (multipath) + vmware == worst.
>>
>> You would be better off searching for another solution;
>> vmware + nfs might give much better performance.
>>
>> --------
>>
>> If you are able to get vmware running with iscsi and ceph, I would be
>> >>very<< interested in what/how you did that.
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 01.07.2016 um 07:04 schrieb mq:
>> > Hi list,
>> > I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways attached
>> > to VMware. The performance is bad. I have turned off VAAI following
>> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
>> > My cluster:
>> > 3 Ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps NICs, (3*10K SAS + 1*480G SSD)
>> > per node, SSD as journal
>> > 1 VMware node: 2*E5-2620, 64G mem, 2*1Gbps NICs
>> >
>> > # ceph -s
>> >     cluster 0199f68d-a745-4da3-9670-15f2981e7a15
>> >      health HEALTH_OK
>> >      monmap e1: 3 mons at
>> >
>> {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.5
>> 0.93:6789/0}
>> >             election epoch 22, quorum 0,1,2 node1,node2,node3
>> >      osdmap e200: 9 osds: 9 up, 9 in
>> >             flags sortbitwise
>> >       pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
>> >             18339 MB used, 5005 GB / 5023 GB avail
>> >                  448 active+clean
>> >   client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr
>> >
>> > sudo ceph osd tree
>> > ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > -1 4.90581 root default
>> > -2 1.63527     host node1
>> > 0 0.54509         osd.0       up  1.00000          1.00000
>> > 1 0.54509         osd.1       up  1.00000          1.00000
>> > 2 0.54509         osd.2       up  1.00000          1.00000
>> > -3 1.63527     host node2
>> > 3 0.54509         osd.3       up  1.00000          1.00000
>> > 4 0.54509         osd.4       up  1.00000          1.00000
>> > 5 0.54509         osd.5       up  1.00000          1.00000
>> > -4 1.63527     host node3
>> > 6 0.54509         osd.6       up  1.00000          1.00000
>> > 7 0.54509         osd.7       up  1.00000          1.00000
>> > 8 0.54509         osd.8       up  1.00000          1.00000
>> >
>> >
>> >
>> > A Linux VM in VMware, running fio: the 4k randwrite result is just
>> > 64 IOPS, latency is high, and a dd test gives just 11MB/s.
>> >
>> > fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=100G
>> > -filename=/dev/sdb  -name="EBS 4KB randwrite test" -iodepth=32
>> > -runtime=60 EBS 4KB randwrite test: (g=0): rw=randwrite,
>> > bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
>> > fio-2.0.13
>> > Starting 1 thread
>> > Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0  iops] [eta
>> > 00m:00s] EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0:
>> > pid=6766: Wed Jun
>> > 29 21:28:06 2016
>> >   write: io=15696KB, bw=264627 B/s, iops=64 , runt= 60737msec
>> >     slat (usec): min=10 , max=213 , avg=35.54, stdev=16.41
>> >     clat (msec): min=1 , max=31368 , avg=495.01, stdev=1862.52
>> >      lat (msec): min=2 , max=31368 , avg=495.04, stdev=1862.52
>> >     clat percentiles (msec):
>> >      |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
>> >      | 30.00th=[    9], 40.00th=[   10], 50.00th=[  198], 60.00th=[  204],
>> >      | 70.00th=[  208], 80.00th=[  217], 90.00th=[  799], 95.00th=[ 1795],
>> >      | 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
>> >      | 99.99th=[16712]
>> >     bw (KB/s)  : min=   36, max=11960, per=100.00%, avg=264.77,
>> > stdev=1110.81
>> >     lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
>> >     lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
>> >     lat (msec) : 2000=4.03%, >=2000=4.77%
>> >   cpu          : usr=0.02%, sys=0.22%, ctx=2973, majf=0,
>> > minf=18446744073709538907
>> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%,
>> >>=64=0.0%
>> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >>=64=0.0%
>> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>> >>=64=0.0%
>> >      issued    : total=r=0/w=3924/d=0, short=r=0/w=0/d=0
>> >
>> > Run status group 0 (all jobs):
>> >   WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s,
>> > mint=60737msec, maxt=60737msec
>> >
>> > Disk stats (read/write):
>> >   sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694,
>> > util=100.00%
>> >
>> > Can anyone give me some suggestions to improve the performance?
>> >
>> > Regards
>> >
>> > MQ
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
