Hi David,

I'm also using Ceph to provide block devices to ESXi datastores.

Currently using tgt with the RBD backend to provide iSCSI.

I've also tried SCST, LIO and NFS; here's my take on each.

TGT
Pros: very stable, talks directly to RBD (librbd), easy to set up, has 
Pacemaker resource agents, OK performance
Cons: can't do graceful failover in Pacemaker, can't hot-extend disks, can't 
add Linux block caches (flashcache), not really maintained any more, stats 
aren't visible in iostat, doesn't support VAAI

LIO
Pros: good performance, active/passive ALUA, maintained
Cons: very unstable with RBD at the moment (see below)

SCST
Pros: good performance, stable
Cons: PITA to recompile after every kernel update

NFS
Pros: stable, maintained, the page cache acts as a read cache
Cons: limited support in ESXi 5.5 (better in 6), poor performance, doesn't use 
VMFS (is that a pro or a con?)

Just to touch on a few points: LIO currently has a problem with Ceph RBDs. If 
Ceph fails to complete an IO within ~10s, both ESXi and LIO enter a 
never-ending spiral of trying to abort each other. This is being actively 
worked on, along with active/active ALUA support, so LIO will probably become 
the best solution down the line.

I ended up choosing TGT as it was the only one I could use in a production 
setting. It's not ideal when you look at the list of cons, but after having 
several dropouts with LIO it's amazing what you will sacrifice for stability.

SCST is a good middle ground between TGT and LIO, but after hitting numerous 
kernel RBD bugs (see below) and having to keep trying different kernels, 
recompiling SCST gets old fast.

NFS is actually a really nice solution, but write latency is nearly double 
that of iSCSI. All ESXi writes are sync writes, so you effectively end up 
waiting for two Ceph IOs for each ESXi write: first the actual data write, 
then the journal write of the filesystem used for the NFS share. I was never 
able to get more than about 100-150 write IOPS out of NFS.
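
If it helps to see why, here's a rough back-of-the-envelope model of that 
effect (Python, purely illustrative; the 2-4ms Ceph write latency figure is 
the one I mention further down, and the function and numbers are just 
assumptions for the sketch):

# Rough model: each ESXi sync write over NFS waits for two serialised
# Ceph IOs (the data write, then the FS journal write for the NFS share).

def sync_write_iops(ceph_write_latency_s, ios_per_write):
    """Upper bound on sync write IOPS when each client write must wait
    for ios_per_write serialised Ceph IOs."""
    return 1.0 / (ceph_write_latency_s * ios_per_write)

for latency_ms in (2, 3, 4):   # typical Ceph write latency range
    iops = sync_write_iops(latency_ms / 1000.0, ios_per_write=2)
    print("%d ms per Ceph IO -> ~%.0f NFS write IOPS" % (latency_ms, iops))

# Prints roughly 125-250 IOPS; with real-world overheads on top, that
# lines up with the 100-150 write IOPS I was seeing.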

Which brings me onto my next point. 

Sync write latency. 

I think a lot of enterprise applications were designed with traditional 
enterprise storage in mind, which can service write IOs in times measured in 
microseconds, whereas Ceph write IOs tend to take around 2-4ms. Normally this 
isn't too much of a problem; however, when ESXi does things like consolidating 
snapshots, Storage vMotion or cloning, they are done with 64KB IOs.

1 / 0.004s = 250 IOPS
64KB * 250 = 16MB/s

That's not a hard limit, but you will tend to top out around there, which is 
not fun when you try to copy a 2TB VM. Of course, IO from VMs will be passed 
through to Ceph at whatever size it is submitted.
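
The same arithmetic as a tiny script, if you want to plug in your own latency 
and IO size (illustrative only; it assumes IOs are issued one at a time, which 
is what the 1/0.004 figure above implies, and uses the same 4ms and 64KB 
numbers):

# Throughput ceiling when IOs are issued one at a time (queue depth 1):
# each IO takes latency_s seconds and moves io_size_bytes bytes.

def qd1_limit(latency_s, io_size_bytes):
    iops = 1.0 / latency_s              # 1 / 0.004 = 250 IOPS
    return iops, iops * io_size_bytes   # 250 * 64KB = ~16MB/s

iops, bw = qd1_limit(0.004, 64 * 1024)
print("%.0f IOPS, %.1f MB/s" % (iops, bw / 2**20))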

I have been testing out something like flashcache to act as a traditional 
writeback cache; this boosts ESXi performance up to traditional-array-like 
levels (rough numbers after the list below). However:

1. You need SAS SSDs in your iSCSI nodes if you want HA
2. It's an extra layer for something to go wrong
3. It's an extra layer to manage with Pacemaker
4. You can't use it with TGT's RBD engine (flashcache needs a kernel block 
device, whereas the RBD engine uses librbd), which is probably the biggest 
blocker for me right now
5. The kernel RBD client tends to lag behind librbd
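
For what it's worth, the attraction of the writeback layer is easy to see with 
the same model. This is just a hypothetical sketch: the ~0.1ms SSD ack latency 
is my assumption, not a measured figure.

# Same one-IO-at-a-time model as above, comparing writes acknowledged by
# Ceph directly vs. writes acknowledged by a local SSD writeback cache
# (flashcache-style), which destages dirty blocks to Ceph in the background.

IO_SIZE = 64 * 1024   # 64KB, as used by ESXi copy/clone operations

def mb_per_s(latency_s):
    return (1.0 / latency_s) * IO_SIZE / 2**20

print("Ceph-acked writes (~4 ms):   %6.1f MB/s" % mb_per_s(0.004))
print("SSD-acked writes (~0.1 ms):  %6.1f MB/s" % mb_per_s(0.0001))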

Also, please be aware that in older kernels there is a bug which sets the TCP 
options wrong for RBD (fixed in 3.19), and in more recent kernels (3.19+, I 
think) another bug limits the maximum IO size as well as the maximum queue 
depth. I believe both are fixed in 4.2. Now that there is an RC of 4.2 I can 
finally start testing RBD to iSCSI/NFS again.

Hope that’s helpful
Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 20 July 2015 15:51
> To: 'David Casier' <david.cas...@aevoo.fr>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] osd_agent_max_ops relating to number of OSDs in
> the cache pool
> 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of David Casier
> > Sent: 20 July 2015 00:27
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] osd_agent_max_ops relating to number of OSDs
> > in the cache pool
> >
> > Nick Fisk <nick@...> writes:
> >
> > >
> > > Hi All,
> > >
> > > I'm doing some testing on the new High/Low speed cache tiering
> > > flushing and I'm trying to get my head round the effect that changing
> > > these 2 settings has on the flushing speed. When setting
> > > osd_agent_max_ops to 1, I can get up to 20% improvement before the
> > > osd_agent_max_high_ops value kicks in for high speed flushing, which
> > > is great for bursty workloads.
> > >
> > > As I understand it, these settings loosely affect the number of
> > > concurrent operations the cache pool OSDs will flush down to the base
> > > pool.
> > >
> > > I may have got completely the wrong idea in my head, but I can't
> > > understand how a static default setting will work with different
> > > cache/base ratios. For example, if I had a relatively small number of
> > > very fast cache tier OSDs (PCI-E SSDs perhaps) and a much larger
> > > number of base tier OSDs, would the value need to be increased to
> > > ensure sufficient utilisation of the base tier and make sure that the
> > > cache tier doesn't fill up too fast?
> > >
> > > Alternatively, where the cache tier is based on spinning disks or
> > > where the base tier is not comparatively as large, the value may need
> > > to be reduced to stop it saturating the disks.
> > >
> > > Any thoughts?
> > >
> > > Nick
> > >
> >
> >
> > Hi Nick,
> > The best approach is for the working set not to exceed the size of the
> > tier pool.
> > If the working set does not fit in the tier pool, the average rates
> > should not exceed the performance of the base pool.
> 
> Hi David,
> 
> Thanks for your response. I know that in an ideal scenario your working set
> should fit in the tier; however, you will often be copying in new data or
> running some sort of workload which causes a dramatic change in the cache
> contents. Part of this work is trying to minimise the impact of cache
> flushing.
> 
> Nick
> 
> >
> > Regards,
> > David Casier.



