Hi,
I have an all-SSD cluster on my hands, currently running Dumpling (with plans 
to upgrade soon) and OpenStack with RBD on top of that. While I am overall 
quite happy with the performance (it scales well across clients), there is one 
area where it really fails badly: big database workloads.

Typically, what a well-behaved database does is commit every transaction to 
disk before confirming it, so on a “typical” cluster with a write latency of 
5 ms (with an SSD journal) the maximum number of transactions per second for a 
single client is 200 (likely more like 100, depending on the filesystem). 
Now, that’s not _too_ bad when running hundreds of small databases, but it’s 
nowhere near the performance required to substitute for an existing SAN, or 
even just a simple RAID array with a writeback cache.
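To make the arithmetic explicit, here is a minimal sketch (my numbers above, not a benchmark) of how per-commit write latency caps a single synchronous client:

```python
# Each transaction must reach stable storage before the next one is
# confirmed, so a single serial client is capped at 1 / write_latency
# transactions per second.

def max_tps(write_latency_s: float) -> float:
    """Upper bound on TPS for one client committing serially."""
    return 1.0 / write_latency_s

print(max_tps(0.005))  # 5 ms per write  -> ~200 TPS
print(max_tps(0.001))  # 1 ms per write  -> ~1000 TPS (the target)
```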

My first hope was that enabling the RBD cache would help - but it really 
doesn’t, because all the flushes (O_DIRECT writes) end up on the drives and not 
in the cache. Disabling barriers in the client helps, but that makes it not 
crash-consistent (unless one uses ext4 with journal_checksum etc.; I am going 
to test that soon).
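For reference, this is the client-side configuration I mean (option names per the RBD cache documentation; the sizes are just illustrative) - and even with all of this, every flush from the guest still has to hit the OSDs:

```ini
[client]
rbd cache = true
rbd cache size = 67108864                 ; 64 MB cache
rbd cache max dirty = 50331648            ; > 0 enables writeback mode
rbd cache writethrough until flush = true ; safety net for old guests
```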

Are there any plans to change this behaviour - i.e. make the cache a real 
writeback cache?

I know there are good reasons not to do this, and I commend the developers for 
designing the cache this way, but real-world workloads demand shortcuts from 
time to time. For example, MySQL’s InnoDB engine has an option to flush the log 
to disk only periodically instead of at every commit - and this is exactly the 
kind of thing I’m looking for. Not having every confirmed transaction/write on 
the disk is not a huge problem; having a b0rked filesystem is, so this should 
be safe as long as I/O order is preserved. Sadly, my database is not InnoDB, 
where I can tune something, but an enterprise behemoth that traditionally runs 
on FC arrays; it has no parallelism (that I could find) and always uses 
O_DIRECT for the txlog.
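For comparison, this is the InnoDB knob I mean (values per the MySQL documentation) - note it relaxes durability, not ordering, which is exactly the trade-off I’d like on the RBD side:

```ini
# my.cnf
[mysqld]
# 1 = flush the log at every commit (fully durable, the default)
# 2 = write at commit, fsync roughly once per second
# 0 = write and fsync roughly once per second
innodb_flush_log_at_trx_commit = 2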

(For the record: while the array is able to swallow 30K IOPS for a minute, 
once its cache is full it slows to ~3 IOPS, while Ceph happily gives the same 
200 IOPS forever. The bottom line is that you always need more disks or more 
cache, and your workload should always be able to run without the cache anyway 
- even enterprise arrays fail, and write cache is not always available, 
contrary to popular belief.)

Is there some option that we could use right now to turn on true writeback 
caching? Losing a few transactions is fine as long as ordering is preserved.
I was thinking “cache=unsafe”, but I have no idea whether I/O order is 
preserved with that.
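For concreteness, this is the libvirt form of what I mean (the pool/volume name is made up; cache='unsafe' makes QEMU drop guest flushes entirely, which is why ordering across a crash is exactly the open question):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='unsafe'/>
  <source protocol='rbd' name='volumes/my-db-volume'/>
  <target dev='vda' bus='virtio'/>
</disk>
```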
I already mentioned turning off barriers, which could be safe in some setups 
but needs testing.
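i.e. something like this inside the guest (illustrative ext4 mount options, not a recommendation - nobarrier trades crash safety for latency, and journal_checksum is the mitigation I want to test):

```shell
# inside the guest; device and mount point are examples
mount -o nobarrier,journal_checksum /dev/vda1 /var/lib/db
```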
Upgrading from Dumpling will probably help with scaling, but will it help write 
latency? I would need to get from 5ms/write to <1ms/write.
I investigated guest-side caching (EnhanceIO/flashcache), but that fails really 
badly when the guest or host crashes - lots of corruption. EnhanceIO in 
particular looked very nice and claims to respect barriers… not in my 
experience, though.

It might seem that what I want is evil, and it really is if you’re running a 
banking database, but for most people this is exactly what is missing to let 
their workloads run without some sort of ’80s SAN system in the datacentre. 
I think everyone here would appreciate that :-)

Thanks

Jan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com