On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't
> >>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> >>> cache access to block devices (in the host layer), and with rbd
> >>> (for instance) there is no access to a block device at all.
> >>> block/rbd.c simply calls librbd, which calls librados, etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would
> >>> be introduced by using kernel rbd devices rather than librbd.
> >>
> >> I understand the limitations with kernel block devices - their
> >> setup/teardown is an extra step outside QEMU and privileges need to
> >> be managed. That basically means you need to use a management tool
> >> like libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here. Do you have
> >> profiles that show kernel rbd is a bottleneck due to context
> >> switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback
> >> and no one has suggested reimplementing the page cache inside QEMU
> >> for better performance.
> >>
> >> Also, how do you want to manage the QEMU page cache with multiple
> >> guests running? They are independent and know nothing about each
> >> other. Their process memory consumption will be bloated and the
> >> kernel memory management will end up having to sort out who gets to
> >> stay in physical memory.
> >>
> >> You can see I'm skeptical of this and think it's premature
> >> optimization, but if there's really a case for it with performance
> >> profiles then I guess it would be necessary. But we should
> >> definitely get feedback from the Ceph folks too.
> >>
> >> I'd like to hear from the Ceph folks what their position on kernel
> >> rbd vs librados is. Which one do they recommend for QEMU guests and
> >> what are the pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big
> > win for qemu + rbd users.
> >
> > There are a few important issues with librbd vs kernel rbd:
> >
> >  * librbd tends to get new features more quickly than the kernel rbd
> >    (although now that layering has landed in 3.10 this will be less
> >    painful than it was).
> >
> >  * Using kernel rbd means users need bleeding-edge kernels, a
> >    non-starter for many orgs that are still running things like RHEL.
> >    Bug fixes are difficult to roll out, etc.
> >
> >  * librbd has an in-memory cache that behaves similarly to an HDD's
> >    cache (e.g., it forces writeback on flush). This improves
> >    performance significantly for many workloads. Of course, having a
> >    bcache-like layer mitigates this.
> >
> > I'm not really sure what the best path forward is. Putting the
> > functionality in qemu would benefit lots of other storage backends;
> > putting it in librbd would capture various other librbd users (xen,
> > tgt, and future users like hyper-v); and using new kernels works
> > today but creates a lot of friction for operations.
> >
> I think I can share some implementation details about a persistent
> cache for guests, because 1) Sheepdog has a persistent object-oriented
> cache exactly like what Alex described, 2) Sheepdog and Ceph's RADOS
> both provide volumes on top of an object store, and 3) Sheepdog chose a
> persistent cache on local disk while Ceph chose an in-memory cache
> approach.
>
> The main motivation of the object cache is to reduce network traffic
> and improve performance; the cache can be seen as a hard disk's
> internal write cache, which modern kernels support well.
>
> For a background introduction, Sheepdog's object cache works similarly
> to the kernel's page cache, except that we cache a 4M object of a
> volume on disk while the kernel caches a 4K page of a file in memory.
> We use an LRU list per volume for reclaim and a dirty list to track
> dirty objects for writeback. We always read ahead a whole object if it
> is not cached.
>
> The benefits of a disk cache over a memory cache, in my opinion, are:
> 1) the VM gets smoother performance because the cache doesn't consume
>    memory (if memory is at the high watermark, the latency of guest IO
>    will be very high)
> 2) smaller memory requirements, leaving all the memory to the guest
> 3) objects from the base can be shared by all its child snapshots &
>    clones
> 4) a more efficient reclaim algorithm, because the sheep daemon knows
>    better than the kernel's dm-cache/bcache/flashcache
> 5) can easily take advantage of an SSD as a cache backend
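
For concreteness, the bookkeeping described above could be sketched
roughly as follows: the volume is cached at 4M object granularity, a
miss reads the whole object ahead, a guest write only dirties the cached
copy, a guest flush writes the dirty objects back, and reclaim evicts
the least recently used object (writing it back first if it is dirty).
Everything below is invented for illustration - the per-volume LRU list
is approximated with an access counter and the dirty list with a
per-object flag - so treat it as a sketch of the idea, not as Sheepdog's
actual code:

/* Toy sketch of a per-volume object cache; all names are invented for
 * illustration and this is not Sheepdog's (or Ceph's) actual code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define OBJECT_SIZE   (4UL << 20)  /* cache granularity: one 4M object */
#define CACHE_OBJECTS 8            /* tiny capacity, to force reclaim  */

struct cached_object {
    bool     valid;
    bool     dirty;       /* must be written back before eviction      */
    uint64_t oid;         /* object index within the volume            */
    uint64_t last_used;   /* LRU stand-in: global access counter       */
    char    *data;        /* in the real thing: a file on local disk   */
};

struct object_cache {
    struct cached_object slots[CACHE_OBJECTS];
    uint64_t tick;        /* monotonically increasing "time"           */
};

/* Stand-ins for object I/O against the cluster over the network. */
static void backend_read(uint64_t oid, char *buf) { (void)oid; memset(buf, 0, OBJECT_SIZE); }
static void backend_write(uint64_t oid, const char *buf) { (void)oid; (void)buf; }

static struct cached_object *cache_find(struct object_cache *c, uint64_t oid)
{
    for (int i = 0; i < CACHE_OBJECTS; i++)
        if (c->slots[i].valid && c->slots[i].oid == oid)
            return &c->slots[i];
    return NULL;
}

/* Pick a free slot, or evict the LRU object (writing it back if dirty). */
static struct cached_object *cache_reclaim(struct object_cache *c)
{
    struct cached_object *victim = &c->slots[0];

    for (int i = 0; i < CACHE_OBJECTS; i++) {
        if (!c->slots[i].valid)
            return &c->slots[i];
        if (c->slots[i].last_used < victim->last_used)
            victim = &c->slots[i];
    }
    if (victim->dirty)
        backend_write(victim->oid, victim->data);
    victim->valid = false;
    return victim;
}

/* Look up an object; on a miss, read the whole 4M object ahead. */
static struct cached_object *cache_get(struct object_cache *c, uint64_t oid)
{
    struct cached_object *obj = cache_find(c, oid);

    if (!obj) {
        obj = cache_reclaim(c);
        if (!obj->data)
            obj->data = malloc(OBJECT_SIZE);
        backend_read(oid, obj->data);
        obj->oid = oid;
        obj->dirty = false;
        obj->valid = true;
    }
    obj->last_used = ++c->tick;
    return obj;
}

/* Guest write: update the cached copy only; writeback is deferred. */
static void cache_write(struct object_cache *c, uint64_t oid,
                        uint64_t off, const void *buf, size_t len)
{
    struct cached_object *obj = cache_get(c, oid);

    memcpy(obj->data + off, buf, len);
    obj->dirty = true;
}

/* Guest flush: like a disk's write cache, push all dirty objects out. */
static void cache_flush(struct object_cache *c)
{
    for (int i = 0; i < CACHE_OBJECTS; i++) {
        if (c->slots[i].valid && c->slots[i].dirty) {
            backend_write(c->slots[i].oid, c->slots[i].data);
            c->slots[i].dirty = false;
        }
    }
}

int main(void)
{
    struct object_cache cache = {0};

    cache_write(&cache, 7, 0, "hello", 5);
    cache_flush(&cache);  /* dirty object 7 is written back here */
    return 0;
}

The point is just that readahead, reclaim and writeback are driven by
the caching daemon's own bookkeeping on local disk rather than by the
kernel page cache, which is what the comparison with
dm-cache/bcache/flashcache above is about.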
It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host. That way it
can do things like share the cached snapshot data between volumes.

This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.

Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before
adding one to Ceph.

Stefan