On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't
> >>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> >>> cache access to block devices (in the host layer), and with rbd
> >>> (for instance) there is no access to a block device at all.
> >>> block/rbd.c simply calls librbd, which calls librados, etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would
> >>> be introduced by using kernel rbd devices rather than librbd.
> >>
> >> I understand the limitations with kernel block devices - their
> >> setup/teardown is an extra step outside QEMU and privileges need to
> >> be managed. That basically means you need to use a management tool
> >> like libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here. Do you have
> >> profiles that show kernel rbd is a bottleneck due to context
> >> switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback
> >> and no one has suggested reimplementing the page cache inside QEMU
> >> for better performance.
> >>
> >> Also, how do you want to manage the QEMU page cache with multiple
> >> guests running? They are independent and know nothing about each
> >> other. Their process memory consumption will be bloated and the
> >> kernel memory management will end up having to sort out who gets to
> >> stay in physical memory.
> >>
> >> You can see I'm skeptical of this and think it's premature
> >> optimization, but if there's really a case for it with performance
> >> profiles then I guess it would be necessary. But we should
> >> definitely get feedback from the Ceph folks too.
> >>
> >> I'd like to hear from the Ceph folks what their position on kernel
> >> rbd vs librados is. Which one do they recommend for QEMU guests and
> >> what are the pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big
> > win for qemu + rbd users.
> >
> > There are a few important issues with librbd vs kernel rbd:
> >
> >  * librbd tends to get new features more quickly than the kernel rbd
> >    (although now that layering has landed in 3.10 this will be less
> >    painful than it was).
> >
> >  * Using kernel rbd means users need bleeding-edge kernels, a
> >    non-starter for many orgs that are still running things like RHEL.
> >    Bug fixes are difficult to roll out, etc.
> >
> >  * librbd has an in-memory cache that behaves similarly to an HDD's
> >    cache (e.g., it forces writeback on flush). This improves
> >    performance significantly for many workloads. Of course, having a
> >    bcache-like layer mitigates this.
> >
> > I'm not really sure what the best path forward is. Putting the
> > functionality in qemu would benefit lots of other storage backends;
> > putting it in librbd would capture various other librbd users (xen,
> > tgt, and future users like hyper-v); and using new kernels works
> > today but creates a lot of friction for operations.
> >
> I think I can share some implementation details about a persistent
> cache for guests, because 1) Sheepdog has a persistent object-oriented
> cache exactly like what Alex described, 2) Sheepdog and Ceph's RADOS
> both provide volumes on top of an object store, and 3) Sheepdog chose a
> persistent cache on local disk while Ceph chose an in-memory cache
> approach.
>
> The main motivation of the object cache is to reduce network traffic
> and improve performance; the cache can be seen as a hard disk's
> internal write cache, which modern kernels support well.
>
> For a background introduction, Sheepdog's object cache works similarly
> to the kernel's page cache, except that we cache a 4M object of a
> volume on disk while the kernel caches a 4K page of a file in memory.
> We use an LRU list per volume for reclaim and a dirty list to track
> dirty objects for writeback. We always read ahead a whole object if it
> is not cached.
>
> The benefits of a disk cache over a memory cache, in my opinion, are:
> 1) the VM gets smoother performance because the cache doesn't consume
>    memory (if memory is at the high watermark, the latency of guest IO
>    will be very high)
> 2) smaller memory requirements, leaving all the memory to the guest
> 3) objects from the base can be shared by all its child snapshots &
>    clones
> 4) a more efficient reclaim algorithm, because the sheep daemon knows
>    better than the kernel's dm-cache/bcache/flashcache
> 5) can easily take advantage of an SSD as a cache backend
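
For concreteness, the bookkeeping described above could be sketched
roughly as follows: the volume is cached at 4M object granularity, a
miss reads the whole object ahead, a guest write only dirties the cached
copy, a guest flush writes the dirty objects back, and reclaim evicts
the least recently used object (writing it back first if it is dirty).
Everything below is invented for illustration - the per-volume LRU list
is approximated with an access counter and the dirty list with a
per-object flag - so treat it as a sketch of the idea, not as Sheepdog's
actual code:

/* Toy sketch of a per-volume object cache; all names are invented for
 * illustration and this is not Sheepdog's (or Ceph's) actual code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define OBJECT_SIZE   (4UL << 20)  /* cache granularity: one 4M object */
#define CACHE_OBJECTS 8            /* tiny capacity, to force reclaim  */

struct cached_object {
    bool     valid;
    bool     dirty;       /* must be written back before eviction      */
    uint64_t oid;         /* object index within the volume            */
    uint64_t last_used;   /* LRU stand-in: global access counter       */
    char    *data;        /* in the real thing: a file on local disk   */
};

struct object_cache {
    struct cached_object slots[CACHE_OBJECTS];
    uint64_t tick;        /* monotonically increasing "time"           */
};

/* Stand-ins for object I/O against the cluster over the network. */
static void backend_read(uint64_t oid, char *buf) { (void)oid; memset(buf, 0, OBJECT_SIZE); }
static void backend_write(uint64_t oid, const char *buf) { (void)oid; (void)buf; }

static struct cached_object *cache_find(struct object_cache *c, uint64_t oid)
{
    for (int i = 0; i < CACHE_OBJECTS; i++)
        if (c->slots[i].valid && c->slots[i].oid == oid)
            return &c->slots[i];
    return NULL;
}

/* Pick a free slot, or evict the LRU object (writing it back if dirty). */
static struct cached_object *cache_reclaim(struct object_cache *c)
{
    struct cached_object *victim = &c->slots[0];

    for (int i = 0; i < CACHE_OBJECTS; i++) {
        if (!c->slots[i].valid)
            return &c->slots[i];
        if (c->slots[i].last_used < victim->last_used)
            victim = &c->slots[i];
    }
    if (victim->dirty)
        backend_write(victim->oid, victim->data);
    victim->valid = false;
    return victim;
}

/* Look up an object; on a miss, read the whole 4M object ahead. */
static struct cached_object *cache_get(struct object_cache *c, uint64_t oid)
{
    struct cached_object *obj = cache_find(c, oid);

    if (!obj) {
        obj = cache_reclaim(c);
        if (!obj->data)
            obj->data = malloc(OBJECT_SIZE);
        backend_read(oid, obj->data);
        obj->oid = oid;
        obj->dirty = false;
        obj->valid = true;
    }
    obj->last_used = ++c->tick;
    return obj;
}

/* Guest write: update the cached copy only; writeback is deferred. */
static void cache_write(struct object_cache *c, uint64_t oid,
                        uint64_t off, const void *buf, size_t len)
{
    struct cached_object *obj = cache_get(c, oid);

    memcpy(obj->data + off, buf, len);
    obj->dirty = true;
}

/* Guest flush: like a disk's write cache, push all dirty objects out. */
static void cache_flush(struct object_cache *c)
{
    for (int i = 0; i < CACHE_OBJECTS; i++) {
        if (c->slots[i].valid && c->slots[i].dirty) {
            backend_write(c->slots[i].oid, c->slots[i].data);
            c->slots[i].dirty = false;
        }
    }
}

int main(void)
{
    struct object_cache cache = {0};

    cache_write(&cache, 7, 0, "hello", 5);
    cache_flush(&cache);  /* dirty object 7 is written back here */
    return 0;
}

The point is just that readahead, reclaim and writeback are driven by
the caching daemon's own bookkeeping on local disk rather than by the
kernel page cache, which is what the comparison with
dm-cache/bcache/flashcache above is about.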
It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host. That way it
can do things like share the cached snapshot data between volumes.

This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.

Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before
adding one to Ceph.

Stefan