On Mon, Aug 22, 2011 at 09:27:12AM -0500, Ryan Harper wrote:
> * Stefan Hajnoczi <stefa...@gmail.com> [2011-08-22 08:35]:
> > At KVM Forum Kevin, Christoph, and I had an opportunity to get
> > together for a Block Layer BoF. We went through the recent "roadmap"
> > mailing list thread and touched on each proposed feature.
> >
> > Here is the block layer roadmap wiki page:
> > http://wiki.qemu.org/BlockRoadmap
> >
> > Kevin: I have moved the runtime WCE toggling to QEMU 1.0 since you
> > mentioned you want it for the next release.
> >
> > My main take-away from the BoF was that integrating support for host
> > block devices and storage appliances will allow us to reduce the
> > amount of effort spent on image formats. In order to make image
> > formats support the desired features and performance we end up
> > implementing much of the storage stack and file systems in userspace -
> > code that is duplicated and cannot take advantage of the existing
> > storage stack.
>
> +1
>
> >
> > Storage management features are not just available in remote SAN and
> > NAS appliances anymore. For local storage, btrfs has file-level
> > clones and thin-dev is significantly improving LVM snapshots.
> >
> > Thin-dev is bringing a much more efficient and scalable snapshot model
> > to LVM. This device-mapper feature will make LVM attractive for high
> > performance I/O without giving up snapshot and clone features. It
> > also supports cloning off block devices that are not in the pool (e.g.
> > external storage, much like QEMU's backing files feature):
> > https://github.com/jthornber/linux-2.6/tree/thin-dev
> >
> > This will not replace image formats overnight because image formats
> > are still widely used and will continue to be a useful for
> > transferring and sharing disk images. But focussing on the larger
>
> Any thoughts on how to make this easily usable for LVM? If there were
> an export/import to/from file to LVM? is that sufficient? Anything
> like this in existence?

I forgot to mention a major advantage of a raw-oriented storage stack:
we need good support for raw + storage appliance anyway. Users want to
hook up their SAN or NAS just like they can with other hypervisors.
Time spent on image formats would be better spent fleshing out
integration with LVM, btrfs, SAN, NAS, and friends.

Back to import/export. It serves two purposes:

1. Efficient transport. Uploading and downloading image files in a
   compact form that represents zero blocks efficiently and perhaps
   compresses data.

2. Compatibility with other hypervisors and external tools. Here it's
   all about using a well-defined file format.

In order to pull off a raw-oriented storage stack we need to do
import/export well, so this is an area where we have to focus.

Image streaming is a good approach for import because it allows the VM
to start instantly (even before the image is fully imported). A
qemu-nbd process serves up image data and we stream into a logical
volume.

For export we can do a FUSE file system that presents logical volumes
as image files. That way existing applications can get at the data as
if they were real image files sitting on the file system. Sequential
read access is easy for all formats. Random read is more difficult but
should be doable for most formats (the exception would be
stream-compressed formats that are not designed for random access).

So moving to a raw-oriented storage stack does not mean we get rid of
image formats. We still need them, but they are outside the critical
I/O path. Their role changes since we no longer push features into the
formats.

Side note: iSCSI vs NBD came up during the BoF. Although NBD has not
seen maintenance or activity recently, it's perfectly possible to build
on it. The first feature we need is a flush command (so that NBD can do
non-O_DSYNC accesses for speed). At that point we have a bare-bones
remote block protocol that can be used for migration and for connecting
up userspace image formats.
iSCSI is more complex and better suited to permanent storage, whereas
NBD is simple but perhaps not a protocol we want to access data over
for a long period of time.

Stefan
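
P.S. To make the "represents zero blocks efficiently" part of the
transport point a bit more concrete, here is a rough Python sketch. It
is only an illustration under my own assumptions: the names
export_extents, import_extents and BLOCK_SIZE are made up, the
(offset, data) extent list is not any existing image format, and none
of this is QEMU code. The import side assumes the destination already
reads back as zeroes (as a freshly created thin volume would), so it
only has to write the non-zero blocks.

```python
import io

# Assumed granularity for the example, not taken from any real format.
BLOCK_SIZE = 64 * 1024


def export_extents(src, block_size=BLOCK_SIZE):
    """Yield (offset, data) for every block of src that is not all zeroes."""
    zero_block = bytes(block_size)
    offset = 0
    while True:
        data = src.read(block_size)
        if not data:
            break
        # A short final block is compared against a zero run of equal length.
        if data != zero_block[:len(data)]:
            yield offset, data
        offset += len(data)


def import_extents(extents, dst):
    """Write extents into dst, which is assumed to be pre-zeroed and
    already large enough (e.g. a new logical volume)."""
    for offset, data in extents:
        dst.seek(offset)
        dst.write(data)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for a raw image and a destination volume.
    raw = b"abcd" + bytes(8) + b"wxyz" + bytes(2)
    extents = list(export_extents(io.BytesIO(raw), block_size=4))
    dst = io.BytesIO(bytes(len(raw)))
    import_extents(extents, dst)
    print(len(extents), dst.getvalue() == raw)
```

Only the two non-zero blocks travel over the wire here. A real
transport format would also need to record the total image size and
would probably compress the data blocks, which this sketch leaves out.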