Stefan,
--On 21 June 2013 14:55:20 +0200 Stefan Hajnoczi <stefa...@gmail.com> wrote:
I understand the limitations with kernel block devices - their
setup/teardown is an extra step outside QEMU and privileges need to be
managed. That basically means you need to use a management tool like
libvirt to make it usable.
It's not just the management tool (we have one of those). Kernel
devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs
etc. by hostile guests can cause issues.
If you have those problems then something is wrong:
LVM should definitely not be scanning guest devices.
As for disk UUIDs, they come from the SCSI target which is under your
control, right? In fact, you can assign different serial numbers to
drives attached in QEMU, the host serial number will not be used.
Therefore, there is a clean separation there and guests do not control
host UUIDs.
The one true weakness here is that Linux reads partition tables
automatically. Not sure if there's a way to turn it off or how hard
it would be to add that.
Most things are work-roundable, but the whole thing is 'fail open'.
See
http://lwn.net/Articles/474067/
for example (not the greatest example, as I've failed to find
the lwn.net article that talked about malicious disk labels).
When a disk is inserted (guest disk mounted), its partition table
gets scanned, and various other stuff happens from udev triggers,
based on the UUID of the disk (I believe the relevant UUID is
actually stored in the file system), and the UUID/label on the GPT.
lvm scanning is also done by default, as is dm stuff. The
same problem happens (in theory) with disk labels. Yes, you
can disable this, but making (e.g.) dm and lvm work on attached
scsi disks but not iscsi disks, in general, when you don't know
the iscsi or scsi vendor is non-trivial (yes, I've done it).
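As an illustration of the kind of classification involved, here is a minimal sketch that separates iSCSI-backed devices from locally attached ones by name. It assumes the udev by-path naming convention "ip-<portal>:<port>-iscsi-<iqn>-lun-<n>"; the function names are hypothetical, and the pattern should be checked against what a given host actually produces before feeding the result into an LVM or dm exclusion list.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: iSCSI LUNs appear under /dev/disk/by-path with names such as
# "ip-192.168.1.10:3260-iscsi-iqn.2013-06.com.example:vol1-lun-0",
# optionally followed by "-partN" for partitions.
ISCSI_BY_PATH = re.compile(r"^ip-.+-iscsi-.+-lun-\d+(-part\d+)?$")


def is_iscsi_by_path(name: str) -> bool:
    """Return True if a by-path entry name looks like an iSCSI LUN."""
    return ISCSI_BY_PATH.match(name) is not None


def split_by_transport(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split by-path names into (iscsi, other) without touching the devices."""
    iscsi, other = [], []
    for name in names:
        (iscsi if is_iscsi_by_path(name) else other).append(name)
    return iscsi, other
```

This only looks at names, so it does not depend on the SCSI vendor string reported by the target, which is the part that is hard to know in advance.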
I have not found a way yet to avoid reading the partition table at
all (which would be useful).
There used to be other problems when iscsi is used in anger. One,
for instance, is that the default iscsi client scans the scsi
bus (normally unnecessarily) at the drop of a hat. Even if you
know all information about what you are mounting, it scans it.
This leads to an O(n^2) problem starting VMs - several minutes.
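A rough sketch of how the quadratic cost can arise. The model is an assumption for illustration, not a measurement: each new attach is taken to trigger a rescan that touches every LUN attached so far, whereas with scanning disabled only the new LUN is touched.

```python
def total_probes_with_rescan(n_vms: int) -> int:
    """Probes issued if attaching disk k rescans all k disks attached so far."""
    total = 0
    for attached in range(1, n_vms + 1):
        total += attached  # the k-th attach probes k LUNs
    return total  # equals n(n+1)/2, i.e. O(n^2)


def total_probes_without_rescan(n_vms: int) -> int:
    """Probes issued if each attach touches only the LUN it was told about."""
    return n_vms  # O(n)
```

Under this model, 100 disks cost 5050 probes with rescanning and 100 without.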
Again, I know how to fix this - if you are interested:
https://github.com/abligh/open-iscsi/tree/add-no-scanning-option
All this is solved by using the inbuilt iscsi client.
That's true, but I'd argue that is a little different because nothing
blocks on the page cache (it being in RAM). You don't get the situation
where the task sleeps awaiting data (from the page cache), the data
arrives, and the task then needs to be scheduled in. I will admit
to a degree of handwaving here, as I hadn't realised the claim that qemu+rbd
was more efficient than qemu+blockdevice+kernelrbd was controversial.
It may or may not be more efficient. Unless there is some performance
analysis, we don't know how big the difference is or why.
Sure. I hope Sage comes back on this one.
but if there's really a case for it with performance profiles then I
guess it would be necessary. But we should definitely get feedback from
the Ceph folks too.
The specific problem we are trying to solve (in case that's not
obvious) is the non-locality of data read/written by ceph. Whilst
you can use placement to localise data to the rack level, even if
one of your OSDs is in the machine you end up waiting on network
traffic. That is apparently hard to solve inside Ceph.
I'm not up to speed on Ceph architecture. Is this because you need to
visit a metadata server before you access the storage? Even when the
data is colocated on the same machine, you'll need to ask the metadata
server first?
Well, Sage would be the expert, but I understand the problem is
simpler than that. Firstly, in order to mark the write as complete
it has to be written to at least a quorum of OSDs, and a quorum is
larger than one. Hence at least one write is non-local. Secondly,
Ceph's placement group feature does not (so $inktank guy told me)
work well for localising at the level of particular servers; so
even if somehow you made ceph happy with writing just one replica
and saying it was done (and doing the rest in the background), you'd
be hard pressed to ensure the first replica written was always
(or nearly always) on a local spindle. Hence my idea of adding a
layer in front which would acknowledge the write - even if
a flush/fua had come in - on the basis it had been written to
persistent storage and it can recover on a reboot after this
point, then go sort the rest out in the background.
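The layering idea above can be sketched roughly as follows. This is a toy model under stated assumptions, not a design for QEMU or Ceph: the class and method names are hypothetical, the journal is an in-memory deque standing in for local persistent storage, and backend_write stands in for the slow replicated write.

```python
from collections import deque


class LocalJournalLayer:
    """Hypothetical write-back layer that acks once a write is journalled locally."""

    def __init__(self, backend_write):
        # backend_write(offset, data): the slow, replicated write (assumed callable).
        self.backend_write = backend_write
        # Stand-in for a local persistent journal; a real one would live on
        # local storage and be synced before write() returns.
        self.journal = deque()

    def write(self, offset, data):
        """Journal the write locally and acknowledge without waiting on the network."""
        self.journal.append((offset, bytes(data)))
        return True

    def flush(self):
        """Flush/FUA: every acknowledged write is already in the journal."""
        return True

    def drain(self, max_entries=None):
        """Background step: replay journal entries to the backend in order."""
        done = 0
        while self.journal and (max_entries is None or done < max_entries):
            offset, data = self.journal[0]
            self.backend_write(offset, data)
            self.journal.popleft()  # drop only after the backend accepted it
            done += 1
        return done

    def recover(self):
        """After a reboot, replay whatever is still in the journal."""
        return self.drain()
```

The sketch leaves out reads, which would have to be served from the journal for data not yet drained, and it ignores what happens if the local machine is lost before the journal has been replayed.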
--
Alex Bligh