Hi Wouter,
Some of this mess may be partially my fault, but I have not been
following the NBD extension proposals up until this point.
Are you familiar with the genesis behind this idea and what we are
trying to accomplish in general?
We had the thought to propose an extension roughly similar to SCSI's
'get lba status', but the existing status bits there did not correlate
semantically to what we are hoping to convey. There was some debate over
whether it would be an abuse of the protocol to attempt to use them as such.
For a quick recap, get lba status appears to offer three canonical statuses:
0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"
My interpretation of 'mapped' was simply that the extent is physically
allocated, and of 'deallocated' that it is not.
(I uh, am not actually clear on what anchored means exactly.)
Either way, we felt at the time that it would be wrong to propose an
analogous command for NBD and then immediately abuse the existing
semantics; hence a new command like -- but not identical to -- the SCSI one.
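For reference, those three statuses as constants (SBC defines only the
numeric values; the identifier names here are mine, purely illustrative):

    /* GET LBA STATUS descriptor status values, per SBC. The values are
     * from the spec; the names are made up for this sketch. */
    enum scsi_lba_status {
        SCSI_LBA_STATUS_MAPPED      = 0x0, /* mapped, or unknown state */
        SCSI_LBA_STATUS_DEALLOCATED = 0x1, /* deallocated */
        SCSI_LBA_STATUS_ANCHORED    = 0x2, /* anchored */
    };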
On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
Hi Vladimir,
Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
hackish, bolted onto NBD so you could use it but without defining
everything in the NBD protocol. "We're reading some data, but it's not
about you." That didn't feel right.
- There were a number of questions still unanswered (you're answering a
few below, so that's good).
For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
With the availability of sparse storage formats, it is often necessary
to query the status of a particular range and read only those blocks of
data that are actually present on the block device.
To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.
There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors). Future NBD extensions may add commands
to control dirtiness through NBD.
Since the NBD protocol has no notion of block size, and to mimic the SCSI
"GET LBA STATUS" command more closely, we have chosen to return a list
of extents in the response to the NBD_CMD_BLOCK_STATUS command, instead
of a bitmap.
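For illustration, the descriptor layout this implies (inferred from the
v2 changelog quoted further down; a sketch, not the normative wording):

    #include <stdint.h>

    /* One extent descriptor in a BLOCK_STATUS structured reply chunk,
     * as proposed: 8 bytes, no padding. Fields are big-endian on the
     * wire; this struct is only a sketch of the layout. */
    struct nbd_block_descriptor {
        uint32_t length; /* length of the extent this entry covers */
        uint32_t status; /* bitwise-OR of status flags; 0 always sane */
    };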
CC: Pavel Borzenkov <pborzen...@virtuozzo.com>
CC: Denis V. Lunev <d...@openvz.org>
CC: Wouter Verhelst <w...@uter.be>
CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Kevin Wolf <kw...@redhat.com>
CC: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Eric Blake <ebl...@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
---
v3:
Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
removing the restriction that the sum of the status descriptor lengths must
equal the requested length; i.e., let's permit the server to reply with less
data than requested if it wants.
Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.
Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as bit 8 is
now NBD_FLAG_CAN_MULTI_CONN in the master branch.
Right.
And, finally, I've rebased this onto the current state of the
extension-structured-reply branch (which itself should be rebased onto
master, IMHO).
Probably a good idea, given the above.
With this resend I just want to continue the discussion started about half
a year ago. Here is a summary of some questions and ideas from the v2
discussion:
1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
A: This is all for read-only disks, so the data is static and unchangeable.
I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.
We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.
Agree with you here. "This was correct when I sent it, but it's up to
you to ensure that nothing would have invalidated that in the meantime"
is fine semantically to me.
Of course in our implementation, we intend to only export essentially
read-only snapshots of data, so we don't personally expect to run into
any of these kinds of semantic problems.
2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
A: 1: the server replies with status descriptors of any size; granularity
is hidden from the client
2: dirty/allocated requests are separate and unrelated to each
other, so their granularities do not interfere
Not entirely sure anymore what this is about?
We have a concept of granularity for dirty tracking; consider it a block
size.
I don't think it's necessarily relevant to NBD, except for the case of
false positives. Consider the case of a 512-byte block that is assumed
dirty simply because it's adjacent to actually dirty data.
That may have some meaning for how we write the NBD spec; i.e. the
meaning of the status bit changes from "This is definitely modified" to
"This was possibly modified," due to limitations in the management of
this information by the server.
If we expose the granularity information via NBD, it at least makes it
clear how fuzzy the results presented may be. Otherwise, it's not really
required.
[Again, as seen below, a user-defined bit would be plenty sufficient...!]
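To illustrate the false-positive effect of a tracking granularity (a toy
sketch, not QEMU code):

    #include <stdint.h>

    /* With tracking granularity G, a single dirty byte marks its whole
     * G-sized cell dirty, so reported extents can only start and end on
     * cell boundaries; adjacent clean bytes in the same cell become
     * false positives ("possibly modified"). */
    static void dirty_cell_bounds(uint64_t dirty_off, uint64_t granularity,
                                  uint64_t *cell_start, uint64_t *cell_end)
    {
        *cell_start = dirty_off - (dirty_off % granularity);
        *cell_end   = *cell_start + granularity;
    }
    /* E.g. one dirty byte at offset 70000 with 64 KiB granularity reports
     * [65536, 131072) as dirty, including 65535 untouched bytes. */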
3. Q: Selecting the dirty bitmap to export.
A: Several variants:
1: the id of the bitmap is in the flags field of the request
pros: - simple
cons: - it's a hack; the flags field is meant for other uses
      - we'd have to map bitmap names to these "ids"
2: introduce extended NBD requests with variable length and exploit this
feature for the BLOCK_STATUS command, specifying the bitmap identifier
(see the sketch after this discussion)
pros: - looks like the right way
cons: - we'd have to create an additional extension
      - possibly we'd have to create a map,
        {<QEMU bitmap name> <=> <NBD bitmap id>}
3: an external tool selects which bitmap to export. So, in the case of QEMU,
it would be something like a QMP command block-export-dirty-bitmap.
pros: - simple
      - we can extend it to behave like (2) later
cons: - an additional QMP command to implement (possibly the lesser evil)
note: Hmm, the external tool could also choose between allocated and dirty
data, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).
To avoid this, we should have a clear definition of what the reply means
*by default*, but then we can add a note that clients and servers can
possibly define other meanings out of band if they want to.
I had personally only ever considered #3, where a command would be
issued to QEMU to begin offering NBD data for some point in time as
associated with a particular reference/snapshot/backup/etc. This leaves
it up to the out-of-band client to order up the right data.
That does make choosing a meaningful name for the status bits a bit
more difficult...
[...but reading ahead, a 'user defined' bit would fit the bill just fine.]
When Vladimir authored a "persistence" feature for bitmaps to be stored
alongside QCOW2 files, we had difficulty describing exactly what a dirty
bit meant for the data -- ultimately it is reliant on external
information. The meaning is simply but ambiguously, "The data associated
with this bit has been changed since the last time the bit was reset."
We don't record or stipulate the conditions for a reset.
(For our purposes, a reset would occur during a full or incremental backup.)
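For what it's worth, variant 2 above would amount to something like the
following on the wire (purely illustrative; no such extended request is
specified anywhere):

    #include <stdint.h>

    /* Illustrative only: a variable-length request that names the
     * bitmap inline, as variant 2 would require. */
    struct nbd_extended_block_status_request {
        /* ... the usual NBD request header fields (magic, flags, type,
           handle, offset, length) would precede this ... */
        uint32_t bitmap_name_length; /* length of the name that follows */
        char     bitmap_name[];      /* e.g. a QEMU dirty bitmap name */
    };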
4. Q: Shouldn't get_{allocated,dirty} be separate commands?
cons: two commands with almost the same semantics and similar mechanics?
pros: there is a good case for separating the clearly defined,
native-to-block-devices GET_BLOCK_STATUS from user-driven and effectively
undefined data called 'dirtiness'.
Yeah, having them separate commands might be a bad idea indeed.
5. The number of status descriptors sent by the server should be restricted.
Variants:
1: just allow the server to restrict this as it wants (which was done in v3)
2: (not excluding 1) the client somehow specifies the maximum number
of descriptors
2.1: add a command flag which requests only one descriptor
(otherwise, no restriction from the client)
2.2: again, introduce extended NBD requests, and add a field to
specify this maximum
I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.
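Server-side, either variant reduces to capping the descriptor count
before building the reply chunk (a sketch; the flag name and bit value
are hypothetical):

    #include <stdint.h>

    #define NBD_CMD_FLAG_ONE_DESCRIPTOR (1 << 1) /* hypothetical bit */

    /* How many descriptors the server may send, given its own limit
     * and an optional client request for a single one. */
    static uint32_t max_descriptors(uint16_t cmd_flags,
                                    uint32_t server_limit)
    {
        if (cmd_flags & NBD_CMD_FLAG_ONE_DESCRIPTOR)
            return 1;        /* variant 2.1: client wants one only */
        return server_limit; /* variant 1: server's own restriction */
    }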
6. Q: What to do with unspecified flags (in request/reply)?
A: I think the normal approach is to make them reserved. (The server should
return EINVAL if it finds unknown bits; the client should consider a reply
with unknown bits an error.)
Right, probably best to do that, yes.
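In code terms that is just a mask check on both sides (a sketch; the set
of known bits is hypothetical and depends on what gets standardized):

    #include <errno.h>
    #include <stdint.h>

    #define KNOWN_REQUEST_FLAGS 0x0003u /* hypothetical defined bits */

    /* Server side: reject requests that set reserved (unknown) bits. */
    static int check_request_flags(uint16_t flags)
    {
        if (flags & ~KNOWN_REQUEST_FLAGS)
            return -EINVAL; /* reserved bit set: hard error */
        return 0;
    }
    /* The client applies the same test to reply flags and treats a
     * nonzero result as a protocol error. */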
======
Also, an idea on 2-4:
Since we say that dirtiness is unknown to NBD, and that an external tool
should specify, manage, and understand which data is actually
transmitted, why not just call it user data and leave the status field
of the reply chunk unspecified in this case?
So, I propose one flag for NBD_CMD_BLOCK_STATUS:
NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
Eric's 'Block provisioning status' paragraph. If it is set, we just
leave the status field to some external... protocol? Who knows what
this user data is.
Yes, this sounds like a reasonable approach.
I'd be pretty happy (personally) with some user-defined bits. It keeps a
lot of the ambiguity of exactly what we're trying to convey out of the
NBD spec, which is nice.
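A sketch of how that flag would split the interpretation (the constant's
bit value here is hypothetical):

    #include <stdint.h>

    #define NBD_FLAG_STATUS_USER (1 << 0) /* proposed; value made up */

    /* With the flag clear, status bits mean provisioning state as the
     * spec defines; with it set, NBD assigns them no meaning and the
     * out-of-band tooling owns the interpretation. */
    static const char *status_domain(uint16_t cmd_flags)
    {
        return (cmd_flags & NBD_FLAG_STATUS_USER)
            ? "user-defined (out-of-band semantics)"
            : "block provisioning status (spec-defined)";
    }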
Note: I'm not sure that I like this (my own) proposal. It's just an
idea; maybe someone will like it. And I think it represents what we
are trying to do more honestly.
Indeed.
Note 2: the next step of generalization would be NBD_CMD_USER, with a
variable request size, a structured reply, and no definition :)
Well, er, no please, if we can avoid it :-)
Another idea, about backups themselves:
Why do we need allocated/zero status for backup? IMHO we don't.
Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)
I can understand a "has this changed since time X" request, which the
"dirty" thing seems to want to be. Whether something is allocated is
just a special case of that.
Actually, come to think of that. What is the exact use case for this
thing? I understand you're trying to create incremental backups of
things, which would imply you don't write from the client that is
getting the block status thingies, right? If so, how about:
Essentially you can create a bitmap object in-memory in QEMU and then
associate it with a drive. It records changes to the drive, and QEMU can
be instructed to write out any changes since the last backup to disk,
while clearing the bits of the bitmap.
It doesn't record a specific point in time; that point is just implicit,
namely the last time you cleared the bitmap -- usually the last incremental
or full backup you've made.
So it does describe "blocks changed since time X," it's just that we
don't really know exactly when time X was.
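Conceptually, something like this toy model (not QEMU's actual bitmap
code; sizes and names are made up):

    #include <stdint.h>

    #define CHUNKS 1024 /* toy size: one bit per tracked chunk */

    /* Writes set bits; a backup reads the set bits out and clears
     * them, implicitly starting a new "time X" epoch. */
    struct dirty_bitmap { uint8_t bits[CHUNKS / 8]; };

    static void record_write(struct dirty_bitmap *b, unsigned chunk)
    {
        b->bits[chunk / 8] |= 1u << (chunk % 8);
    }

    static int test_and_clear(struct dirty_bitmap *b, unsigned chunk)
    {
        int dirty = (b->bits[chunk / 8] >> (chunk % 8)) & 1;
        b->bits[chunk / 8] &= ~(1u << (chunk % 8));
        return dirty;
    }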
This is all well and dandy, except there is some desire from third
parties to be able to ask QEMU about this dirty block information -- to
be able to see this bitmap, more or less. We already use NBD for
exporting data, and instead of inventing a new side-band, we decided
that we wanted to use NBD to let external users get this information.
What exactly they do with that info is beyond QEMU's scope.
- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
snapshots. Not required, optional, includes a machine-readable form,
not defined by NBD, which explains what the snapshot is about (e.g., a
qemu json file). The "base" version of that is just "allocation
status", and is implied (i.e., you don't need to run
NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
allocation status).
We don't necessarily have /snapshots/ per se, but we do have some block
device and one or more bitmaps describing deltas to one or more backups
that we do not necessarily have access to.
e.g. a given bitmap may describe a delta to an off-site backup. We know
the delta, but do not maintain any meaningful handle to the given
off-site backup, including name, URI, or date.
QEMU leaves this association up to upper management, and provides only
IDs to facilitate any additional correlation.
We could offer up descriptions of these bitmaps in response to such a
command, but they're not ... quite snapshots. It is more the case that
we use them to create snapshots.
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
which tell you what the status of a block of data is for each of the
relevant snapshots that we know about.
Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without
having to say "this could be anything", and without requiring
connectivity over two ports for this to be useful (e.g., you could store
the machine-readable form of the snapshot description into your backup
program and match what they mean with what you're interested in at
restore time, etc).
This wouldn't work if you're interested in new snapshots that get
created once we've already moved into transmission, but hey.
Thoughts?
Full backup: just do a structured read - it will show us which chunks
may be treated as zeroes.
Right.
Incremental backup: get the dirty bitmap (somehow, for example through
the user-defined part of the proposed command), then, for the dirty blocks,
read them through structured reads, so information about zero/unallocated
areas is available there as well.
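Putting the pieces together, that incremental flow might look like this
on the client (hypothetical helpers again; buffer management and error
handling elided):

    #include <stdint.h>

    struct nbd_conn;
    struct nbd_extent { uint32_t length; uint32_t status; };

    /* Hypothetical stand-ins for a client library; not a real API. */
    int  nbd_block_status_dirty(struct nbd_conn *c, uint64_t off,
                                uint64_t len, struct nbd_extent *out);
    int  nbd_structured_read(struct nbd_conn *c, uint64_t off,
                             uint32_t len, void *buf);
    void backup_write(uint64_t off, uint32_t len, const void *buf);

    /* Walk the dirty extents and only read the dirty ones; the
     * structured read itself reports which parts are holes/zeroes.
     * buf is assumed large enough for any extent. */
    static void incremental_backup(struct nbd_conn *c, uint64_t size,
                                   void *buf)
    {
        uint64_t off = 0;
        while (off < size) {
            struct nbd_extent ext;
            if (nbd_block_status_dirty(c, off, size - off, &ext) < 0 ||
                ext.length == 0)
                return;
            if (ext.status != 0) { /* dirty per the user-defined bit */
                if (nbd_structured_read(c, off, ext.length, buf) == 0)
                    backup_write(off, ext.length, buf);
            }
            off += ext.length;
        }
    }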
For me all the variants above are OK. Let's finally choose something.
v2:
v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather larger rebase; I've also changed things
to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
modes to be determined by boolean flags (rather than by fixed
values of the 16-bit flags field), changed the reply status fields
to be bitwise-or values (with a default of 0 always being sane),
and changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.
doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 154 insertions(+), 1 deletion(-)
[...]
I'll commit this in a minute into a separate branch called
"extension-blockstatus", under the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)
Regards
Err, I hope I haven't confused everything to heck and back now....
Thanks,
--js