On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote: > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > From: Pavel Borzenkov <pborzen...@virtuozzo.com> > > > > With the availability of sparse storage formats, it is often needed to > > query status of a particular LBA range and read only those blocks of > > data that are actually present on the block device. > > The acronym LBA is not used elsewhere in the NBD spec; should we spell > it out at least once? > > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > There exists a concept of data dirtiness, which is required during, for > > example, incremental block device backup. To express this concept via > > NBD protocol, this patch also adds additional mode of operation to > > NBD_CMD_GET_LBA_STATUS command. > > > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > > LBA STATUS" command more closely, it has been chosen to return a list of > > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > > bitmap. > > > > Signed-off-by: Pavel Borzenkov <pborzen...@virtuozzo.com> > > Reviewed-by: Roman Kagan <rka...@virtuozzo.com> > > Signed-off-by: Denis V. Lunev <d...@openvz.org> > > CC: Wouter Verhelst <w...@uter.be> > > CC: Paolo Bonzini <pbonz...@redhat.com> > > CC: Kevin Wolf <kw...@redhat.com> > > CC: Stefan Hajnoczi <stefa...@redhat.com> > > --- > > doc/proto.md | 82 > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 82 insertions(+) > > > > > > > +### `GET_LBA_STATUS` extension > > + > > +With the availability of sparse storage formats, it is often needed to > > query > > +status of a particular LBA range and read only those blocks of data that > > are > > +actually present on the block device. > > + > > +Some storage formats and operations over such formats express a concept of > > +data dirtiness. Whether the operation is block device mirroring, > > +incremental block device backup or any other operation with a concept of > > +data dirtiness, they all share a need to provide a list of LBA ranges > > +that this particular operation treats as dirty. > > + > > +To provide such class of information, `GET_LBA_STATUS` extension adds new > > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > > +their respective states. > > + > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > + > > + An LBA range status query request. Length and offset define the range > > + of interest. The server MUST reply with a reply header, followed > > + immediately by the following data: > > + > > + - 32 bits, length of parameter data that follow (unsigned) > > Is this length the number of descriptors, or the number of bytes > occupied by those descriptors? It looks like bytes (that is, with the > current layout, this field should be a multiple of 14 unless an error is > returned and the data is bogus). > > I guess 32 bits is sufficient: transmission commands are limited to > 32-bit length, and we are unlikely to have more than one extent per 512 > bytes of length, so even if we have a header for every single sector > (worst-case for alternating clean/dirty sectors), as long as the > smallest granularity of an extent is larger than the extent field, the > 'length of parameter data' in bytes is still sufficient. > > > + - zero or more LBA status descriptors, each having the following > > zero or more? [1] > > > + structure: > > + > > + * 64 bits, offset (unsigned) > > + * 32 bits, length (unsigned) > > + * 16 bits, status (unsigned) > > An array of these status descriptors is packed on the wire, while the > typical C layout of an array of these structures will have padding to > reach a nice 8-byte alignment. Should 'status' be a 32-bit field, so > that clients and servers do not have to pack/unpack between 14 bytes on > the wire and 16 bytes in efficient array handling, at the expense of > more network traffic? > > Conversely, it would be possible to send less data over the wire, as > long as we require that all LBA status descriptors cover consecutive > offsets. That is, having the server reply with offsets is pointless, > since they can be reconstructed on the client by starting with the > offset in the client's request, then adding length from each status > field. Is less network traffic desirable?
I think it's better to explicitly send the start of each LBA extent. So I'll go with changing 'status' field to 32 bits to avoid packing/unpacking issues. > > > + > > + unless an error condition has occurred. > > + > > + If an error occurs, the server SHOULD set the appropriate error code > > + in the error field. The server MUST then either close the > > + connection, or send *length of parameter data* bytes of data > > + (which MAY be invalid). > > + > > + The type of information required by the client is passed to server in > > the > > + command flags field. If the server does not implement requested type or > > + have no means to express it, it MUST NOT return an error, but instead > > MUST > > + return a single LBA status descriptor with *offset* and *length* equal > > to > > + the *offset* and *length* from request, and *status* set to `0`. > > [1] So in what situations will we ever return an array of zero status > fields? On an error? Should we make it clear that the server MUST NOT > return 0 status fields except on an error? > > Do we want to require that the server MUST reply with enough extents to > sum up to the length of the client's request, or are we permitting a > short reply? While the "GET LBA STATUS" command in SCSI allows partial reply, I believe it'd better to keep things simple and require that the server must either return a list of extents that covers the whole requested range, or an error. > > > + > > + The following request types are currently defined for the command: > > + > > + 1. Block provisioning state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > > Here, you spell it '0x0'; in the previous patch, you spelled the command > flag as 'bit 1' - does that mean that Block provisioning state is the > default when no command flags are sent? What if we later add other > flags but still want block provisioning mode? Wouldn't it be better to > state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is > clear, without regards to the state of the other flags, than allowing > this mode only when all 16 flags are zero? > > For example, should we allow a flag that states that the client is > interested only in allocated/unallocated, and that the server may > coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED > for fewer extents reported and thus potentially less network traffic? Actually, for this command I treat 'command flags' field not as a set of flags, but rather as a plain number representing required mode of operation. Probably, not a good idea as it doesn't match the rest of the commands. I went this way because I didn't want to allow clients to request several modes simultaneously (e.g. provisioning + dirtiness) in the same request. This makes server side implementation harder. I think I'll just switch to bits to match the rest of the commands and will add a note, that server should return EINVAL in case several modes are requested simultaneously. > > > + the provisioning state of the device. The following provisionnig states > > s/provisionnig/provisioning/ > > > + are defined for the command: > > + > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block > > device; > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > + and contains zeroes; > > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > > + block device. A client MUST NOT make any assumptions about the > > + contents of the extent. > > Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the > same time, or is that an error on the server? What do we know about an > extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set? > > /me re-reads > > Oh, you are using this as the _entire_ 16-bit status value, rather than > as bits 0, 1, and 2 as flags. Yes > > But I think you have two binary flags (four possible states) here: it is > quite conceivable to have a server on top of a SCSI device, where we > know that the extent is unallocated in SCSI, but where the server will > guarantee that it reads as all zeroes (possibly because the server > bypasses SCSI on the NBD read commands when it knows SCSI is > unallocated). That is, if we set this up as two flags: > > 0x1 - allocated > 0x2 - reads as 0 > > then we can express four states: > > 0x0 - LBA extent not present, client MUST NOT make assumptions about > contents, and reads should not be attempted > 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents > 0x2 - LBA extent not present, but client can treat the extent as zeroes > and reads will succeed > 0x3 - LBA extent present, client can treat the extent as zeroes and > reads will succeed I'm not sure that clients need this level of details. From client's POV 0x2 and 0x3 are the same. > > Actually, we should probably pick the bit values such that 0x0 means > allocated and readable, as the most common state, since we also require > that the command returns a single extent over the entire length with > status 0 if the server can't provide any further details. > > I'm not familiar enough with the SCSI "GET LBA STATUS" command to know > if your command sanely matches to that one. Extents returned by "GET LBA STATUS" in SCSI can have three possible state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be combined together. > > > + > > + 2. Block dirtiness state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > > This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous > patch. Is that okay? Or should we use a different bit value, on the > grounds that some future extension may want to use both flags > orthogonally within the same (possibly new) command? Again, consistency > in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice. I couldn't think of any use case. But, just in case, I'll change the value of the flags so they don't overlap. -- Pavel > > > + the dirtiness status of the device. The following dirtiness states > > + are defined for the command: > > + > > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. > > Again, it looks like you are using these as two entire 16-bit status > values, rather than as two separate bits (1<<0 and 1<<1). Another way > of expressing it is that a single boolean flag is defined, if clear, the > extent is dirty, if set, the extent is clean. > > > + > > + Generic NBD client implementation without knowledge of a particular NBD > > + server operation MUST NOT make any assumption on the meaning of the > > + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > > + > > +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request > > +including one or more sectors beyond the size of the device. > > As mentioned in the previous mail, should we also recommend an EINVAL if > NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the > client sends the command anyways; and/or a requirement that the client > must not issue the command in that case? > > > + > > ## About this file > > > > This file tries to document the NBD protocol as it is currently > > > > -- > Eric Blake eblake redhat com +1-919-301-3266 > Libvirt virtualization library http://libvirt.org >