On 09.07.19 15:09, Stefano Garzarella wrote:
> On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
>> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mre...@redhat.com> wrote:
>>>
>>> On 09.07.19 10:55, Max Reitz wrote:
>>>> On 09.07.19 05:08, Jason Dillaman wrote:
>>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarz...@redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>>>>>> the RBD images that have the fast-diff feature enabled.
>>>>>>>>
>>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>>>>>> to calculate the allocated size for the image.
>>>>>>>>
>>>>>>>> Signed-off-by: Stefano Garzarella <sgarz...@redhat.com>
>>>>>>>> ---
>>>>>>>> v3:
>>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>>>>>     [John, Jason]
>>>>>>>> v2:
>>>>>>>>   - calculate the actual usage only if the fast-diff feature is
>>>>>>>>     enabled [Jason]
>>>>>>>> ---
>>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 54 insertions(+)
>>>>>>>
>>>>>>> Well, the librbd documentation is non-existent as always, but while
>>>>>>> googling, I at least found that libvirt has exactly the same code.
>>>>>>> So I suppose it must be quite correct, then.
>>>>>>>
>>>>>>
>>>>>> While I wrote this code I took a look at the libvirt implementation
>>>>>> and also at the "rbd" tool in the ceph repository:
>>>>>> compute_image_disk_usage() in src/tools/rbd/action/DiskUsage.cc
>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 59757b3120..b6bed683e5 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>>>>>>>>      return info.size;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
>>>>>>>> +                                 void *arg)
>>>>>>>> +{
>>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
>>>>>>>> +
>>>>>>>> +    if (exists) {
>>>>>>>> +        (*alloc_size) += len;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>>>>>> +{
>>>>>>>> +    BDRVRBDState *s = bs->opaque;
>>>>>>>> +    uint64_t flags, features;
>>>>>>>> +    int64_t alloc_size = 0;
>>>>>>>> +    int r;
>>>>>>>> +
>>>>>>>> +    r = rbd_get_flags(s->image, &flags);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    r = rbd_get_features(s->image, &features);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image has the
>>>>>>>> +     * fast-diff feature enabled. If it is disabled,
>>>>>>>> +     * rbd_diff_iterate2() could be very slow on a big image.
>>>>>>>> +     */
>>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>>>>>> +        return -ENOTSUP;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
>>>>>>>> +     * the callback on all allocated regions of the image.
>>>>>>>> +     */
>>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
>>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
>>>>>>>
>>>>>>> But I have a question. This is basically block_status, right? So it
>>>>>>> gives us information on which areas are allocated and which are not.
>>>>>>> The result thus gives us a lower bound on the allocation size, but
>>>>>>> is it really exactly the allocation size?
>>>>>>>
>>>>>>> There are two things I’m concerned about:
>>>>>>>
>>>>>>> 1. What about metadata?
>>>>>>
>>>>>> Good question, I don't think it includes the size used by metadata
>>>>>> and I don't know if there is a way to know it. I'll check further.
>>>>>
>>>>> It does not include the size of metadata; the "rbd_diff_iterate2"
>>>>> function is literally just looking for touched data blocks within the
>>>>> RBD image.
>>>>>
>>>>>>>
>>>>>>> 2. If you have multiple snapshots, this will only report the overall
>>>>>>> allocation information, right? So say there is something like this:
>>>>>>>
>>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>>>>>
>>>>>>> Snapshot 1: AAAA---
>>>>>>> Snapshot 2: --AAAAA
>>>>>>> Snapshot 3: -AAAA--
>>>>>>>
>>>>>>> I think the allocated data size is the number of As in total (13 MB).
>>>>>>> But I suppose this API will just return 7 MB, because it looks at
>>>>>>> everything and it sees the whole image range (7 MB) to be allocated.
>>>>>>> It doesn’t report in how many snapshots some region is allocated.
>>>>>
>>>>> It should return 13 dirty data blocks (multiplied by the size of the
>>>>> data block) since, when you don't provide a "from snapshot" name, it
>>>>> will iterate from the first snapshot to the HEAD revision.
>>>>
>>>> Have you tested that?
>>>>
>>>> I’m so skeptical because the callback function interface has no way of
>>>> distinguishing between different layers of snapshots.
>>>>
>>>> And also because we have the bdrv_block_status_above() function, which
>>>> looks strikingly similar (with the difference that it does not invoke
>>>> a callback but just returns the next allocated range). If you pass
>>>> base=NULL to it, it will also “interpret that as the beginning of time
>>>> and return all allocated regions of the image” (or rather the image
>>>> chain, in our case). But it would just return 7 MB as allocated. (Even
>>>> though it does in fact return layer information, i.e. where a given
>>>> continuous chunk of data is allocated.)
>>>>
>>>> Sure, there is no good reason why our interface should by chance be
>>>> the same as librbd’s interface. But without having tested it, the fact
>>>> that the callback cannot detect which layer a chunk is allocated on
>>>> just makes me wary.
>>>
>>> And the librbd code doesn’t alleviate my concerns.
>>>
>>> From what I can see, it first creates a bitmap (two bits per entry)
>>> that covers the whole image and says which objects are allocated and
>>> which aren’t. Through the whole chain, that is. So in the above
>>> example, the bitmap would report everything as allocated. (Or rather
>>> “updated” in librbd’s terms.)
>>>
>>> Then it has this piece:
>>>
>>>> uint64_t off = m_offset;
>>>> uint64_t left = m_length;
>>>>
>>>> while (left > 0) {
>>>>   uint64_t period_off = off - (off % period);
>>>>   uint64_t read_len = min(period_off + period - off, left);
>>>>
>>>>   // map to extents
>>>>   map<object_t, vector<ObjectExtent> > object_extents;
>>>>   Striper::file_to_extents(cct, m_image_ctx.format_string,
>>>>                            &m_image_ctx.layout, off, read_len, 0,
>>>>                            object_extents, 0);
>>>>
>>>>   // get snap info for each object
>>>>   for (map<object_t, vector<ObjectExtent> >::iterator p = object_extents.begin();
>>>>        p != object_extents.end(); ++p)
>>> [...]
>>>>     for (std::vector<ObjectExtent>::iterator q = p->second.begin();
>>>>          q != p->second.end(); ++q) {
>>>>       r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
>>> [...]
>>>>     }
>>> [...]
>>>>   }
>>>>   left -= read_len;
>>>>   off += read_len;
>>>> }
>>>
>>> So that iterates over the whole image (in our case, because m_offset is
>>> 0 and m_length is the image length), then picks out a chunk of read_len
>>> length, converts that to objects, and then iterates over all of those
>>> objects and all of their extents.
>>>
>>> file_to_extents looks like it’s just an arithmetic operation. So it
>>> doesn’t look like that function returns extents in multiple snapshots.
>>> It just converts a range into subranges called “objects” and “extents”
>>> (at least that’s the way it looks to me).
>>>
>>> So overall, this looks awfully like it simply iterates over the whole
>>> image and then returns allocation information gathered as a top-down
>>> view through all of the snapshots, but not for each snapshot
>>> individually.
>>
>> Sorry, you're correct. The API function was originally designed to
>> support delta extents for diff exports, so while it does open each
>> snapshot's object map, it only returns a unioned set of delta extents.
>> The rbd CLI's "disk-usage" action behaves how I described, counting the
>> actual dirty block usage between snapshots.
>
> Max, Jason, thanks for the details!
>
> If you agree, I'll try to implement something similar, iterating over
> all snapshots.
Yes, that should work.

> What about metadata?
> I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the
> metadata. An idea could be to iterate over them and sum the sizes of
> the key-value pairs returned, but I'm not sure they are returned in the
> same format as they are stored.

I wouldn’t mind ignoring metadata too much. If you can get something to
work, even better, but I don’t consider that too important.

Max