On 09.07.19 15:09, Stefano Garzarella wrote:
> On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
>> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mre...@redhat.com> wrote:
>>>
>>> On 09.07.19 10:55, Max Reitz wrote:
>>>> On 09.07.19 05:08, Jason Dillaman wrote:
>>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarz...@redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>>>>>> the RBD images that have the fast-diff feature enabled.
>>>>>>>>
>>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>>>>>> to calculate the allocated size for the image.
>>>>>>>>
>>>>>>>> Signed-off-by: Stefano Garzarella <sgarz...@redhat.com>
>>>>>>>> ---
>>>>>>>> v3:
>>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>>>>>     [John, Jason]
>>>>>>>> v2:
>>>>>>>>   - calculate the actual usage only if the fast-diff feature is
>>>>>>>>     enabled [Jason]
>>>>>>>> ---
>>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 54 insertions(+)
>>>>>>>
>>>>>>> Well, the librbd documentation is non-existent as always, but while
>>>>>>> googling, I at least found that libvirt has exactly the same code.
>>>>>>> So I suppose it must be quite correct, then.
>>>>>>>
>>>>>>
>>>>>> While I wrote this code I took a look at the libvirt implementation
>>>>>> and also at the "rbd" tool in the ceph repository:
>>>>>> compute_image_disk_usage() in src/tools/rbd/action/DiskUsage.cc
>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 59757b3120..b6bed683e5 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>>>>>>>>      return info.size;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
>>>>>>>> +                                 void *arg)
>>>>>>>> +{
>>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
>>>>>>>> +
>>>>>>>> +    if (exists) {
>>>>>>>> +        (*alloc_size) += len;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>>>>>> +{
>>>>>>>> +    BDRVRBDState *s = bs->opaque;
>>>>>>>> +    uint64_t flags, features;
>>>>>>>> +    int64_t alloc_size = 0;
>>>>>>>> +    int r;
>>>>>>>> +
>>>>>>>> +    r = rbd_get_flags(s->image, &flags);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    r = rbd_get_features(s->image, &features);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image has the
>>>>>>>> +     * fast-diff feature enabled. If it is disabled,
>>>>>>>> +     * rbd_diff_iterate2() could be very slow on a big image.
>>>>>>>> +     */
>>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>>>>>> +        return -ENOTSUP;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
>>>>>>>> +     * the callback on all allocated regions of the image.
>>>>>>>> +     */
>>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
>>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
>>>>>>>
>>>>>>> But I have a question. This is basically block_status, right? So it
>>>>>>> gives us information on which areas are allocated and which are not.
>>>>>>> The result thus gives us a lower bound on the allocation size, but
>>>>>>> is it really exactly the allocation size?
>>>>>>>
>>>>>>> There are two things I’m concerned about:
>>>>>>>
>>>>>>> 1. What about metadata?
>>>>>>
>>>>>> Good question, I don't think it includes the size used by metadata
>>>>>> and I don't know if there is a way to know it. I'll check further.
>>>>>
>>>>> It does not include the size of metadata; the "rbd_diff_iterate2"
>>>>> function is literally just looking for touched data blocks within the
>>>>> RBD image.
>>>>>
>>>>>>>
>>>>>>> 2. If you have multiple snapshots, this will only report the overall
>>>>>>> allocation information, right? So say there is something like this:
>>>>>>>
>>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>>>>>
>>>>>>> Snapshot 1: AAAA---
>>>>>>> Snapshot 2: --AAAAA
>>>>>>> Snapshot 3: -AAAA--
>>>>>>>
>>>>>>> I think the allocated data size is the number of As in total (13 MB).
>>>>>>> But I suppose this API will just return 7 MB, because it looks at
>>>>>>> everything and it sees the whole image range (7 MB) to be allocated.
>>>>>>> It doesn’t report in how many snapshots some region is allocated.
>>>>>
>>>>> It should return 13 dirty data blocks (multiplied by the size of the
>>>>> data block) since, when you don't provide a "from snapshot" name, it
>>>>> will iterate from the first snapshot to the HEAD revision.
>>>>
>>>> Have you tested that?
>>>>
>>>> I’m so skeptical because the callback function interface has no way of
>>>> distinguishing between different layers of snapshots.
>>>>
>>>> And also because we have the bdrv_block_status_above() function, which
>>>> looks strikingly similar (with the difference that it does not invoke
>>>> a callback but just returns the next allocated range). If you pass
>>>> base=NULL to it, it will also “interpret that as the beginning of time
>>>> and return all allocated regions of the image” (or rather the image
>>>> chain, in our case). But it would just return 7 MB as allocated. (Even
>>>> though it does in fact return layer information, i.e. where a given
>>>> continuous chunk of data is allocated.)
>>>>
>>>> Sure, there is no good reason why our interface should by chance be
>>>> the same as librbd’s interface. But without having tested it, the fact
>>>> that the callback cannot detect which layer a chunk is allocated on
>>>> just makes me wary.
>>>
>>> And the librbd code doesn’t alleviate my concerns.
>>>
>>> From what I can see, it first creates a bitmap (two bits per entry)
>>> that covers the whole image and says which objects are allocated and
>>> which aren’t. Through the whole chain, that is. So in the above
>>> example, the bitmap would report everything as allocated. (Or rather
>>> “updated” in librbd’s terms.)
>>>
>>> Then it has this piece:
>>>
>>>> uint64_t off = m_offset;
>>>> uint64_t left = m_length;
>>>>
>>>> while (left > 0) {
>>>>   uint64_t period_off = off - (off % period);
>>>>   uint64_t read_len = min(period_off + period - off, left);
>>>>
>>>>   // map to extents
>>>>   map<object_t, vector<ObjectExtent> > object_extents;
>>>>   Striper::file_to_extents(cct, m_image_ctx.format_string,
>>>>                            &m_image_ctx.layout, off, read_len, 0,
>>>>                            object_extents, 0);
>>>>
>>>>   // get snap info for each object
>>>>   for (map<object_t, vector<ObjectExtent> >::iterator p = object_extents.begin();
>>>>        p != object_extents.end(); ++p)
>>> [...]
>>>>     for (std::vector<ObjectExtent>::iterator q = p->second.begin();
>>>>          q != p->second.end(); ++q) {
>>>>       r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
>>> [...]
>>>>     }
>>> [...]
>>>>   }
>>>>   left -= read_len;
>>>>   off += read_len;
>>>> }
>>>
>>> So that iterates over the whole image (in our case, because m_offset is
>>> 0 and m_length is the image length), then picks out a chunk of read_len
>>> length, converts that to objects, and then iterates over all of those
>>> objects and all of their extents.
>>>
>>> file_to_extents looks like it’s just an arithmetic operation. So it
>>> doesn’t look like that function returns extents in multiple snapshots.
>>> It just converts a range into subranges called “objects” and “extents”
>>> (at least that’s the way it looks to me).
>>>
>>> So overall, this looks awfully like it simply iterates over the whole
>>> image and then returns allocation information gathered as a top-down
>>> view through all of the snapshots, but not for each snapshot
>>> individually.
>>
>> Sorry, you're correct. The API function was originally designed to
>> support delta extents for diff exports, so while it does open each
>> snapshot's object map, it only returns a unioned set of delta extents.
>> The rbd CLI's "disk-usage" action behaves how I described, counting the
>> actual dirty block usage between snapshots.
>
> Max, Jason, thanks for the details!
>
> If you agree, I'll try to implement something similar, iterating over
> all snapshots.
Yes, that should work.

> What about metadata?
> I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the
> metadata. An idea could be to iterate over them and sum the sizes of
> the key-value pairs returned, but I'm not sure they are returned in the
> same format as they are stored.

I wouldn’t mind ignoring metadata too much. If you can get something to
work, even better, but I don’t consider that too important.

Max