At 2024-10-25 15:57:03, "Friedrich Weber" <f.we...@proxmox.com> wrote:
>Hi,
>
>Some of our Proxmox VE users have noticed that a large fstrim inside a
>QEMU/KVM guest does not free up as much space as expected on the
>backing RBD image when the image is mapped on the host via KRBD and
>passed to QEMU as a block device (checked via `rbd du --exact`). If the
>image is attached via QEMU's librbd integration, fstrim seems to work
>much better. I've found an earlier discussion [0] according to which,
>for fstrim to work properly, the filesystem should be aligned to object
>size (4M) boundaries. Indeed, in the test setups I've looked at, the
>filesystem is not aligned to 4M boundaries.
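>
>(For reference, one way to check this from inside the guest, assuming a
>single partition sda1; device names will of course differ:)
>
>> # cat /sys/class/block/sda1/start   # partition start in 512-byte sectors
>> # echo $(( $(cat /sys/class/block/sda1/start) * 512 % (4 * 1024 * 1024) ))   # 0 if 4M-aligned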
>
>Still, I'm wondering if there might be a solution that doesn't require
>a specific partitioning/filesystem layout. To keep the test setup
>simple, I'm not looking at VMs here, but at unaligned blkdiscard on a
>KRBD-backed block device (on the host).
>
>On my test cluster (for versions see [5]), I create a 1G test volume,
>map it with default settings, write random data to it, and then issue
>blkdiscard with a 1M offset (see [1] for the complete commands):
>
>> # blkdiscard --offset 1M /dev/rbd/vmpool/test
>
>`rbd du --exact` then reports a usage of 256M:
>
>> # rbd du --exact -p vmpool test
>> NAME PROVISIONED USED
>> test 1 GiB 256 MiB
>
>Naively, I would expect a result between 1M and 4M, my reasoning being
>that the 1023M discard could be split into 3M (to reach 4M alignment)
>plus 1020M. But I've checked the kernel's discard splitting logic [2],
>and as far as I understand it, it aligns discard requests to
>`discard_granularity`, which is 64k here:
>
>> /sys/class/block/rbd0/queue/discard_granularity:65536
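>
>To make that concrete, here is my (possibly oversimplified) reading of
>the splitting arithmetic in [2], ignoring discard_alignment and played
>through in shell for a chunk starting at the 1M offset:
>
>> # start=$((1 * 1024 * 1024)); max=$((4 * 1024 * 1024))  # offset, discard_max_bytes
>> # gran=$((64 * 1024)); max=$((max - max % gran))        # cap rounded down to granularity
>> # echo $((max - (start + max) % gran))                  # chunk ends on a granularity boundary
>> 4194304
>> # gran=$((4 * 1024 * 1024)); max=$((max - max % gran))
>> # echo $((max - (start + max) % gran))
>> 3145728
>
>So with 64k granularity the chunks start at 1M, 5M, 9M, ..., never on a
>4M object boundary, whereas with 4M granularity the first chunk is cut
>to 3M and the remaining ones line up with object boundaries. At least
>that is my reading of the code.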
>
>I've found that I can set the `alloc_size` map option [3] to 4M, which
>sets `discard_granularity` to 4M. The result of the blkdiscard is then
>much closer to my expectations (see [4] for the complete commands):
>
>> # blkdiscard --offset 1M /dev/rbd/vmpool/test
>> # rbd du --exact -p vmpool test
>> NAME PROVISIONED USED
>> test 1 GiB 1 MiB
>
>However, apparently with `alloc_size` set to 4M, `minimum_io_size` is
>also set to 4M (it was 64k before, see [1]):
>
>> /sys/class/block/rbd0/queue/minimum_io_size:4194304
>
>My expectation is that this could negatively impact non-discard IO
>performance (write amplification?). But I am unsure, as I ran a few
>small benchmarks and couldn't see any real difference between the two
>settings (an example follows below, after the questions). Thus, my
>questions:
>
>- Should I expect any downside for non-discard IO after setting
>`alloc_size` to 4M?
>- If yes: would it be feasible for KRBD to decouple
>`discard_granularity` and `minimum_io_size`, i.e., expose an option to
>set only `discard_granularity` to 4M?
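>
>(To illustrate what I mean by "small benchmarks" above: quick
>random-write fio runs against the mapped device, roughly like the
>following sketch; the exact parameters varied:)
>
>> # fio --name=randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
>>     --rw=randwrite --bs=4k --iodepth=16 --runtime=30 --time_based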
>
>I'd be happy about any pointers, and please let me know if I can
>provide any further information.
>
>Thanks and best wishes,
>
>Friedrich
>
>[0] https://www.spinics.net/lists/ceph-users/msg67740.html
>[1]
>
>> # rbd create -p vmpool test --size 1G
>> # rbd map -p vmpool test
>> /dev/rbd0
>> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
>> /sys/class/block/rbd0/queue/discard_granularity:65536
>> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
>> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
>> /sys/class/block/rbd0/queue/discard_zeroes_data:0
>> /sys/class/block/rbd0/queue/minimum_io_size:65536
>> /sys/class/block/rbd0/queue/optimal_io_size:4194304
>> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
>> dd: error writing '/dev/rbd/vmpool/test': No space left on device
>> 257+0 records in
>> 256+0 records out
>> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.73227 s, 227 MB/s
>> # rbd du --exact -p vmpool test
>> NAME PROVISIONED USED
>> test 1 GiB 1 GiB
>
>[2]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-merge.c?h=v6.11&id=98f7e32f20d28ec452afb208f9cffc08448a2652#n108
>[3] https://docs.ceph.com/en/reef/man/8/rbd/
>
>[4]
>
>> # rbd map -p vmpool test -o alloc_size=4194304
>> /dev/rbd0
>> # grep '' /sys/class/block/rbd*/device/config_info
>> 10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789 name=admin,key=client.admin,alloc_size=4194304 vmpool test -
>> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
>> /sys/class/block/rbd0/queue/discard_granularity:4194304
>> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
>> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
>> /sys/class/block/rbd0/queue/discard_zeroes_data:0
>> /sys/class/block/rbd0/queue/minimum_io_size:4194304
>> /sys/class/block/rbd0/queue/optimal_io_size:4194304
>> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
>> dd: error writing '/dev/rbd/vmpool/test': No space left on device
>> 257+0 records in
>> 256+0 records out
>> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.39016 s, 245 MB/s
>> # rbd du --exact -p vmpool test
>> NAME PROVISIONED USED
>> test 1 GiB 1 GiB
>
>[5]
>
>Host: Proxmox VE 8.2 but with Ubuntu mainline kernel 6.11 build
>(6.11.0-061100-generic from https://kernel.ubuntu.com/mainline/v6.11/)
>Ceph: Proxmox build of 18.2.4, but happy to try a different build if needed.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io