I see Xiubo started discussing this on https://tracker.ceph.com/issues/53542 as well.
So the large writes are going to the journal file, and sometimes it's a single write of a full segment size, which is what I was curious about. At this point the next step is seeing what is actually taking up that space. You could turn up logging and send in a snippet, but I think the easiest thing is going to involve:
* track one of those 4 MB full-object writes
Then either
a) pull the object in question off disk and look at it using ceph-dencoder, or
b) use cephfs-journal-tool to inspect the relevant journal range

From your output below you could grab 200.02c7a084 (which is at journal offset 0x2c7a084*4MiB), though that's probably been expired by this point, so you'll need to get another dump which contains a large one.

I haven't looked at these data structures using these tools in a while, so I'll leave more detail up to Xiubo.
-Greg

On Fri, Dec 10, 2021 at 12:48 AM Andras Sali <sali.and...@gmail.com> wrote:
>
> Hi Greg,
>
> As a follow up, we see items similar to this pop up in the objecter_requests output (when it's not empty). Not sure if we're reading it right, but some appear quite large (in the MB range?):
>
> {
>     "ops": [
>         {
>             "tid": 9532804,
>             "pg": "3.f9c235d7",
>             "osd": 2,
>             "object_id": "200.02c7a084",
>             "object_locator": "@3",
>             "target_object_id": "200.02c7a084",
>             "target_object_locator": "@3",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "1121127.434264s",
>             "age": 0.016000104000000001,
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "0=[]",
>             "mtime": "2021-12-10T08:35:34.582215+0000",
>             "osd_ops": [
>                 "write 0~4194304 [fadvise_dontneed] in=4194304b"
>             ]
>         },
>         {
>             "tid": 9532806,
>             "pg": "3.abba2e66",
>             "osd": 2,
>             "object_id": "200.02c7a085",
>             "object_locator": "@3",
>             "target_object_id": "200.02c7a085",
>             "target_object_locator": "@3",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "1121127.438264s",
>             "age": 0.012000078000000001,
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "0=[]",
>             "mtime": "2021-12-10T08:35:34.589044+0000",
>             "osd_ops": [
>                 "write 0~1236893 [fadvise_dontneed] in=1236893b"
>             ]
>         },
>         {
>             "tid": 9532807,
>             "pg": "3.abba2e66",
>             "osd": 2,
>             "object_id": "200.02c7a085",
>             "object_locator": "@3",
>             "target_object_id": "200.02c7a085",
>             "target_object_locator": "@3",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "1121127.442264s",
>             "age": 0.0080000520000000006,
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "0=[]",
>             "mtime": "2021-12-10T08:35:34.592283+0000",
>             "osd_ops": [
>                 "write 1236893~510649 [fadvise_dontneed] in=510649b"
>             ]
>         },
>         {
>             "tid": 9532808,
>             "pg": "3.abba2e66",
>             "osd": 2,
>             "object_id": "200.02c7a085",
>             "object_locator": "@3",
>             "target_object_id": "200.02c7a085",
>             "target_object_locator": "@3",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "1121127.442264s",
>             "age": 0.0080000520000000006,
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "0=[]",
>             "mtime": "2021-12-10T08:35:34.592387+0000",
>             "osd_ops": [
>                 "write 1747542~13387 [fadvise_dontneed] in=13387b"
>             ]
>         }
>     ],
>     "linger_ops": [],
>     "pool_ops": [],
>     "pool_stat_ops": [],
>     "statfs_ops": [],
>     "command_ops": []
> }
>
> Any suggestions would be much appreciated.
>
> Kind regards,
>
> András
>
>
> On Thu, Dec 9, 2021 at 7:48 PM Andras Sali <sali.and...@gmail.com> wrote:
>>
>> Hi Greg,
>>
>> Much appreciated for the reply. The image is also available at:
>> https://tracker.ceph.com/attachments/download/5808/Bytes_per_op.png
>>
>> How the graph is generated: we back the CephFS metadata pool with Azure ultra SSD disks. For each disk, Azure reports each minute the average read/write IOPS (operations per second) and the average read/write throughput (in bytes per second).
>>
>> We then divide the write throughput by the write IOPS - this is the average write bytes per operation (which we plot in the above graph). We observe that this increases up to around 300 KB, whilst after restarting the MDS it stays around 32 KB for some time (then starts increasing). The read bytes per operation are consistently much smaller.
>>
>> The issue is that once we are in the "high" regime, the same workload that does, for example, 1000 IOPS needs 300 MB/s of throughput, instead of the 30 MB/s we observe after a restart. The high throughput often results in hitting the VM-level limits in Azure, after which the queue depth explodes and operations begin stalling.
>>
>> We will do the dump and report it as well once we have it.
>>
>> Thanks again for any ideas on this.
>>
>> Kind regards,
>>
>> Andras
>>
>>
>> On Thu, Dec 9, 2021, 15:07 Gregory Farnum <gfar...@redhat.com> wrote:
>>>
>>> Andras,
>>>
>>> Unfortunately your attachment didn't come through the list. (It might work if you embed it inline? Not sure.) I don't know if anybody's looked too hard at this before, and without the image I don't know exactly what metric you're using to say something's 320KB in size. Can you explain more?
>>>
>>> It might help if you dump the objecter_requests from the MDS and share those — it'll display what objects are being written to with what sizes.
>>> -Greg
>>>
>>>
>>> On Wed, Dec 8, 2021 at 9:00 AM Andras Sali <sali.and...@gmail.com> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > We have been observing that if we let our MDS run for some time, the bandwidth usage of the disks in the metadata pool starts increasing significantly (whilst IOPS stays about constant), even though the number of clients, the workloads and everything else stay the same.
>>> >
>>> > However, after restarting the MDS, the issue goes away for some time and the same workloads require 1/10th of the metadata disk bandwidth whilst doing the same IOPS.
>>> >
>>> > We run our CephFS cluster in a cloud environment where disk throughput / bandwidth capacity is quite expensive to increase, and we are hitting bandwidth / throughput limits even though we still have a lot of IOPS capacity left.
>>> >
>>> > We suspect that somehow the journaling of the MDS becomes more extensive (i.e. larger journal updates for each operation), but we couldn't really pin down which parameter might affect this.
>>> >
>>> > I attach a plot of how the bytes per operation (throughput in MB/s divided by IOPS) evolves over time: when we restart the MDS it drops to around 32 KB (even though the min block size for the metadata pool OSDs is 4 KB in our settings) and then increases over time to around 300 KB.
>>> >
>>> > Any ideas on how to "fix" this and have a significantly lower bandwidth usage would be really, really appreciated!
>>> >
>>> > Thank you and kind regards,
>>> >
>>> > Andras
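
A rough, unverified sketch of the inspection workflow Greg describes above. The filesystem name (cephfs), metadata pool name (cephfs_metadata), MDS id (mds.a) and output paths are placeholders, not values taken from this thread:

  # 1. Catch a large in-flight journal write and note its object_id,
  #    e.g. 200.02c7a084 in the dump above (substitute a current object)
  ceph daemon mds.a objecter_requests

  # 2a. Pull that object off the metadata pool for offline inspection;
  #     ceph-dencoder only decodes types it knows about, so check list_types first
  rados -p cephfs_metadata get 200.02c7a084 /tmp/200.02c7a084.bin
  ceph-dencoder list_types

  # 2b. Or inspect the journal with cephfs-journal-tool: check its extent,
  #     then summarise or dump the events to see what they contain
  cephfs-journal-tool --rank=cephfs:0 journal inspect
  cephfs-journal-tool --rank=cephfs:0 event get summary
  cephfs-journal-tool --rank=cephfs:0 event get json --path /tmp/events.json

The event summary groups journal events by type, which should give a first hint at what is filling those 4 MB segments; the json dump shows the individual entries.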