On Thu, Aug 26, 2021 at 9:49 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> he he, I built a large omap object cluster, we are up to 5 now :)
>
> It is possible that our meta-data pool became a bottleneck. I'm re-deploying OSDs on these disks at the moment, increasing the OSD count from 1 to 4. The disks I use require high concurrency access to get close to spec performance and a single OSD per disk doesn't get close to saturation (it's Intel enterprise NVMe-SSD SAS drives with really good performance specs). Therefore, I don't see the disks themselves as a bottleneck in iostat or atop, but it is very well possible that the OSD daemon is at its limit. It will take a couple of days to complete this and I will report back.
>
> > This covers the topic and relevant config:
> > https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> This is a classic ceph documentation page: just numbers without units (size of 10000 what??) and without any explanation of how this would relate to object sizes and/or key counts :) After reading it, I don't think we are looking at dirfrags. The key count is simply too large and the size probably as well. Could it be MDS journals? What other objects might become large? Or, how could I check what it is, for example, by looking at a hexdump?
>
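On the units question: as far as I know, the dirfrag thresholds on that page are counted in directory entries, i.e. omap keys per fragment. A rough sketch for checking the current values, assuming the default option names and that they are kept in the config database rather than in ceph.conf:

    # entries above which a directory fragment is split (default 10000)
    ceph config get mds mds_bal_split_size
    # entries below which adjacent fragments are merged (default 50)
    ceph config get mds mds_bal_merge_size
    # hard cap on entries in a single fragment (default 100000)
    ceph config get mds mds_bal_fragment_size_max
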
Taking one example:

2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 : cluster [WRN] Large omap object found. Object: 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20) Key count: 2293816 Size (bytes): 1078093520

This is inode 1000d7fd167, i.e. 1099738108263.
You can find this huge dir in the fs like `find /cephfs -type d -inum 1099738108263`. I expect it to be a huge directory.

You can observe the contents of the dir via rados:

    rados -p cephfs_metadata listomapkeys 1000d7fd167.02800000

> I should mention that we have a bunch of super-aggressive clients on the FS. Currently, I'm running 4 active MDS daemons and they seem to have distributed the client load very well between each other by now. The aggressive clients are probably open-foam or similar jobs that create millions and millions of small files in very short time. I have seen peaks of 4-8K requests per second to the MDSes. On our old Lustre system they managed to run out of inodes long before the storage capacity was reached, it's probably the worst data to inode ratio one can think of. One of the advantages of ceph is its unlimited inode capacity and it seems to cope with the usage pattern reasonably well - modulo the problems I seem to observe here.

Advise your clients to spread these millions of small files across many directories. In my experience users start to suffer once there are more than a few hundred thousand files in a directory. ("suffer" -- creating/deleting files and listing the dir starts to slow down substantially, especially if they are working in the same dir from many clients)
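Putting the lookup steps above in one place -- a rough sketch, assuming bash, a mount point of /cephfs, and your own metadata pool name (con-fs2-meta1 in your case; cephfs_metadata above was just a generic name):

    # metadata-pool object names are <inode in hex>.<fragment id>;
    # drop the fragment suffix and convert the inode to decimal
    printf '%d\n' 0x1000d7fd167        # -> 1099738108263
    # locate the directory owning that inode (can take a while on a big fs)
    find /cephfs -type d -inum 1099738108263
    # count the omap keys (one per file/subdirectory) of the offending object
    rados -p con-fs2-meta1 listomapkeys 1000d7fd167.02800000 | wc -l
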
-- dan

>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 25 August 2021 15:46:27
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
>
> Hi,
>
> On Wed, Aug 25, 2021 at 2:37 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > > [...] Do you have some custom mds config in this area?
> >
> > none that I'm aware of. What MDS config parameters should I look for?
>
> This covers the topic and relevant config:
> https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> Here in our clusters we've never had to tune any of these options -- it works well with the defaults on our hw/workloads.
>
> > I recently seem to have had problems with very slow dirfrag operations that made an MDS unresponsive long enough for a MON to kick it out. I had to increase the MDS beacon timeout to get out of an MDS restart loop (it also had oversized cache by the time I discovered the problem). The dirfrag was reported as a slow op warning.
>
> That sounds related. In our env I've never noticed slow dirfrag ops. Do you have any underlying slowness or overload on your metadata osds?
>
> -- dan
>
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <d...@vanderster.com>
> > Sent: 25 August 2021 14:05:00
> > To: Frank Schilder
> > Cc: ceph-users
> > Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
> >
> > Those are probably large directories; each omap key is a file/subdir in the directory.
> >
> > Normally the mds fragments dirs across several objects, so you shouldn't have a huge number of omap entries in any one single object. Do you have some custom mds config in this area?
> >
> > -- dan
> >
> > On Wed, Aug 25, 2021 at 2:01 PM Frank Schilder <fr...@dtu.dk> wrote:
> > >
> > > Hi Dan,
> > >
> > > thanks for looking at this. Here are the lines from health detail and ceph.log:
> > >
> > > [root@gnosis ~]# ceph health detail
> > > HEALTH_WARN 4 large omap objects
> > > LARGE_OMAP_OBJECTS 4 large omap objects
> > >     4 large objects found in pool 'con-fs2-meta1'
> > >     Search the cluster log for 'Large omap object found' for more details.
> > >
> > > The search gives:
> > >
> > > 2021-08-25 11:17:00.675474 osd.21 osd.21 192.168.32.77:6846/12302 651 : cluster [WRN] Large omap object found. Object: 12:373fb013:::1000eec35f5.01000000:head PG: 12.c80dfcec (12.6c) Key count: 216000 Size (bytes): 101520000
> > > 2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 : cluster [WRN] Large omap object found. Object: 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20) Key count: 2293816 Size (bytes): 1078093520
> > > 2021-08-25 11:17:11.152671 osd.37 osd.37 192.168.32.77:6850/12306 645 : cluster [WRN] Large omap object found. Object: 12:05da1450:::1000e118c0a.00000000:head PG: 12.a285ba0 (12.20) Key count: 220612 Size (bytes): 103687640
> > > 2021-08-25 11:17:36.603664 osd.36 osd.36 192.168.32.75:6848/11882 1243 : cluster [WRN] Large omap object found. Object: 12:0b298d19:::1000eec35f7.04e00000:head PG: 12.98b194d0 (12.50) Key count: 657212 Size (bytes): 308889640
> > >
> > > They are all in the fs meta-data pool.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Dan van der Ster <d...@vanderster.com>
> > > Sent: 25 August 2021 13:57:44
> > > To: Frank Schilder
> > > Cc: ceph-users
> > > Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
> > >
> > > Hi Frank,
> > >
> > > Which objects are large? (You should see this in ceph.log when the large obj was detected).
> > >
> > > -- dan
> > >
> > > On Wed, Aug 25, 2021 at 12:27 PM Frank Schilder <fr...@dtu.dk> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I have the notorious "LARGE_OMAP_OBJECTS: 4 large omap objects" warning and am again wondering if there is any proper action one can take except "wait it out and deep-scrub (numerous ceph-users threads)" or "ignore (https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects)". Only for RGWs is a proper action described, but mine come from MDSes. Is there any way to ask an MDS to clean up or split the objects?
> > > >
> > > > The disks with the meta-data pool can easily deal with objects of this size. My question is more along the lines: If I can't do anything anyway, why the warning? If there is a warning, I would assume that one can do something proper to prevent large omap objects from being born by an MDS. What is it?
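
One note on the "is there any proper action" question above: as far as I understand, the large-omap stats are only collected during deep scrub, so once the offending directory has been cleaned up or spread out, the warning should clear after the PGs that reported it are deep-scrubbed again, for example:

    # re-scrub one of the PGs from the log lines above so its omap stats get refreshed
    ceph pg deep-scrub 12.20

This only refreshes the statistics; it does not shrink the objects themselves.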
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io