On Thu, Aug 26, 2021 at 9:49 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> he he, I built a large omap object cluster, we are up to 5 now :)
>
> It is possible that our meta-data pool became a bottleneck. I'm re-deploying 
> OSDs on these disks at the moment, increasing the OSD count from 1 to 4. The 
> disks I use require high concurrency access to get close to spec performance 
> and a single OSD per disk doesn't get close to saturation (its Intel 
> enterprise NVMe-SSD SAS drives with really good performance specs). 
> Therefore, I don't see the disks themselves as a bottleneck in iostat or 
> atop, but it is very well possible that the OSD daemon is at its limit. It 
> will take a couple of days to complete this and I will report back.
>
> > This covers the topic and relevant config:
> > https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> This is a classic ceph documentation page: just numbers without units (a size 
> of 10000 what??) and no explanation of how this relates to object 
> sizes and/or key counts :) After reading it, I don't think we are looking at 
> dirfrags. The key count is simply too large, and probably the size as well. 
> Could it be MDS journals? What other objects might become large? Or how 
> could I check what it is, for example by looking at a hexdump?
>

Taking one example:
2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 :
cluster [WRN] Large omap object found. Object:
12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20)
Key count: 2293816 Size (bytes): 1078093520

This is inode 1000d7fd167 (hex), i.e. 1099738108263 in decimal.
You can locate it in the fs with `find /cephfs -type d -inum
1099738108263`; I expect it to be a huge directory.
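
In case it's useful, the hex-to-decimal conversion can be done in any
shell, e.g.:

  # the object name prefix is the inode number in hex
  printf '%d\n' 0x1000d7fd167    # prints 1099738108263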

You can observe the contents of the dir via rados:

  rados -p cephfs_metadata listomapkeys 1000d7fd167.02800000
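
For example, to sanity-check that those keys really are directory
entries, you could count and sample them (same pool and object name
as above):

  # should roughly match the key count reported in the warning
  rados -p cephfs_metadata listomapkeys 1000d7fd167.02800000 | wc -l
  # peek at a few of the entry names
  rados -p cephfs_metadata listomapkeys 1000d7fd167.02800000 | head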


> I should mention that we have a bunch of super-aggressive clients on the FS. 
> Currently, I'm running 4 active MDS daemons and they seem to have distributed 
> the client load very well between each other by now. The aggressive clients 
> are probably OpenFOAM or similar jobs that create millions and millions of 
> small files in a very short time. I have seen peaks of 4-8K requests per second 
> to the MDSes. On our old Lustre system they managed to run out of inodes long 
> before the storage capacity was reached; it's probably the worst data-to-inode 
> ratio one can think of. One of the advantages of ceph is its unlimited inode 
> capacity, and it seems to cope with the usage pattern reasonably well - modulo 
> the problems I seem to observe here.

Advise your clients to spread these millions of small files across
many directories. In my experience users start to suffer once there
are more than a few hundred thousand files in a directory ("suffer"
meaning that creating/deleting files and listing the dir start to
slow down substantially, especially when many clients are working in
the same dir).
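
If it helps, the per-directory entry counts are also visible from any
client via the CephFS virtual xattrs, e.g. (the path below is just a
placeholder):

  # entries directly in the directory
  getfattr -n ceph.dir.entries /cephfs/path/to/dir
  # recursive count for the whole subtree
  getfattr -n ceph.dir.rentries /cephfs/path/to/dir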

-- dan



>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 25 August 2021 15:46:27
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
>
> Hi,
>
> On Wed, Aug 25, 2021 at 2:37 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > > [...] Do you have some custom mds config in this area?
> >
> > none that I'm aware of. What MDS config parameters should I look for?
>
> This covers the topic and relevant config:
> https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> Here in our clusters we've never had to tune any of these options --
> it works well with the defaults on our hw/workloads.
>
> I recently seem to have had problems with very slow dirfrag operations that 
> made an MDS unresponsive long enough for a MON to kick it out. I had to 
> increase the MDS beacon timeout to get out of an MDS restart loop (it also 
> had an oversized cache by the time I discovered the problem). The dirfrag 
> operation was reported in a slow-ops warning.
>
> That sounds related. In our env I've never noticed slow dirfrag ops.
> Do you have any underlying slowness or overload on your metadata osds?
>
> -- dan
>
>
>
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <d...@vanderster.com>
> > Sent: 25 August 2021 14:05:00
> > To: Frank Schilder
> > Cc: ceph-users
> > Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
> >
> > Those are probably large directories; each omap key is a file/subdir
> > in the directory.
> >
> > Normally the mds fragments dirs across several objects, so you
> > shouldn't have a huge number of omap entries in any single object.
> > Do you have some custom mds config in this area?
> >
> > -- dan
> >
> > On Wed, Aug 25, 2021 at 2:01 PM Frank Schilder <fr...@dtu.dk> wrote:
> > >
> > > Hi Dan,
> > >
> > > thanks for looking at this. Here are the lines from health detail and 
> > > ceph.log:
> > >
> > > [root@gnosis ~]# ceph health detail
> > > HEALTH_WARN 4 large omap objects
> > > LARGE_OMAP_OBJECTS 4 large omap objects
> > >     4 large objects found in pool 'con-fs2-meta1'
> > >     Search the cluster log for 'Large omap object found' for more details.
> > >
> > > The search gives:
> > >
> > > 2021-08-25 11:17:00.675474 osd.21 osd.21 192.168.32.77:6846/12302 651 :
> > > cluster [WRN] Large omap object found. Object:
> > > 12:373fb013:::1000eec35f5.01000000:head PG: 12.c80dfcec (12.6c)
> > > Key count: 216000 Size (bytes): 101520000
> > > 2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 :
> > > cluster [WRN] Large omap object found. Object:
> > > 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20)
> > > Key count: 2293816 Size (bytes): 1078093520
> > > 2021-08-25 11:17:11.152671 osd.37 osd.37 192.168.32.77:6850/12306 645 :
> > > cluster [WRN] Large omap object found. Object:
> > > 12:05da1450:::1000e118c0a.00000000:head PG: 12.a285ba0 (12.20)
> > > Key count: 220612 Size (bytes): 103687640
> > > 2021-08-25 11:17:36.603664 osd.36 osd.36 192.168.32.75:6848/11882 1243 :
> > > cluster [WRN] Large omap object found. Object:
> > > 12:0b298d19:::1000eec35f7.04e00000:head PG: 12.98b194d0 (12.50)
> > > Key count: 657212 Size (bytes): 308889640
> > >
> > > They are all in the fs meta-data pool.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Dan van der Ster <d...@vanderster.com>
> > > Sent: 25 August 2021 13:57:44
> > > To: Frank Schilder
> > > Cc: ceph-users
> > > Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
> > >
> > > Hi Frank,
> > >
> > > Which objects are large? (You should see this in ceph.log when the
> > > large obj was detected).
> > >
> > > -- dan
> > >
> > > On Wed, Aug 25, 2021 at 12:27 PM Frank Schilder <fr...@dtu.dk> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I have the notorious "LARGE_OMAP_OBJECTS: 4 large omap objects" warning 
> > > > and am again wondering if there is any proper action one can take 
> > > > except "wait it out and deep-scrub (numerous ceph-users threads)" or 
> > > > "ignore 
> > > > (https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects)".
> > > >  Only for RGWs is a proper action described, but mine come from MDSes. 
> > > > Is there any way to ask an MDS to clean up or split the objects?
> > > >
> > > > The disks with the meta-data pool can easily deal with objects of this 
> > > > size. My question is more along the lines of: if I can't do anything 
> > > > anyway, why the warning? If there is a warning, I would assume that one 
> > > > can do something proper to prevent large omap objects from being born 
> > > > by an MDS. What is it?
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
