Hi Olli,

On Tue, Jul 2, 2024 at 7:51 PM Olli Rajala <olli.raj...@anima.fi> wrote:
>
> Hi - mostly as a note to future me and to anyone else looking into the same issue...
>
> I finally solved this a couple of months ago. I still have no idea what is wrong with Ceph, but the root cause triggering this MDS issue was that I had several workstations and a couple of servers where the updatedb job of "locate" was run by daily cron at exactly the same time every night. That caused a high momentary strain on the MDS, which then somehow upset the metadata caching and flushing, creating this cumulative write IO.

That's probably due to updatedb walking the Ceph file system and also possibly triggering cap recalls from other clients.
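For anyone else who runs into this: the simplest fix on the client side is to tell updatedb to skip CephFS entirely via /etc/updatedb.conf. The snippet below is only a sketch - the exact variable names and defaults differ between the GNU findutils "locate" and the mlocate/plocate packages, so check updatedb.conf(5) for whichever implementation you have installed, and the mount point is just an example:

    # /etc/updatedb.conf - keep the nightly index run off network/cluster filesystems
    PRUNEFS="ceph fuse.ceph nfs nfs4 cifs"
    # alternatively, exclude specific mount points (example path):
    PRUNEPATHS="/mnt/cephfs"

With that in place the nightly cron job never walks the CephFS mounts, so it cannot hammer the MDS or force cap recalls in the first place.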
> The thing to note here is that there's a difference between the "locate" and "mlocate" packages. The default updatedb config for "mlocate" (on Ubuntu at least) does skip scanning CephFS filesystems, but the one for "locate" does not and happily ventures onto all of your CephFS mounts :|
>
> ---------------------------
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---------------------------
>
>
> On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala <olli.raj...@anima.fi> wrote:
> > Hi,
> >
> > One thing I now noticed in the mds logs is that there's a ton of entries like this:
> >
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache projecting to [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache projecting to [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache projecting to [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache projecting to [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache projecting to [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> >
> > ...and after dropping the caches there are considerably fewer of those - normal, abnormal, typical, atypical? ...or is that just something that starts happening after the cache gets filled?
> >
> > Tnx,
> > ---------------------------
> > Olli Rajala - Lead TD
> > Anima Vitae Ltd.
> > www.anima.fi
> > ---------------------------
> >
> >
> > On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala <olli.raj...@anima.fi> wrote:
> >> Hi,
> >>
> >> I'm still totally lost with this issue. And lately I've had a couple of incidents where the write bandwidth has suddenly jumped to even crazier levels. See the graph here:
> >> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
> >>
> >> The points where it drops back to something manageable are where I have dropped the mds caches. Usually after a drop there is a steady rise, but these sudden jumps are something new and even scarier :E
> >>
> >> Here's a fresh 2-second level 20 mds log:
> >> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
> >>
> >> Any help and ideas greatly appreciated. Is there any tool or procedure to safely check or rebuild the mds data, in case this behaviour is caused by some hidden issue with the data itself?
> >>
> >> Tnx,
> >> ---------------------------
> >> Olli Rajala - Lead TD
> >> Anima Vitae Ltd.
> >> www.anima.fi
> >> ---------------------------
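(A side note for anyone finding this thread later: "dropping the mds caches" here means asking the MDS to trim its cache and recall client caps via the tell interface. This is only a sketch - the daemon name is an example from this thread, and the command assumes a reasonably recent release such as the Pacific cluster discussed here:

    ceph tell mds.pve-core-1 cache drop
    # or address the active MDS of a filesystem by rank:
    ceph tell mds.<fs_name>:0 cache drop

Expect a short burst of cap recall activity right after running it.)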
> >>
> >> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar <vshan...@redhat.com> wrote:
> >>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala <olli.raj...@anima.fi> wrote:
> >>> >
> >>> > Hi Venky,
> >>> >
> >>> > I have indeed observed the output of the different sections of perf dump like so:
> >>> >
> >>> > watch -n 1 ceph tell mds.`hostname` perf dump objecter
> >>> > watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
> >>> > ...etc...
> >>> >
> >>> > ...but without a proper understanding of what a normal rate is for any given counter, it's really difficult to make anything of it.
> >>> >
> >>> > btw - is there some convenient way to capture this kind of temporal output for others to view? Sure, I could just dump once a second to a file or to sequential files, but is there some tool or convention that is easy to look at and analyze?
> >>>
> >>> Not really - you'd have to do it yourself.
> >>>
> >>> >
> >>> > Tnx,
> >>> > ---------------------------
> >>> > Olli Rajala - Lead TD
> >>> > Anima Vitae Ltd.
> >>> > www.anima.fi
> >>> > ---------------------------
> >>> >
> >>> >
> >>> > On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar <vshan...@redhat.com> wrote:
> >>> >>
> >>> >> Hi Olli,
> >>> >>
> >>> >> On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala <olli.raj...@anima.fi> wrote:
> >>> >> >
> >>> >> > Hi Patrick,
> >>> >> >
> >>> >> > With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops" and/or "ceph tell mds.pve-core-1 objecter_requests"? Both of these show very few requests/ops - many times they just return empty lists. I'm pretty sure this I/O isn't generated by any clients - I earlier tried to isolate this by shutting down all cephfs clients and it didn't have any noticeable effect.
> >>> >> >
> >>> >> > I tried to watch what is going on with that "perf dump", but to be honest all I can see is some numbers going up in the different sections :) ...I don't have a clue what to focus on or how to interpret it.
> >>> >> >
> >>> >> > Here's a perf dump if you or anyone else could make something out of it:
> >>> >> > https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
> >>> >>
> >>> >> You'd need to capture this over a period of time to see what ops might be going through and what the mds is doing.
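(Since this keeps coming up: a minimal way to capture those dumps over time is just a timestamped loop. This is only a sketch - adjust the MDS name, interval, and output directory to taste:

    #!/bin/bash
    # Dump the MDS perf counters every 5 seconds into timestamped JSON files
    # so the counters can be diffed or graphed afterwards.
    mds=pve-core-1            # example daemon name
    outdir=/tmp/mds-perf      # example output directory
    mkdir -p "${outdir}"
    while true; do
        ts=$(date +%Y%m%dT%H%M%S)
        ceph tell "mds.${mds}" perf dump > "${outdir}/perf-${ts}.json"
        sleep 5
    done

Most of the interesting counters are monotonically increasing, so it is the rate of change between two dumps, not the absolute values, that tells you what the MDS is busy with.)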
> >>> >> >
> >>> >> > Tnx!
> >>> >> > o.
> >>> >> >
> >>> >> > ---------------------------
> >>> >> > Olli Rajala - Lead TD
> >>> >> > Anima Vitae Ltd.
> >>> >> > www.anima.fi
> >>> >> > ---------------------------
> >>> >> >
> >>> >> >
> >>> >> > On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <pdonn...@redhat.com> wrote:
> >>> >> > >
> >>> >> > > Hello Olli,
> >>> >> > >
> >>> >> > > On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.raj...@anima.fi> wrote:
> >>> >> > > >
> >>> >> > > > Hi,
> >>> >> > > >
> >>> >> > > > I'm seeing constant 25-50MB/s writes to the metadata pool even when all clients and the cluster are idling and in a clean state. This surely can't be normal?
> >>> >> > > >
> >>> >> > > > There are no apparent issues with the performance of the cluster, but this write rate seems excessive and I don't know where to look for the culprit.
> >>> >> > > >
> >>> >> > > > The setup is Ceph 16.2.9 running on a hyperconverged 3-node core cluster and 6 HDD OSD nodes.
> >>> >> > > >
> >>> >> > > > Here's a typical status when pretty much all clients are idling. Most of that write bandwidth and maybe a fifth of the write IOPS is hitting the metadata pool.
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > > root@pve-core-1:~# ceph -s
> >>> >> > > >   cluster:
> >>> >> > > >     id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
> >>> >> > > >     health: HEALTH_OK
> >>> >> > > >
> >>> >> > > >   services:
> >>> >> > > >     mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
> >>> >> > > >     mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
> >>> >> > > >     mds: 1/1 daemons up, 2 standby
> >>> >> > > >     osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
> >>> >> > > >
> >>> >> > > >   data:
> >>> >> > > >     volumes: 1/1 healthy
> >>> >> > > >     pools:   10 pools, 625 pgs
> >>> >> > > >     objects: 70.06M objects, 46 TiB
> >>> >> > > >     usage:   95 TiB used, 182 TiB / 278 TiB avail
> >>> >> > > >     pgs:     625 active+clean
> >>> >> > > >
> >>> >> > > >   io:
> >>> >> > > >     client: 45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > >
> >>> >> > > > Here's a daemonperf dump:
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
> >>> >> > > > ----------------------------------------mds----------------------------------------- --mds_cache--- ------mds_log------ -mds_mem- -------mds_server------- mds_ -----objecter------ purg
> >>> >> > > > req rlat fwd inos caps exi imi hifc crev cgra ctru cfsa cfa hcc hccd hccr prcr|stry recy recd|subm evts segs repl|ino dn |hcr hcs hsr cre cat |sess|actv rd wr rdwr|purg|
> >>> >> > > >  40   0   0 767k  78k   0   0   0   1   6   1   0   0   5   5   3   7 |1.1k   0   0 | 17 3.7k 134   0 |767k 767k| 40   5   0   0   0 |110 |  4   2  21   0 |  2
> >>> >> > > >  57   2   0 767k  78k   0   0   0   3  16   3   0   0  11  11   0  17 |1.1k   0   0 | 45 3.7k 137   0 |767k 767k| 57   8   0   0   0 |110 |  0   2  28   0 |  4
> >>> >> > > >  57   4   0 767k  78k   0   0   0   4  34   4   0   0  34  33   2  26 |1.0k   0   0 |134 3.9k 139   0 |767k 767k| 57  13   0   0   0 |110 |  0   2 112   0 | 19
> >>> >> > > >  67   3   0 767k  78k   0   0   0   6  32   6   0   0  22  22   0  32 |1.1k   0   0 | 78 3.9k 141   0 |767k 768k| 67   4   0   0   0 |110 |  0   2  56   0 |  2
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > >
> >>> >> > > > Any ideas where to look?
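(For completeness: the per-pool split of that client IO can be confirmed directly with per-pool stats rather than eyeballing ceph -s. The pool name below is just an example of what a CephFS metadata pool is often called:

    watch -n 2 ceph osd pool stats cephfs_metadata

ceph osd pool stats reports client read/write rates per pool, which makes it easy to see how much of the 38 MiB/s is really metadata traffic.)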
> >>> >> > >
> >>> >> > > Check the perf dump output of the mds:
> >>> >> > >
> >>> >> > > ceph tell mds.<fs_name>:0 perf dump
> >>> >> > >
> >>> >> > > over a period of time to identify what's going on. You can also look at the objecter_ops (another tell command) for the MDS.
> >>> >> > >
> >>> >> > > --
> >>> >> > > Patrick Donnelly, Ph.D.
> >>> >> > > He / Him / His
> >>> >> > > Principal Software Engineer
> >>> >> > > Red Hat, Inc.
> >>> >> > > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >>> >>
> >>> >> --
> >>> >> Cheers,
> >>> >> Venky
> >>>
> >>> --
> >>> Cheers,
> >>> Venky

--
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io