Hi Igor,

Thanks for the valuable advice! I just wanted to follow up: it was indeed a single OSD causing the issues, which I was able to triangulate as you suggested. Since removing that OSD, the slow ops haven't occurred anymore.
Best regards,
Tim

> On 1 Oct 2024, at 12:42, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
> Hi Tim,
>
> first of all - given the provided logs - all the slow operations are stuck in 'waiting for sub ops' state.
>
> Which apparently means that the reported OSDs aren't suffering from local issues but are stuck on replication operations to their peer OSDs.
>
> From my experience even a single "faulty" OSD can cause such issues for multiple other daemons. And the way to troubleshoot is to find out which OSD(s) are the actual culprit.
>
> To do that one might try the following approach:
>
> 1. When (or shortly after) the issue is happening - run 'ceph daemon osd.N dump_historic_ops' (or even 'dump_ops_in_flight') against the OSDs reporting slow operations.
>
> 2. From the above reports choose operations with extraordinarily high duration, e.g. > 5 seconds, and learn the PG ids they've been run against, e.g. PG = 1.a in the following sample:
>
> "description": "osd_op(client.24184.0:23 >>>>1.a<<<<< 1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19)",
>
> 3. For the affected PG(s) learn which OSDs are backing them, e.g. by running 'ceph pg map <pgid>'.
>
> 4. If the different PGs from the above step share one OSD that is common to all (or the majority) of them - highly likely it's a good candidate for additional investigation - particularly inspection of the relevant OSD logs.
>
>
> Thanks,
>
> Igor
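In case it helps anyone else hitting this, below is a rough shell sketch of how steps 1-4 can be strung together. It assumes jq is available, that it is run on the host(s) holding the listed OSDs ('ceph daemon' talks to the local admin socket; on recent releases 'ceph tell osd.N dump_historic_ops' may also work remotely), and that each entry in the historic-ops JSON exposes 'duration' (seconds) and 'description' fields - the exact output format can vary by release, so double-check against yours.

#!/bin/bash
# Rough sketch of steps 1-4: dump slow ops from the given OSDs, pull out
# the PG ids, then map each PG to its acting OSDs.
# Usage: ./find_culprit.sh 3 7 12   (ids of the OSDs reporting slow ops)

THRESHOLD=5   # seconds; ops slower than this are treated as suspicious

for osd in "$@"; do
    # Steps 1+2: fetch historic ops and keep only the slow ones
    ceph daemon "osd.$osd" dump_historic_ops 2>/dev/null \
      | jq -r --argjson t "$THRESHOLD" \
            '.ops[] | select(.duration > $t) | .description'
done \
  | awk '/^osd_op\(/ {print $2}' \
  | sort -u \
  | while read -r pg; do
        # Step 3: which OSDs back this PG?
        ceph pg map "$pg"
    done
# Step 4: look for an OSD id that appears in (nearly) every acting set -
# that one is the candidate for closer log inspection.
# Note: the awk filter only catches client osd_op descriptions; replication
# (osd_repop) entries would need a similar pattern.

Whichever OSD shows up in the acting set of almost every slow PG is the one to look at first, exactly as described in step 4.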