Hi Igor,

Thanks for the valuable advice! Just to report back: it was indeed a single OSD 
causing the issues, which I could triangulate as you described. Since removing 
that OSD, the slow ops have not recurred.

Best regards,
Tim

> On 1 Oct 2024, at 12:42, Igor Fedotov <igor.fedo...@croit.io> wrote:
> 
> Hi Tim,
> 
> First of all, given the provided logs, all the slow operations are stuck in the 
> 'waiting for sub ops' state.
> 
> This apparently means that the reporting OSDs aren't suffering from local issues 
> but are stuck on replication operations to their peer OSDs.
> 
> In my experience, even a single "faulty" OSD can cause such issues for multiple 
> other daemons. The way to troubleshoot is to identify the actual culprit OSD(s).
> 
> To do that, one might try the following approach:
> 
> 1. While the issue is happening (or shortly after), run the 'ceph daemon osd.N 
> dump_historic_ops' (or even 'dump_ops_in_flight') command against the OSDs 
> reporting slow operations (see the command sketch after step 4).
> 
> 2. From the above reports, pick operations with an extraordinarily high duration, 
> e.g. > 5 seconds, and note the PG ids they were run against, e.g. PG = 1.a in 
> the following sample:
> 
>             "description": "osd_op(client.24184.0:23 >>>>1.a<<<<< 
> 1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint 
> object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] 
> ondisk+write+known_if_redirected+supports_pool_eio e19)",
> 
> 3. For the affected PG(s), learn which OSDs back them, e.g. by running 
> 'ceph pg map <pgid>'.
> 
> 4. If the different PGs from the above step share a specific OSD that is common 
> to all (or the majority) of them, that OSD is highly likely a good candidate for 
> additional investigation, particularly inspection of the relevant OSD logs.
> 
> 
> Thanks,
> 
> Igor
