Hi all, I'm running a small converged Ceph cluster (MONs, OSDs, and MDSes on the same 3 hosts, one of which is also a CephFS client). I've noticed that the co-located CephFS client sometimes seems to deadlock. There are other CephFS clients on the network, but they generally don't collide much with the local client in file/dir accesses.
I found this old thread that describes a similar situation: https://www.mail-archive.com/search?l=ceph-us...@lists.ceph.com&q=subject:%22Re%5C%3A+%5C%5Bceph%5C-users%5C%5D+co%5C-located+cephfs+client+deadlock%22&o=newest&f=1

The kernel version is 6.14.6 for the main hosts, 6.14.8 for a few other unrelated clients, and 6.3.2 for one older client. Ceph version is 19.2.2.

Ceph status was clean, and dmesg didn't show any warnings around the time of the hang, other than a few of these (which I always get and seem to be related to SELinux attributes, but appear to be harmless):

[3447915.841214] ceph: [667e28f4-8964-454b-b777-03ce12b67742 6505232]: sync 00000000f6e49a9a 100033726fb.152 during filling trace
[3447915.851171] SELinux: inode_doinit_use_xattr: getxattr returned 16 for dev=ceph ino=1099565573883

Restarting the MDS (letting it fail over to the standby-replay MDS) did not help, but after finding the above thread I noticed that all the requests were stuck on the same OSD:

> root@flamingo:~ -# cat /sys/kernel/debug/ceph/*/osdc
> REQUESTS 17 homeless 0
> 46841260 osd17 14.2ab058d9 14.19s0 [17,18,15,6,7,4,3,0,1]/17 [17,18,15,6,7,4,3,0,1]/17 e25593 10002e55314.0000003c 0x400014 1 read
> 46841268 osd17 14.571aa108 14.8s0 [17,16,15,0,1,2,4,5,7]/17 [17,16,15,0,1,2,4,5,7]/17 e25593 100030edef7.00000039 0x400014 1 read
> 46841311 osd17 14.7b4e2d92 14.12s0 [17,16,15,6,7,4,2,1,3]/17 [17,16,15,6,7,4,2,1,3]/17 e25593 1000318f566.00000179 0x400014 1 read
> 46841323 osd17 14.deec222d 14.2ds0 [17,16,18,4,7,5,2,1,0]/17 [17,16,18,4,7,5,2,1,0]/17 e25593 100030edf00.00000028 0x400014 1 read
> 46841334 osd17 14.eb940708 14.8s0 [17,16,15,0,1,2,4,5,7]/17 [17,16,15,0,1,2,4,5,7]/17 e25593 10003042460.000001c8 0x400014 1 read
> 46841336 osd17 14.93065908 14.8s0 [17,16,15,0,1,2,4,5,7]/17 [17,16,15,0,1,2,4,5,7]/17 e25593 1000318f566.00000181 0x400014 1 read
> 46841350 osd17 14.3e9eeaa5 14.25s0 [17,18,15,2,0,1,7,6,4]/17 [17,18,15,2,0,1,7,6,4]/17 e25593 10003498f94.00000000 0x400024 1 write
> 46841359 osd17 14.5120499 14.19s0 [17,18,15,6,7,4,3,0,1]/17 [17,18,15,6,7,4,3,0,1]/17 e25593 10002e53590.00000000 0x400014 1 read
> 46841367 osd17 14.25e874a5 14.25s0 [17,18,15,2,0,1,7,6,4]/17 [17,18,15,2,0,1,7,6,4]/17 e25593 1000318f566.00000196 0x400014 1 read
> 46841376 osd17 14.cd26e465 14.25s0 [17,18,15,2,0,1,7,6,4]/17 [17,18,15,2,0,1,7,6,4]/17 e25593 1000318f566.00000199 0x400014 1 read
> 46841405 osd17 14.855295f5 14.35s0 [17,18,15,3,0,2,4,7,5]/17 [17,18,15,3,0,2,4,7,5]/17 e25593 1000318f566.000001a7 0x400014 1 read
> 46841421 osd17 14.57d5ef2c 14.2cs0 [17,18,16,1,3,2,4,6,5]/17 [17,18,16,1,3,2,4,6,5]/17 e25593 10002fe07b1.00000008 0x400014 1 read
> 46841424 osd17 14.ea554059 14.19s0 [17,18,15,6,7,4,3,0,1]/17 [17,18,15,6,7,4,3,0,1]/17 e25593 1000318f566.000001ac 0x400014 1 read
> 46841429 osd17 14.fa91b525 14.25s0 [17,18,15,2,0,1,7,6,4]/17 [17,18,15,2,0,1,7,6,4]/17 e25593 100030916b5.0000004f 0x400014 1 read
> 46841447 osd17 14.88f24788 14.8s0 [17,16,15,0,1,2,4,5,7]/17 [17,16,15,0,1,2,4,5,7]/17 e25593 100032d95c7.00000011 0x400014 1 read
> 46841468 osd17 12.159b3dd2 12.12 [17,5,2]/17 [17,5,2]/17 e25593 10003307266.00000013 0x400014 1 read
> 46841496 osd17 12.a08092d2 12.12 [17,5,2]/17 [17,5,2]/17 e25593 10003493c3f.00000050 0x400014 1 read
> LINGER REQUESTS
> BACKOFFS

Restarting that OSD fixed it.
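For anyone hitting the same thing: tallying the osdc dump per OSD makes a common culprit obvious at a glance. Here's a minimal sketch of that (my assumptions: request lines have the tid in the first column and "osdN" in the second, as in the dump above; reading debugfs requires root):

```python
from collections import Counter
import glob

def count_requests_per_osd(osdc_text):
    """Tally in-flight requests per OSD from a /sys/kernel/debug/ceph/*/osdc dump."""
    counts = Counter()
    for line in osdc_text.splitlines():
        fields = line.split()
        # Request lines start with a numeric tid followed by "osdN";
        # headers like "REQUESTS", "LINGER REQUESTS", "BACKOFFS" are skipped.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].startswith("osd"):
            counts[fields[1]] += 1
    return counts

if __name__ == "__main__":
    # In real use, read the osdc file for every mounted client.
    for path in glob.glob("/sys/kernel/debug/ceph/*/osdc"):
        with open(path) as f:
            print(path, dict(count_requests_per_osd(f.read())))
```

Running something like this here would have pointed at osd17 immediately instead of me eyeballing the dump.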
Looking at the OSD log at the exact time of the hang (determined by other logs), I see this:

> 2025-07-18T06:39:06.992+0900 fffeed21bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.162732124s, lat = 5s cid =12.3_head oid =#12:c0ccf68d:::1000336b41c.0000009d:13d#
> 2025-07-18T06:39:06.992+0900 fffeed21bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.167396545s
> 2025-07-18T06:39:06.993+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.168415070s, txc = 0xaaabe3f57000
> 2025-07-18T06:39:14.834+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.340788364s, lat = 5s cid =12.14_head oid =#12:28e0c837:::10003307273.00000024:13d#
> 2025-07-18T06:39:14.835+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.342024803s
> 2025-07-18T06:39:14.843+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.349779129s, txc = 0xaaac19c75b00
> 2025-07-18T06:40:37.552+0900 fffeeb9ebf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.124079704s, lat = 5s cid =12.c_head oid =#12:340862a9:::100033073f1.0000002c:13d#
> 2025-07-18T06:40:37.560+0900 fffeeb9ebf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.132128239s
> 2025-07-18T06:40:37.568+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.140548706s, txc = 0xaaabf25b8000
> 2025-07-18T06:40:59.228+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.677710056s, lat = 5s cid =12.2_head oid =#12:40b82ac1:::1000348d274.00000008:13d#
> 2025-07-18T06:40:59.229+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.679266453s
> 2025-07-18T06:40:59.232+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.682381630s, txc = 0xaaac34479500
> 2025-07-18T06:41:01.799+0900 fffeeb9ebf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.117489815s, lat = 5s cid =12.c_head oid =#12:340e7251:::10003307273.00000027:13d#
> 2025-07-18T06:41:01.799+0900 fffeeb9ebf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.118617058s
> 2025-07-18T06:41:01.804+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.123104095s, txc = 0xaaabbf99de00
> 2025-07-18T06:42:28.863+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.191506386s, lat = 5s cid =12.3_head oid =#12:c2349a5a:::1000336b418.000000f0:13d#
> 2025-07-18T06:42:28.863+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.192711353s
> 2025-07-18T06:42:28.864+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.193384171s, txc = 0xaaabd1e51e00
> 2025-07-18T06:42:34.152+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.287574291s, lat = 5s cid =12.3_head oid =#12:c2bf668a:::10003307273.0000007b:13d#
> 2025-07-18T06:42:34.154+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.289901257s
> 2025-07-18T06:42:34.156+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.291684628s, txc = 0xaaabf9f90c00
> 2025-07-18T06:42:40.801+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:42:40.801+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> [ repeats ... ]
> 2025-07-18T06:44:15.954+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:15.954+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:44:27.591+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.561402321s, lat = 5s cid =12.0_head oid =#12:064b3406:::1000348bd03.0000004a:13d#
> 2025-07-18T06:44:27.594+0900 fffeeca0bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.565083027s
> 2025-07-18T06:44:27.595+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.565894604s, txc = 0xaaabe0e98c00
> 2025-07-18T06:44:29.442+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:29.442+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> [ repeats ... ]
> 2025-07-18T06:44:32.362+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:32.362+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 1 'started' : 1 ] most affected pool [ 'cephfs2_data_hr3' : 1 ])
> 2025-07-18T06:44:33.314+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:33.314+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 1 'started' : 1 ] most affected pool [ 'cephfs2_data_hr3' : 1 ])
> 2025-07-18T06:44:34.328+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:34.328+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 1 'started' : 1 ] most affected pool [ 'cephfs2_data_hr3' : 1 ])
> 2025-07-18T06:44:38.361+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841376 14.25s0 14.cd26e465 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:38.361+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 2 ] most affected pool [ 'cephfs2_data_hec5.4' : 2 ])
> 2025-07-18T06:44:39.369+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841376 14.25s0 14.cd26e465 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:39.369+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 2 ] most affected pool [ 'cephfs2_data_hec5.4' : 2 ])
> 2025-07-18T06:44:40.356+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 5 slow ops, oldest is osd_op(client.6505232.0:46841260 14.19s0 14.2ab058d9 (undecoded) ondisk+read+known_if_redirected e25583)
> 2025-07-18T06:44:40.356+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 5 slow requests (by type [ 'queued for pg' : 5 ] most affected pool [ 'cephfs2_data_hec5.4' : 5 ])
> 2025-07-18T06:44:41.324+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:44:41.324+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> [ repeats ... ]
> 2025-07-18T06:45:14.197+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:14.197+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:45:15.825+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.245476246s, lat = 5s cid =12.b_head oid =#12:d12c2386:::1000336b41a.00000095:13d#
> 2025-07-18T06:45:15.826+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.247073650s
> 2025-07-18T06:45:15.831+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.252201080s, txc = 0xaaabd1e50c00
> 2025-07-18T06:45:19.037+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:19.037+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:45:20.056+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:20.056+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:45:21.010+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:21.010+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:45:21.963+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:21.963+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:45:22.973+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841376 14.25s0 14.cd26e465 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:22.973+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 2 ] most affected pool [ 'cephfs2_data_hec5.4' : 2 ])
> 2025-07-18T06:45:23.958+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.6505232.0:46841376 14.25s0 14.cd26e465 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:45:23.958+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 2 slow requests (by type [ 'queued for pg' : 2 ] most affected pool [ 'cephfs2_data_hec5.4' : 2 ])
> 2025-07-18T06:45:32.861+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> [ repeats ... ]
> 2025-07-18T06:49:20.553+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:49:21.547+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:49:21.547+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:49:22.596+0900 ffff014cbf00 -1 osd.17 25584 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6505232.0:46841429 14.25s0 14.fa91b525 (undecoded) ondisk+read+known_if_redirected e25584)
> 2025-07-18T06:49:22.596+0900 ffff014cbf00 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'queued for pg' : 1 ] most affected pool [ 'cephfs2_data_hec5.4' : 1 ])
> 2025-07-18T06:49:48.204+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _remove, latency = 5.027801514s, lat = 5s cid =12.1a_head oid =#12:5b980d7b:::100033073d8.0000002a:13d#
> 2025-07-18T06:49:48.205+0900 fffeec1fbf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency slow operation observed for submit_transact, latency = 5.031510353s
> 2025-07-18T06:49:48.214+0900 fffef837bf00 0 bluestore(/var/lib/ceph/osd/ceph-17) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.040008068s, txc = 0xaaabe5c0cf00

Note that osd17 is on a different host (but some of the shards of the affected PGs are on the same host). After this 10-minute period, the logs are clean.

So it looks like one op (46841429) was reported as slow for about 10 minutes, then stopped being considered slow by the OSD, yet the kernel client never saw it complete, and it stayed stuck there. Other ops from that client are also mentioned, and this doesn't seem right to me: op 46841429 is the first one to become "slow", but then other ops like 46841376 are reported as the "oldest" op, and then it goes back to 46841429. Unless this is about different phases of the same op, it doesn't make sense that a younger op could be reported as slow, then an older op, then back to a younger op. If the older op is older, it should always be reported as slow first, right?

I checked all the other OSD logs and there's nothing interesting, other than a couple of other OSDs that also got slow ops at around the same time (but for a much shorter duration). Grafana stats show a snaptrim operation started at exactly that time and lasted 10 minutes.

The CPU stats show that this did reproduce the CPU usage/page-cache thrashing anomaly I previously mentioned on the list (high sys >> user CPU usage), but this time the thrashing only lasted 10 minutes, so I'd say this is unlikely to be directly caused by that bug. It could be, though, that the high load triggered slow ops in general, which then tickled another bug that caused the ops to get stuck.
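In case it helps anyone check my reading of the log: the "oldest" op reported by get_health_metrics can be pulled out programmatically to see the oscillation between 46841429 and 46841376. A rough sketch (my assumption: the regex matches the exact get_health_metrics line layout quoted above):

```python
import re

# Matches "... -1 osd.N E get_health_metrics ... oldest is osd_op(client.GID.0:TID ...",
# capturing the timestamp and the op tid.
PATTERN = re.compile(
    r"^(\S+)\s+\S+\s+-1\s+osd\.\d+\s+\d+\s+get_health_metrics.*"
    r"oldest is osd_op\(client\.\d+\.0:(\d+)"
)

def oldest_op_timeline(log_text):
    """Return a list of (timestamp, tid) showing which op each
    get_health_metrics report considered the oldest."""
    timeline = []
    for line in log_text.splitlines():
        m = PATTERN.match(line.strip())
        if m:
            timeline.append((m.group(1), int(m.group(2))))
    return timeline
```

Feeding the OSD log through this shows the reported "oldest" tid jumping 46841429 -> 46841376 -> 46841429, which is the part that puzzles me.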
Happy to help debug this if it happens again, if anyone has pointers. (I think this might not be the first time, but it's the first time I found that thread and learned about this issue specifically, and about the debug file to use to identify the culprit OSD.)

- Hector

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io