On Thu, Dec 15, 2022 at 9:32 AM Stolte, Felix <f.sto...@fz-juelich.de> wrote:
>
> Hi Patrick,
>
> we used your script to repair the damaged objects on the weekend and it went
> smoothly. Thanks for your support.
>
> We adjusted your script to scan for damaged files on a daily basis; the
> runtime is about 6h. Until Thursday last week, we had exactly the same 17
> files. On Thursday at 13:05 a snapshot was created and our active MDS
> crashed once at that time:
>
> 2022-12-08T13:05:48.919+0100 7f440afec700 -1
> /build/ceph-16.2.10/src/mds/ScatterLock.h: In function 'void
> ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f440afec700 time
> 2022-12-08T13:05:48.921223+0100
> /build/ceph-16.2.10/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state ==
> LOCK_XLOCK || state == LOCK_XLOCKDONE)
>
> 12 minutes later the unlink_local error crashes appeared again, this time
> with a new file. During debugging we noticed an MTU mismatch between the MDS
> (1500) and the client (9000), which uses a CephFS kernel mount. The client
> is also the one creating the snapshots, via mkdir in the .snap directory.
>
> We disabled snapshot creation for now, but really need this feature. I
> uploaded the MDS logs of the first crash along with the information above to
> https://tracker.ceph.com/issues/38452
>
> I would greatly appreciate it if you could answer the following question:
>
> Is the bug related to our MTU mismatch? We also fixed the MTU issue on the
> weekend by going back to 1500 on all nodes in the Ceph public network.

I doubt it.

> If you need a debug level 20 log of the ScatterLock for further analysis, I
> could schedule snapshots at the end of our workdays and increase the debug
> level for 5 minutes around snapshot creation.

This would be very helpful!

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io