Dear all,

this is a request to companies and consultants with Ceph development
experience for a contract to help us recover from our current file system
outage. If you can offer help, please send a PM directly back to me.
Below is a short description, with links to what we have found out so far:

We are experiencing a total loss of access to our Ceph file system because
MDS rank 2 hangs on startup. It does not crash. This rank has
.mds_cache.num_strays=99446248, and the initial processing of stray items leads
to a producer-consumer deadlock that prevents the rank from becoming
responsive. On startup, approximately 37 million stray items can be scheduled
for purging before the deadlock occurs.

The cluster is healthy otherwise. Ceph version is 16.2.15 
(618f440892089921c3e944a991122ddc44e60516) pacific (stable).

We have deployed a rescue server with a large amount of RAM and swap that we
can make accessible remotely. On this rescue server we have an MDS up on rank 2
for troubleshooting, and this is all working fine.

Related bug reports with more details:

https://tracker.ceph.com/issues/69547
https://www.spinics.net/lists/ceph-users/msg85394.html
https://www.spinics.net/lists/ceph-users/msg85480.html

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
