Hi all!

We are seeing strange behavior on two clusters at work (both Ceph v15.2.9 on 
CentOS 7.9):


  *   In the 1st cluster we get errors about multiple degraded PGs, all of them 
linked to a "rogue" OSD with a very large ID ("osd.2147483647"). This OSD does 
not show up in "ceph osd tree" and, even weirder, it is not always present: it 
appears roughly every 5 to 10 minutes, and when it does, a lot of PGs become 
degraded.
  *   In the 2nd cluster we serve CephFS. After some user complaints, we saw 
that ceph-fuse tries to connect to some OSDs on the wrong network (the cluster 
network instead of the public network). This happens at random: about 90% of 
ceph-fuse connections to OSDs use the public network, but the rest try to 
reach the OSDs through the cluster network. As the cluster network is not 
reachable from the clients, these connections go stale, and the only way to 
recover is to "kill -9" the ceph-fuse mount.
  *   The last issue affects both clusters: when we add a new OSD, about half 
the time another OSD goes down on a different server, and the only way to bring 
it back up is to reweight it to 0, zap it and re-add it (which can in turn 
cause yet another OSD to fail...).
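
A side note on the first point: the "rogue" ID is exactly 2^31 - 1, the largest 
signed 32-bit integer (0x7fffffff), so it may well be a placeholder for "no OSD" 
rather than a real daemon. As far as I can tell this matches the CRUSH_ITEM_NONE 
constant, but that link is an assumption and not something we have confirmed on 
our clusters. A minimal Python sketch of the arithmetic check (the helper name 
looks_like_placeholder is hypothetical, not part of any Ceph tool):

```python
# Largest value representable by a signed 32-bit integer (0x7fffffff).
INT32_MAX = 2**31 - 1


def looks_like_placeholder(osd_name: str) -> bool:
    """Return True if an "osd.N" name carries the INT32_MAX id.

    Hypothetical helper for illustration only; it just compares the
    numeric suffix with INT32_MAX and does not talk to the cluster.
    """
    prefix, _, raw_id = osd_name.partition(".")
    return prefix == "osd" and raw_id.isdigit() and int(raw_id) == INT32_MAX


if __name__ == "__main__":
    print(looks_like_placeholder("osd.2147483647"))
    print(hex(INT32_MAX))
```

If the check holds, the degraded PGs would more likely have an empty slot in 
their up/acting set than an actual unknown OSD joining the cluster.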

Any insights, suggestions or feedback would be greatly appreciated!

Best regards,

--
Thierry
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
