Hi, today we had a disastrous crash. We are running a 3-node cluster
with 24 OSDs in total (8 per node), with SSDs for the block.db and HDDs
for the BlueStore data.
This cluster is used as a radosgw backend, storing a large number of
thumbnails for a file hosting site - around 110 million files in total.
We were
adding an interface to the nodes which required a restart, but after
restarting one of the nodes, a lot of the OSDs were kicked out of the
cluster and rgw stopped working. We currently have a lot of PGs down and
unfound. Most OSDs can't be started (a few do start, which is a
mystery); they fail with the error FAILED assert(interval.last > last)
and just restart periodically. So far, the cluster is broken and we
can't seem to bring it
back up. We tried running fsck on the OSDs with ceph-objectstore-tool,
but it didn't help. The root cause seems to be the FAILED
assert(interval.last > last) error, but I can't find any information
about it or how to fix it. Has anyone here encountered it?
We're running Luminous on Ubuntu 16.04.
Thanks
Josef Zelenka
Cloudevelops
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com