I have posted logs/strace from our OSDs with details to a ticket in the
Ceph bug tracker - see http://tracker.ceph.com/issues/21142. There you
can see exactly where the OSDs crash, which should help if someone
decides to debug it.
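
In case anyone wants to capture the same thing: a crashing OSD can be
traced roughly like this (the OSD id 12 and the output path are just
examples, adjust for your daemons; stop systemd first so it doesn't
keep respawning the daemon):

    systemctl stop ceph-osd@12
    strace -f -tt -o /tmp/osd.12.strace \
        /usr/bin/ceph-osd -f --cluster ceph --id 12 \
        --setuser ceph --setgroup ceph --debug-osd 20

Running the OSD in the foreground with -f also makes the assert show up
directly on the terminal rather than only in the OSD log.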
JZ
On 10/01/18 22:05, Josef Zelenka wrote:
Hi, today we had a disastrous crash - we are running a 3-node cluster
with 24 OSDs in total (8 per node), with SSDs for the block DB and HDDs
for the BlueStore data. The cluster is used as a radosgw backend,
storing a large number of thumbnails for a file hosting site - around
110m files in total.

We were adding an interface to the nodes, which required a restart, but
after restarting one of the nodes a lot of the OSDs were kicked out of
the cluster and rgw stopped working. At the moment we have a lot of PGs
down and unfound. Most OSDs can't be started (a few can, which is a
mystery) - they fail with FAILED assert(interval.last > last) and just
periodically restart. So far the cluster is broken and we can't bring
it back up. We tried fscking the OSDs with ceph-objectstore-tool (see
the example below), but it was no good. The root of all this seems to
be the FAILED assert(interval.last > last) error, but I can't find any
info on it or how to fix it. Has anyone here encountered it as well?
We're running Luminous on Ubuntu 16.04.
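
For reference, the fsck was roughly along these lines (the OSD id 12
and PG 1.2a below are placeholders; this must be run with the OSD
stopped):

    systemctl stop ceph-osd@12
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op fsck

    # before any more invasive repair attempts, PGs can be exported
    # from the stopped OSD for safekeeping:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 1.2a --op export --file /tmp/pg.1.2a.export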
Thanks
Josef Zelenka
Cloudevelops
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com