Hi, Cephers!

We have an issue on a Firefly production cluster: after a disk error, one OSD
dropped out of the cluster. For about half an hour, XFS async writes kept trying
to commit the XFS journal to the bad disk, and then the whole node went down with
"BUG: cpu## soft lockup". We suspect a bug or some strange interaction between
the XFS code, the LSI driver, and the LSI firmware, but as the cluster is in
production, investigation of the root cause will have to wait.
We restarted the node, rejoined 8 of the 10 OSDs to the cluster, and watched the
recovery process through Monday. One disk was physically dead and got kicked out
of its RAID (we use single-disk RAID0 volumes as OSDs); the other lost its RAID
metadata, which seems really strange.
But the issue is that now, with recovery almost complete, one PG has been stuck
in active+degraded state for ~12 hours, and Ceph doesn't try to recover it. In
pg query we can see only two OSDs. Restarting an OSD makes this PG go incomplete
for some time, since we run the pool with size 3 and min_size 2; after the OSD
rejoins, the PG returns to active+degraded without any attempt to backfill to
3 copies. What can you advise me to do with this PG to complete recovery?
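For reference, this is roughly what we have been running to inspect the stuck PG
(a sketch; the PG id 3.2a and OSD id 7 below are placeholders, not our real ids):

```shell
# Show which PGs are degraded/stuck and why
ceph health detail
ceph pg dump_stuck unclean

# Inspect the acting/up sets and recovery state of the stuck PG
# (substitute the real PG id for the placeholder 3.2a)
ceph pg 3.2a query

# Confirm OSD up/in status and CRUSH layout
ceph osd tree

# Ask the primary to re-peer the PG (placeholder OSD id)
ceph osd down 7
```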

Result of pg query:

http://pastebin.com/krfELqMs

Megov Igor
CIO, Yuterra
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com