Hi Ceph users!

I have a Ceph cluster here: 6 nodes, 12 OSDs on HDD and SSD disks, with all
OSD journals on SSDs; 25 various HDDs in total.

We have had several HDD failures in the past, but each time it was only the
HDD that failed and never anything journal related. After replacing the HDD
and running the recovery procedures, everything worked again.

But now we have had a double SSD failure: two SSDs hosting journals went
down, so we lost 5 journals in total (out of 12).

We then created new journals on another HDD and added the OSDs back to the
cluster. Ceph started the recovery procedure, and everything looked good
until 10 unfound objects were reported. I tried to revert them with
ceph pg <PG> mark_unfound_lost revert, but that was unsuccessful, so I
deleted them instead. From that moment on, two OSDs started crashing
frequently with this backtrace:

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x256) [0x5612e2e6c946]
6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
[0x5612e28e48ee]
7: (ReplicatedPG::hit_set_persist()+0xd7d) [0x5612e28ead9d]
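For reference, the unfound-object handling above was roughly the following
sequence (Jewel-era command names; <PG> is a placeholder, not a real PG id):

```shell
# Show which PGs report unfound objects
ceph health detail

# Inspect the missing/unfound objects in one of those PGs
ceph pg <PG> list_missing

# First attempt: revert the unfound objects to their previous versions
ceph pg <PG> mark_unfound_lost revert

# What I ended up doing instead: delete the unfound objects
ceph pg <PG> mark_unfound_lost delete
```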

It is worth mentioning that cache tiering is currently enabled on a separate pool.

I'm trying to flush and evict the cache, but it is taking ages because of
errors like this:

2016-12-08 11:52:00.007730 7f9e816bd700  0 -- NODE.2:0/3005344741 >>
NODE.8:6802/17445 pipe(0x7f9e780161e0 sd=8 :0 s=1 pgs=0 cs=0 l=1
c=0x7f9e7800ecf0).fault
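For completeness, the flush/evict I'm running is essentially this
(<cache-pool> is a placeholder; whether cache-mode forward needs an extra
confirmation flag depends on the release):

```shell
# Stop new writes from landing in the cache tier
ceph osd tier cache-mode <cache-pool> forward

# Flush dirty objects and evict everything from the cache pool
rados -p <cache-pool> cache-flush-evict-all
```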

Every time such an error occurs, the OSD on NODE.8 goes down. I suspect
there is some inconsistency between the cache tier and the backing data
OSDs because of the SSD failure. My guess is that the cache can't be
flushed until the data is recovered, while the data can't be recovered
because the cache hasn't been flushed yet and is inconsistent:

log_channel(cluster) log [WRN] : pg 10.33 has invalid (post-split) stats;
must scrub before tier agent can activate
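Based on that warning, the obvious next step I can see is to scrub the PG
manually so the tier agent can activate (10.33 taken from the message above):

```shell
# Scrub the PG named in the warning
ceph pg scrub 10.33

# Or force a deep scrub if a normal scrub doesn't clear it
ceph pg deep-scrub 10.33
```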

Any ideas?

-- 
Wojtek
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
