Hello everyone,

I have a cluster with 5 hosts and 18 OSDs. Today I ran into an unexpected 
issue where multiple OSDs went down.

The first OSD to go down was osd.8. A few minutes later another OSD on the 
same host, osd.1, went down as well. I tried restarting both OSDs (osd.8 and 
osd.1), but that didn't work, so I decided to mark them out of the cluster 
and wait for the recovery to complete.

During the recovery, two more OSDs went down: osd.6 on another host and, 
seconds later, osd.0 on the same host as the first OSD that went down.

Looking at the “ceph -w” output, I noticed some slow/stuck ops, so I decided 
to stop writes to the cluster. After that I restarted OSDs 0 and 6; both 
came back UP, and I was able to wait for the recovery to finish, which it 
did successfully.

I noticed that when the first OSD went down, the cluster was performing a 
deep-scrub, and I found the trace below in the osd.8 logs. Can anyone help 
me understand why osd.8 and the other OSDs went down unexpectedly?

Below is the osd.8 trace:

    -2> 2015-03-03 16:31:48.191796 7f91a388b700  5 -- op tracker -- seq: 2633606, time: 2015-03-03 16:31:48.191796, event: done, op: osd_op(client.3880912.0:2368430 notify.6 [watch ping cookie 140352686583296] 40.97c520d4 ack+write+known_if_redirected e4231)
    -1> 2015-03-03 16:31:48.192174 7f91af8a3700  1 -- 10.32.30.11:6804/3991 <== client.3880912 10.32.30.10:0/1001424 282597 ==== ping magic: 0 v1 ==== 0+0+0 (0 0 0) 0x3333f500 con 0x1535c580
     0> 2015-03-03 16:31:48.251131 7f91a0084700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)' thread 7f91a0084700 time 2015-03-03 16:31:48.169895
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())

 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2]
 2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
 3: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x9698ba]
 4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
 5: (PG::scrub_compare_maps()+0x511) [0x90f0d1]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x204) [0x910bb4]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x912c53]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7ebdd3]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcbade9]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xcbbfe0]
 11: (()+0x6b50) [0x7f91bfe46b50]
 12: (clone()+0x6d) [0x7f91be8627bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
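
In case it helps anyone triaging the same crash, here is a minimal Python sketch that pulls the failed assertion and the backtrace frames out of log text like the trace above. The names (`parse_crash`, `ASSERT_RE`, `FRAME_RE`) are hypothetical, and the patterns are inferred only from this one trace, not from any documented Ceph log format. It also assumes each backtrace frame sits on a single line, so frames wrapped by a mail client would need to be re-joined first.

```python
import re

# Patterns inferred from the osd.8 trace in this post only (an assumption,
# not a documented Ceph log format).
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)$"
)
FRAME_RE = re.compile(
    r"^(?P<idx>\d+): \((?P<symbol>.*)\+0x[0-9a-f]+\) \[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_crash(log_text):
    """Return the failed assertion and backtrace frames found in log_text.

    The result is a dict with:
      "assert": (source_file, line_number, expression) or None
      "frames": list of (frame_index, symbol, address) tuples
    """
    failed = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        m = ASSERT_RE.match(line)
        if m and failed is None:
            # Keep only the first assertion seen in the text.
            failed = (m.group("file"), int(m.group("line")), m.group("expr"))
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group("idx")), m.group("symbol"), m.group("addr")))
    return {"assert": failed, "frames": frames}


# A few lines copied from the trace above, with wrapped frames re-joined.
SAMPLE = """\
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())

 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2]
 2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
 4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
"""

if __name__ == "__main__":
    crash = parse_crash(SAMPLE)
    print(crash["assert"])
    for idx, symbol, addr in crash["frames"]:
        print(idx, symbol, addr)
```

Running it over the logs of each OSD that went down would show whether they all hit the same assert in the scrub path, which is what I would like to confirm.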


Regards,

Italo Santos
http://italosantos.com.br/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
