Could you check dmesg? I think there is a disk EIO error.

On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash...@gmail.com> wrote:
> Hi,
>
> One of several OSDs on the same machine crashed several times within days.
> It's always that one; the other OSDs are all fine. Below is the dumped
> message. Since it's too long to include here, I only pasted the head and
> tail of the recent events. If it's necessary to inspect the full log,
> please see
> https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.
>
> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
>
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
> 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
> 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
> 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
> 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
> 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
> 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
> 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
> 10: (()+0x7dc5) [0x7f309cd26dc5]
> 11: (clone()+0x6d) [0x7f309b80821d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
> -10000> 2016-10-24 18:51:34.341035 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
> -9999> 2016-10-24 18:51:34.341046 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
> -9998> 2016-10-24 18:51:34.341058 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
> -9997> 2016-10-24 18:51:34.341069 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
> -9996> 2016-10-24 18:51:34.341080 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
> -9995> 2016-10-24 18:51:34.341090 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
> -9994> 2016-10-24 18:51:34.341101 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
> -9993> 2016-10-24 18:51:34.341113 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
> -9992> 2016-10-24 18:51:34.341128 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
> -9991> 2016-10-24 18:51:34.341139 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
> -9990> 2016-10-24 18:51:34.341130 7f3088a48700 1 -- 10.3.149.62:0/25857 <== osd.1 10.3.149.55:6835/2010188 187557 ==== osd_ping(ping_reply e3014 stamp 2016-10-24 18:51:34.340550) v2 ==== 47+0+0 (1550182756 0 0) 0x1a83bc00 con 0x7874580
> -9989> 2016-10-24 18:51:34.341151 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
> -9988> 2016-10-24 18:51:34.341162 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
> -9987> 2016-10-24 18:51:34.341174 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
> -9986> 2016-10-24 18:51:34.341186 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
> -9985> 2016-10-24 18:51:34.341208 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
> -9984> 2016-10-24 18:51:34.341231 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xa67da00 con 0x7874c60
> -9983> 2016-10-24 18:51:34.341249 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6809/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x22111000 con 0x17887860
> -9982> 2016-10-24 18:51:34.341262 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6811/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fe62200 con 0x17887de0
> -9981> 2016-10-24 18:51:34.341281 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6802/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fc32c00 con 0x24246100
> -9980> 2016-10-24 18:51:34.341297 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6801/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x20544c00 con 0x24246d60
> .
> .
> .
> -20> 2016-10-24 18:52:05.273121 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x27c1a600 con 0x1744aaa0
> -19> 2016-10-24 18:52:05.273129 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.1 10.3.149.60:0/10188 187279 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.212809) v2 ==== 47+0+0 (387409057 0 0) 0x1ff4f600 con 0x175b1860
> -18> 2016-10-24 18:52:05.273157 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x10d73a00 con 0x175b1860
> -17> 2016-10-24 18:52:05.641202 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0x9d0a200 con 0x175172e0
> -16> 2016-10-24 18:52:05.641209 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0xa27ba00 con 0x264422c0
> -15> 2016-10-24 18:52:05.641246 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1b8a6200 con 0x175172e0
> -14> 2016-10-24 18:52:05.641290 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1ff4f600 con 0x264422c0
> -13> 2016-10-24 18:52:05.689610 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x1be24600 con 0x15268b00
> -12> 2016-10-24 18:52:05.689664 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0x9d0a200 con 0x15268b00
> -11> 2016-10-24 18:52:05.689661 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x19705600 con 0x175b1de0
> -10> 2016-10-24 18:52:05.689729 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0xa27ba00 con 0x175b1de0
> -9> 2016-10-24 18:52:05.861925 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x12169400 con 0x17514000
> -8> 2016-10-24 18:52:05.861957 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x1be24600 con 0x17514000
> -7> 2016-10-24 18:52:05.861963 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x269fba00 con 0x26442840
> -6> 2016-10-24 18:52:05.862015 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x19705600 con 0x26442840
> -5> 2016-10-24 18:52:05.882605 7f3094bb6700 5 osd.19 3014 tick
> -4> 2016-10-24 18:52:05.988572 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0xae91200 con 0x177bb760
> -3> 2016-10-24 18:52:05.988582 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0x1a396000 con 0x1526bc80
> -2> 2016-10-24 18:52:05.988608 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x12169400 con 0x177bb760
> -1> 2016-10-24 18:52:05.988652 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x269fba00 con 0x1526bc80
> 0> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
>
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
> 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
> 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
> 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
> 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
> 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
> 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
> 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
> 10: (()+0x7dc5) [0x7f309cd26dc5]
> 11: (clone()+0x6d) [0x7f309b80821d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_replay
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 keyvaluestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> 0/ 0 refs
> 1/ 5 xio
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.19.log
> --- end dump of recent events ---
>
> Since the ceph-osd objdump is too large to put in a mail, I will not attach
> it, but if it is needed I'll find a way to share it. What might be the
> cause? Can anyone help me with this? Thanks.
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
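
For anyone decoding the failed assertion in the quoted trace: -5 is the negated errno value EIO on Linux, which is why a disk read error is the first suspect. The sketch below only mirrors the boolean expression printed in the log. The function name is hypothetical, and reading allow_eio as a caller-supplied flag, m_filestore_fail_eio as a configuration switch, and got as the return value of the read is an assumption based on the names, not something confirmed in this thread.

```python
import errno


def read_assert_holds(allow_eio, filestore_fail_eio, got):
    """Hypothetical mirror of the logged expression:
    assert(allow_eio || !m_filestore_fail_eio || got != -5)
    Returns True when the assertion would pass, False when it would fire.
    """
    return allow_eio or (not filestore_fail_eio) or got != -errno.EIO


# Assumed crash case: the caller does not tolerate EIO, fail-on-EIO is
# enabled, and the read came back with -EIO.
print(read_assert_holds(False, True, -errno.EIO))
```

Under those assumptions the assertion can only fire when all three conditions line up, so the trace by itself points at the read returning an I/O error rather than at the scrub logic above it in the stack.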
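
The dmesg check suggested at the top of this reply can be narrowed down with a small filter. This is a rough sketch, assuming the kernel messages were first saved to a text file (for example with `dmesg > dmesg.txt`; the file name is only an example). The two patterns are guesses at common error wording, since the exact text varies by kernel version and driver, and the helper name is hypothetical.

```python
import re

# Assumed patterns only; extend them to match what your kernel actually logs.
IO_ERROR_PATTERN = re.compile(r"i/o error|medium error", re.IGNORECASE)


def find_io_errors(lines):
    """Return the lines (without trailing newline) that match the assumed patterns."""
    return [line.rstrip("\n") for line in lines if IO_ERROR_PATTERN.search(line)]
```

A possible way to use it is `with open("dmesg.txt") as f: print("\n".join(find_io_errors(f)))`. If any matching line names the device that backs osd.19, that would support the EIO theory; an empty result does not rule the disk out.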