Could you check dmesg? I think there is a disk EIO error.
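
The assert in the quoted trace fails when `got == -5`, and errno 5 is EIO on Linux, so the read in FileStore::read most likely got an I/O error back from the disk. As a rough sketch of what to look for in the kernel log, the filter below picks out lines that resemble block-layer I/O errors. The patterns are my assumption about typical kernel messages, not an exhaustive list, and `find_io_errors` is just an illustrative helper name.

```python
import re

# Assumed patterns: common wording of Linux kernel disk error messages.
# Adjust them to whatever your kernel and driver actually print.
IO_ERROR_PATTERN = re.compile(
    r"I/O error|Medium Error|Unrecovered read error",
    re.IGNORECASE,
)


def find_io_errors(dmesg_lines):
    """Return the lines (without trailing newline) that look like disk I/O errors."""
    return [
        line.rstrip("\n")
        for line in dmesg_lines
        if IO_ERROR_PATTERN.search(line)
    ]
```

Feed it the saved output of `dmesg`, one line per element, for example `find_io_errors(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file holding that output. Any hits that name the device behind osd.19 would point to a failing disk.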

On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash...@gmail.com> wrote:

> Hi,
>
> One of several OSDs on the same machine has crashed several times within a
> few days. It's always that one; the other OSDs are all fine. Below is the
> dumped message. Since it is too long, I pasted only the head and tail of the
> recent events. If it's necessary to inspect the full log, please see
> https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.
>
> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t,
> ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24
> 18:52:06.213123
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio ||
> got != -5)
>
>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbc9195]
>  2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>  3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
> ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
> std::allocator<hobject_t> > const&, bool, unsigned int,
> ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
> unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>  6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
> [0x7df722]
>  7: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
> ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>  10: (()+0x7dc5) [0x7f309cd26dc5]
>  11: (clone()+0x6d) [0x7f309b80821d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- begin dump of recent events ---
> -10000> 2016-10-24 18:51:34.341035 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
>  -9999> 2016-10-24 18:51:34.341046 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
>  -9998> 2016-10-24 18:51:34.341058 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
>  -9997> 2016-10-24 18:51:34.341069 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
>  -9996> 2016-10-24 18:51:34.341080 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
>  -9995> 2016-10-24 18:51:34.341090 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
>  -9994> 2016-10-24 18:51:34.341101 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
>  -9993> 2016-10-24 18:51:34.341113 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
>  -9992> 2016-10-24 18:51:34.341128 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
>  -9991> 2016-10-24 18:51:34.341139 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
>  -9990> 2016-10-24 18:51:34.341130 7f3088a48700  1 -- 10.3.149.62:0/25857
> <== osd.1 10.3.149.55:6835/2010188 187557 ==== osd_ping(ping_reply e3014
> stamp 2016-10-24 18:51:34.340550) v2 ==== 47+0+0 (1550182756 0 0)
> 0x1a83bc00 con 0x7874580
>  -9989> 2016-10-24 18:51:34.341151 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
>  -9988> 2016-10-24 18:51:34.341162 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
>  -9987> 2016-10-24 18:51:34.341174 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
>  -9986> 2016-10-24 18:51:34.341186 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
>  -9985> 2016-10-24 18:51:34.341208 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
>  -9984> 2016-10-24 18:51:34.341231 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.63:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0xa67da00 con 0x7874c60
>  -9983> 2016-10-24 18:51:34.341249 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.58:6809/2023604 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x22111000 con 0x17887860
>  -9982> 2016-10-24 18:51:34.341262 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.63:6811/2023604 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x1fe62200 con 0x17887de0
>  -9981> 2016-10-24 18:51:34.341281 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.58:6802/2023892 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x1fc32c00 con 0x24246100
>  -9980> 2016-10-24 18:51:34.341297 7f307b22d700  1 -- 10.3.149.62:0/25857
> --> 10.3.149.63:6801/2023892 -- osd_ping(ping e3014 stamp 2016-10-24
> 18:51:34.340550) v2 -- ?+0 0x20544c00 con 0x24246d60
> .
> .
> .
>    -20> 2016-10-24 18:52:05.273121 7f3086243700  1 --
> 10.3.149.57:6811/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x27c1a600 con 0x1744aaa0
>    -19> 2016-10-24 18:52:05.273129 7f3087a46700  1 --
> 10.3.149.62:6810/25857 <== osd.1 10.3.149.60:0/10188 187279 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.212809) v2 ==== 47+0+0
> (387409057 0 0) 0x1ff4f600 con 0x175b1860
>    -18> 2016-10-24 18:52:05.273157 7f3087a46700  1 --
> 10.3.149.62:6810/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x10d73a00 con 0x175b1860
>    -17> 2016-10-24 18:52:05.641202 7f3086243700  1 --
> 10.3.149.57:6811/25857 <== osd.29 10.3.149.59:0/35501 187818 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0
> (3027252596 0 0) 0x9d0a200 con 0x175172e0
>    -16> 2016-10-24 18:52:05.641209 7f3087a46700  1 --
> 10.3.149.62:6810/25857 <== osd.29 10.3.149.59:0/35501 187818 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0
> (3027252596 0 0) 0xa27ba00 con 0x264422c0
>    -15> 2016-10-24 18:52:05.641246 7f3086243700  1 --
> 10.3.149.57:6811/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1b8a6200 con 0x175172e0
>    -14> 2016-10-24 18:52:05.641290 7f3087a46700  1 --
> 10.3.149.62:6810/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1ff4f600 con 0x264422c0
>    -13> 2016-10-24 18:52:05.689610 7f3086243700  1 --
> 10.3.149.57:6811/25857 <== osd.13 10.3.149.56:0/5402 187624 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0
> (1310408758 0 0) 0x1be24600 con 0x15268b00
>    -12> 2016-10-24 18:52:05.689664 7f3086243700  1 --
> 10.3.149.57:6811/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0x9d0a200 con 0x15268b00
>    -11> 2016-10-24 18:52:05.689661 7f3087a46700  1 --
> 10.3.149.62:6810/25857 <== osd.13 10.3.149.56:0/5402 187624 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0
> (1310408758 0 0) 0x19705600 con 0x175b1de0
>    -10> 2016-10-24 18:52:05.689729 7f3087a46700  1 --
> 10.3.149.62:6810/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0xa27ba00 con 0x175b1de0
>     -9> 2016-10-24 18:52:05.861925 7f3086243700  1 --
> 10.3.149.57:6811/25857 <== osd.4 10.3.149.60:0/12742 187653 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0
> (350590821 0 0) 0x12169400 con 0x17514000
>     -8> 2016-10-24 18:52:05.861957 7f3086243700  1 --
> 10.3.149.57:6811/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x1be24600 con 0x17514000
>     -7> 2016-10-24 18:52:05.861963 7f3087a46700  1 --
> 10.3.149.62:6810/25857 <== osd.4 10.3.149.60:0/12742 187653 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0
> (350590821 0 0) 0x269fba00 con 0x26442840
>     -6> 2016-10-24 18:52:05.862015 7f3087a46700  1 --
> 10.3.149.62:6810/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x19705600 con 0x26442840
>     -5> 2016-10-24 18:52:05.882605 7f3094bb6700  5 osd.19 3014 tick
>     -4> 2016-10-24 18:52:05.988572 7f3086243700  1 --
> 10.3.149.57:6811/25857 <== osd.25 10.3.149.58:0/24382 187898 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0
> (3778423740 0 0) 0xae91200 con 0x177bb760
>     -3> 2016-10-24 18:52:05.988582 7f3087a46700  1 --
> 10.3.149.62:6810/25857 <== osd.25 10.3.149.58:0/24382 187898 ====
> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0
> (3778423740 0 0) 0x1a396000 con 0x1526bc80
>     -2> 2016-10-24 18:52:05.988608 7f3086243700  1 --
> 10.3.149.57:6811/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x12169400 con 0x177bb760
>     -1> 2016-10-24 18:52:05.988652 7f3087a46700  1 --
> 10.3.149.62:6810/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply
> e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x269fba00 con 0x1526bc80
>      0> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time
> 2016-10-24 18:52:06.213123
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio ||
> got != -5)
>
>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbc9195]
>  2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>  3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
> ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
> std::allocator<hobject_t> > const&, bool, unsigned int,
> ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
> unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>  6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
> [0x7df722]
>  7: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
> ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>  10: (()+0x7dc5) [0x7f309cd26dc5]
>  11: (clone()+0x6d) [0x7f309b80821d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.19.log
> --- end dump of recent events ---
>
> Since the ceph-osd objdump is too large to put in a mail, I will not attach
> it, but if it is needed I'll find a way to share it. What might be the
> cause? Can anyone help me with this? Thanks.
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
