You could strace the process to see precisely what ceph-osd is doing to provoke the EIO. -Sam
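A minimal sketch of that suggestion (the `pgrep` pattern and output path are examples, not anything Ceph-specific; pick the right ceph-osd PID on a multi-OSD host):

```shell
# Attach strace to a running ceph-osd and all of its threads (-f), logging
# file-related syscalls. A failing open() will show the errno name, e.g.
# EIO or EUCLEAN ("Structure needs cleaning").
OSD_PID=$(pgrep -of ceph-osd)    # -o: oldest match; adjust to target the right OSD
sudo strace -f -p "$OSD_PID" -e trace=file -o /tmp/osd.strace

# After reproducing the crash (e.g. re-running rados bench), look for the
# failing calls and the exact paths involved:
grep -E '= -1 (EIO|EUCLEAN)' /tmp/osd.strace
```

Seeing which path and which syscall return the error narrows down whether it is the drive, the filesystem, or the OSD's own layout that is at fault.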
On Fri, Apr 29, 2016 at 9:03 AM, Somnath Roy <somnath....@sandisk.com> wrote:
> Check the system log and search for the corresponding drive. It should have the
> information about what is failing.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
> Sent: Friday, April 29, 2016 8:59 AM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> I can see that. I guess my question is: what would that be symptomatic of? How is
> it doing that on 6 different systems and on multiple OSDs?
>
> -----Original Message-----
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: Friday, April 29, 2016 8:57 AM
> To: Garg, Pankaj
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> Your fs is throwing an EIO on open.
> -Sam
>
> On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj <pankaj.g...@caviumnetworks.com> wrote:
>> Hi,
>>
>> I had a fully functional Ceph cluster with 3 x86 nodes and 3 ARM64 nodes,
>> each with 12 HDD drives and 2 SSD drives. All of these were initially
>> running Hammer, and were then successfully updated to Infernalis (9.2.0).
>>
>> I recently deleted all my OSDs and swapped my drives with new ones on the
>> x86 systems, and the ARM servers were swapped with different ones
>> (keeping the drives the same).
>>
>> I again provisioned the OSDs, keeping the same cluster and Ceph versions
>> as before. But now, every time I try to run RADOS bench, my OSDs start
>> crashing (on both the ARM and x86 servers).
>>
>> I'm not sure why this is happening on all 6 systems. On the x86, it's the
>> same Ceph bits as before, and the only thing different is the new drives.
>>
>> It's the same stack (pasted below) on all the OSDs too.
>>
>> Can anyone provide any clues?
>>
>> Thanks
>> Pankaj
>>
>>    -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 -- 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236 ==== osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1 ==== 981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420
>>    -13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 0.000000, event: dispatched, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>     -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>     -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284 tick_without_osd_lock
>>     -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 -- 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>>     -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 -- 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c7ba8000 con 0x5634c58dd760
>>     -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 -- 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20
>>     -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 -- 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con 0x5634c58dcf20
>>     -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) write couldn't open 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117) Structure needs cleaning
>>     -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) error (117) Structure needs cleaning not handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from 0)
>>     -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
>>      0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In function 'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*, Index*)' thread 7f1f11791700 time 2016-04-28 08:09:46.173250
>> os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)
>>
>>  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5634c02ec7eb]
>>  2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool, std::shared_ptr<FDCache::FD>*, Index*)+0x1191) [0x5634bffb2d01]
>>  3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]
>>  4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x5634bffc6f51]
>>  5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x5634bffcc404]
>>  6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x5634bffcc5c9]
>>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x5634c02de10e]
>>  8: (ThreadPool::WorkThread::entry()+0x10) [0x5634c02defd0]
>>  9: (()+0x8182) [0x7f1f1f91a182]
>>  10: (clone()+0x6d) [0x7f1f1dc6147d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
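For what it's worth, errno 117 in the log above is EUCLEAN, the code the kernel (typically XFS) returns when it detects on-disk corruption, so the filesystem itself is reporting damage rather than Ceph. A quick sketch to confirm the mapping on any Linux box (device name and mount point in the comments are examples only):

```shell
# errno 117 on Linux is EUCLEAN -- "Structure needs cleaning" -- the error
# XFS raises when it finds on-disk metadata corruption.
python3 -c 'import errno, os; print(errno.errorcode[117], "->", os.strerror(117))'
# prints: EUCLEAN -> Structure needs cleaning

# Following Somnath's suggestion, the kernel log should name the failing
# device (device pattern is an example):
#   dmesg -T | grep -iE 'xfs|i/o error|sd[b-z]'
# With the OSD stopped and the filesystem unmounted, a dry-run check would be:
#   xfs_repair -n /dev/sdb1
```

Since the same error appears across all six hosts right after a drive swap, the new drives (or a shared controller/firmware/cabling factor) are the natural first suspects before blaming Ceph itself.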