You could strace the process to see precisely what ceph-osd is doing to provoke the EIO. -Sam
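A minimal sketch of that suggestion (the `pgrep` pattern and output path are examples, not anything Ceph-specific; pick the right ceph-osd PID on a multi-OSD host):

```shell
# Attach strace to a running ceph-osd and all of its threads (-f), logging
# file-related syscalls. A failing open() will show the errno name, e.g.
# EIO or EUCLEAN ("Structure needs cleaning").
OSD_PID=$(pgrep -of ceph-osd)    # -o: oldest match; adjust to target the right OSD
sudo strace -f -p "$OSD_PID" -e trace=file -o /tmp/osd.strace

# After reproducing the crash (e.g. re-running rados bench), look for the
# failing calls and the exact paths involved:
grep -E '= -1 (EIO|EUCLEAN)' /tmp/osd.strace
```

Seeing which path and which syscall return the error narrows down whether it is the drive, the filesystem, or the OSD's own layout that is at fault.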
On Fri, Apr 29, 2016 at 9:03 AM, Somnath Roy <somnath....@sandisk.com> wrote:
> Check the system log and search for the corresponding drive. It should have the
> information about what is failing.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
> Sent: Friday, April 29, 2016 8:59 AM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> I can see that. I guess my question is: what would that be symptomatic of? How is
> it doing that on 6 different systems and on multiple OSDs?
>
> -----Original Message-----
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: Friday, April 29, 2016 8:57 AM
> To: Garg, Pankaj
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> Your fs is throwing an EIO on open.
> -Sam
>
> On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj <pankaj.g...@caviumnetworks.com> wrote:
>> Hi,
>>
>> I had a fully functional Ceph cluster with 3 x86 nodes and 3 ARM64 nodes,
>> each with 12 HDD drives and 2 SSD drives. All of these were initially
>> running Hammer, and were then successfully updated to Infernalis (9.2.0).
>>
>> I recently deleted all my OSDs and swapped my drives with new ones on the
>> x86 systems, and the ARM servers were swapped with different ones
>> (keeping the drives the same).
>>
>> I again provisioned the OSDs, keeping the same cluster and Ceph versions
>> as before. But now, every time I try to run RADOS bench, my OSDs start
>> crashing (on both the ARM and x86 servers).
>>
>> I'm not sure why this is happening on all 6 systems. On the x86, it's the
>> same Ceph bits as before, and the only thing different is the new drives.
>>
>> It's the same stack (pasted below) on all the OSDs too.
>>
>> Can anyone provide any clues?
>>
>> Thanks
>> Pankaj
>>
>>    -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 -- 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236 ==== osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1 ==== 981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420
>>    -13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>    -10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 0.000000, event: dispatched, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>     -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>>     -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284 tick_without_osd_lock
>>     -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 -- 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>>     -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 -- 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c7ba8000 con 0x5634c58dd760
>>     -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 -- 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20
>>     -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 -- 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con 0x5634c58dcf20
>>     -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) write couldn't open 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117) Structure needs cleaning
>>     -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) error (117) Structure needs cleaning not handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from 0)
>>     -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
>>      0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In function 'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*, Index*)' thread 7f1f11791700 time 2016-04-28 08:09:46.173250
>> os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)
>>
>>  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5634c02ec7eb]
>>  2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool, std::shared_ptr<FDCache::FD>*, Index*)+0x1191) [0x5634bffb2d01]
>>  3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]
>>  4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x5634bffc6f51]
>>  5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x5634bffcc404]
>>  6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x5634bffcc5c9]
>>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x5634c02de10e]
>>  8: (ThreadPool::WorkThread::entry()+0x10) [0x5634c02defd0]
>>  9: (()+0x8182) [0x7f1f1f91a182]
>>  10: (clone()+0x6d) [0x7f1f1dc6147d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
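For what it's worth, errno 117 in the log above is EUCLEAN, the code the kernel (typically XFS) returns when it detects on-disk corruption, so the filesystem itself is reporting damage rather than Ceph. A quick sketch to confirm the mapping on any Linux box (device name and mount point in the comments are examples only):

```shell
# errno 117 on Linux is EUCLEAN -- "Structure needs cleaning" -- the error
# XFS raises when it finds on-disk metadata corruption.
python3 -c 'import errno, os; print(errno.errorcode[117], "->", os.strerror(117))'
# prints: EUCLEAN -> Structure needs cleaning

# Following Somnath's suggestion, the kernel log should name the failing
# device (device pattern is an example):
#   dmesg -T | grep -iE 'xfs|i/o error|sd[b-z]'
# With the OSD stopped and the filesystem unmounted, a dry-run check would be:
#   xfs_repair -n /dev/sdb1
```

Since the same error appears across all six hosts right after a drive swap, the new drives (or a shared controller/firmware/cabling factor) are the natural first suspects before blaming Ceph itself.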