Hi,
I had a fully functional Ceph cluster with 3 x86 nodes and 3 ARM64 nodes, each 
with 12 HDD drives and 2 SSD drives. All of these were initially running Hammer 
and were then successfully upgraded to Infernalis (9.2.0).
I recently deleted all my OSDs. The x86 systems had their drives swapped with 
new ones, and the ARM servers were swapped out for different ones (keeping the 
same drives).
I provisioned the OSDs again, keeping the same cluster and the same Ceph 
versions as before. But now, every time I run RADOS bench, my OSDs start 
crashing (on both the ARM and the x86 servers).
I'm not sure why this is happening on all 6 systems. On the x86 servers it's 
the same Ceph bits as before; the only thing that has changed is the new 
drives. All of the crashing OSDs show the same stack trace (pasted below).
Can anyone provide any clues?

Thanks
Pankaj

  -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 -- 192.168.240.117:6820/14377 
<== osd.93 192.168.240.116:6811/47080 1236 ==== 
osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1 
==== 981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420
   -13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq: 29404, 
time: 2016-04-28 08:09:45.423882, event: header_read, op: 
osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
   -12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq: 29404, 
time: 2016-04-28 08:09:45.423884, event: throttled, op: 
osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
   -11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq: 29404, 
time: 2016-04-28 08:09:45.423942, event: all_read, op: 
osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
   -10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq: 29404, 
time: 0.000000, event: dispatched, op: osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
    -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq: 29404, 
time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op: 
osd_repop(client.2794263.0:37721 284.6d4 
284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
    -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284 
tick_without_osd_lock
    -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 -- 
192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== 
osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 
(846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
    -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 -- 
192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply 
e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c7ba8000 con 
0x5634c58dd760
    -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 -- 10.18.240.117:6821/14377 
<== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 
2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8129400 con 
0x5634c58dcf20
    -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 -- 10.18.240.117:6821/14377 
--> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 
08:09:45.971751) v2 -- ?+0 0x5634c617d600 con 0x5634c58dcf20
    -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0 
filestore(/var/lib/ceph/osd/ceph-102) write couldn't open 
287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117) 
Structure needs cleaning
    -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0 
filestore(/var/lib/ceph/osd/ceph-102)  error (117) Structure needs cleaning not 
handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from 0)
    -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0 
filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
     0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In function 
'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*, Index*)' 
thread 7f1f11791700 time 2016-04-28 08:09:46.173250
os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)

ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0x5634c02ec7eb]
2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool, 
std::shared_ptr<FDCache::FD>*, Index*)+0x1191) [0x5634bffb2d01]
3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned long, 
ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]
4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, 
ThreadPool::TPHandle*)+0x2901) [0x5634bffc6f51]
5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long, 
ThreadPool::TPHandle*)+0x64) [0x5634bffcc404]
6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) 
[0x5634bffcc5c9]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x5634c02de10e]
8: (ThreadPool::WorkThread::entry()+0x10) [0x5634c02defd0]
9: (()+0x8182) [0x7f1f1f91a182]
10: (clone()+0x6d) [0x7f1f1dc6147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
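
PS: in case it helps anyone reading the trace, the "(117)" in the filestore 
lines above is EUCLEAN ("Structure needs cleaning") on Linux, i.e. the backing 
filesystem (commonly XFS for FileStore OSDs) reporting on-disk corruption, 
which FileStore then reports as "not handled" / "unexpected error code" and 
asserts on. A minimal check of the errno mapping (just an illustration, not 
Ceph code):

#include <errno.h>   /* EUCLEAN */
#include <stdio.h>
#include <string.h>  /* strerror() */

int main(void)
{
    /* On Linux, errno 117 is EUCLEAN; the backing filesystem (often XFS
     * for FileStore OSDs) returns it when it detects on-disk corruption.
     * FileStore does not handle it and hits the assert shown above. */
    printf("EUCLEAN = %d (%s)\n", EUCLEAN, strerror(EUCLEAN));
    return 0;
}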
