I have a suspicion about what caused this. Can you restart one of the problem
OSDs with

debug osd = 20
debug filestore = 20
debug ms = 1

and attach the resulting log from startup to crash along with the osdmap binary 
(ceph osd getmap -o <mapfile>).
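Something along these lines should do it -- a rough sketch only; NN and the
output path are placeholders, and the restart line assumes the upstart
packaging on Ubuntu 14.04:

# /etc/ceph/ceph.conf on the affected host
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

# restart just the one daemon and let it hit the assert again;
# by default the log should land in /var/log/ceph/ceph-osd.NN.log
sudo restart ceph-osd id=NN

# grab the current osdmap binary to attach as well
ceph osd getmap -o /tmp/osdmap.bin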
-Sam

----- Original Message -----
From: "Scott Laird" <sc...@sigkill.org>
To: "Robert LeBlanc" <rob...@leblancnet.us>
Cc: "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" 
<ceph-users@lists.ceph.com>
Sent: Sunday, April 19, 2015 6:13:55 PM
Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer

Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just 
upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement. 
Rebooting didn't help, either. Still failing with the same error in the logs. 

On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc <rob...@leblancnet.us> wrote:

Did you upgrade from 0.92? If you did, did you flush the OSD journals before upgrading? 
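(For reference, that flush is done per OSD with the daemon stopped, roughly
like this -- NN is a placeholder for the OSD id, and the stop command assumes
the Ubuntu upstart packaging:

sudo stop ceph-osd id=NN
ceph-osd -i NN --flush-journal
)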

On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird <sc...@sigkill.org> wrote:

I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs die 
(and stay dead) with this error in the logs: 

2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7f61fa900900 time 2015-04-19 
11:53:36.794951 
osd/OSD.h: 716: FAILED assert(ret) 

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0xbc271b] 
2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f] 
3: (OSD::load_pgs()+0x1769) [0x6c35d9] 
4: (OSD::init()+0x71f) [0x6c4c7f] 
5: (main()+0x2860) [0x651fc0] 
6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5] 
7: /usr/bin/ceph-osd() [0x66aff7] 
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this. 

This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu 14.04. So 
far, every single server that I've upgraded has had at least one disk that has 
failed to restart with this error, and one has had several disks in this state. 

Restarting the OSD after it dies with this doesn't help. 

I haven't lost any data through this due to my slow rollout, but it's really 
annoying. 

Here are two full logs from OSDs on two different machines: 

https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log 
https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log 

Any suggestions? 


Scott

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
