Hi,

Last night one of the PCI SSD drives that we use for OSD journals died, so we had to replace it, in this case with an 800GB SATA SSD. After recreating the journals, 9 of the 11 OSDs on the server no longer stay up: they start, but after a minute the OSD goes down.
Looking at the logs, I see that the service dies after a *** Caught signal (Aborted) ** message.

Extract from ceph-osd.6.log:

   -11> 2015-08-15 09:33:16.937820 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937294, event: header_read, op: pg_info(1 pgs e10407:18.2c)
   -10> 2015-08-15 09:33:16.937822 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937296, event: throttled, op: pg_info(1 pgs e10407:18.2c)
    -9> 2015-08-15 09:33:16.937826 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937369, event: all_read, op: pg_info(1 pgs e10407:18.2c)
    -8> 2015-08-15 09:33:16.937830 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937819, event: dispatched, op: pg_info(1 pgs e10407:18.2c)
    -7> 2015-08-15 09:33:16.937834 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937834, event: waiting_for_osdmap, op: pg_info(1 pgs e10407:18.2c)
    -6> 2015-08-15 09:33:16.937837 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937837, event: started, op: pg_info(1 pgs e10407:18.2c)
    -5> 2015-08-15 09:33:16.937848 7f32e6167700 5 -- op tracker -- seq: 484, time: 2015-08-15 09:33:16.937848, event: done, op: pg_info(1 pgs e10407:18.2c)
    -4> 2015-08-15 09:33:16.937860 7f32e6167700 1 -- 172.18.4.6:6800/21878 <== osd.11 172.18.4.7:6840/7934 44 ==== pg_info(1 pgs e10407:21.78) v4 ==== 759+0+0 (2132192505 0 0) 0x1da59fe0 con 0x15e9a520
    -3> 2015-08-15 09:33:16.937869 7f32e6167700 5 -- op tracker -- seq: 485, time: 2015-08-15 09:33:16.928344, event: header_read, op: pg_info(1 pgs e10407:21.78)
    -2> 2015-08-15 09:33:16.937871 7f32e6167700 5 -- op tracker -- seq: 485, time: 2015-08-15 09:33:16.928346, event: throttled, op: pg_info(1 pgs e10407:21.78)
    -1> 2015-08-15 09:33:16.937876 7f32e6167700 5 -- op tracker -- seq: 485, time: 2015-08-15 09:33:16.928388, event: all_read, op: pg_info(1 pgs e10407:21.78)
     0> 2015-08-15 09:33:16.937829 7f32de958700 -1 *** Caught signal (Aborted) ** in thread 7f32de958700

I have the full log in case you need more information.

Thank you for your help.

--
*Francisco J. Araya Maggiolo*
Devops Engineer & Cloud Specialist
KIO Networks Mexico City
Phone: +52 (55) 8503 2600 ext. 3901
Mobile: +52 (1) (55) 6066 9025
http://www.kionetworks.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com