Hi,
I'm sure I'm doing something wrong, I hope someone can enlighten me...
I'm encountering many issues when I restart a ceph server (any ceph server).
This is on CentOS 7.2, ceph-0.94.6-0.el7.x86_64.
First: I have disabled abrt, since I don't need it.
But when I restart, I see these logs in the systemd-udevd journal:
Apr 21 18:00:14 ceph4._snip_ python[1109]: detected unhandled Python exception
in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1109]: can't communicate with ABRT daemon,
is it running? [Errno 2] No such file or directory
Apr 21 18:00:14 ceph4._snip_ python[1174]: detected unhandled Python exception
in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1174]: can't communicate with ABRT daemon,
is it running? [Errno 2] No such file or directory
How can I possibly debug these exceptions?
Could that be related to the OSD hook I'm using to place the SSDs under a
different root in the CRUSH map? (The hook is a bash script, but it calls a
Python helper I wrote, which uses megacli to identify the SSDs behind a
non-JBOD controller... a tricky thing.)
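For reference, the Python helper boils down to something like this (a
simplified sketch, not the real script — the MegaCli path and the mapping from
OSD device to controller slot are assumptions; the real script does more work
to match /dev/sdX to a slot):

```python
#!/usr/bin/env python
# Hypothetical sketch of the crush-location helper: parse the output of
# `MegaCli -PDList -aALL` and report whether a given slot holds an SSD.
import subprocess

def is_ssd(pdlist_output, slot):
    """Return True if the physical drive in `slot` reports 'Solid State Device'."""
    current_slot = None
    for line in pdlist_output.splitlines():
        line = line.strip()
        if line.startswith("Slot Number:"):
            current_slot = int(line.split(":")[1])
        elif line.startswith("Media Type:") and current_slot == slot:
            return "Solid State Device" in line
    return False

def crush_root(slot):
    # The hook then prints "root=ssd ..." for SSDs so they land in a
    # separate CRUSH root; the MegaCli binary path is an assumption.
    out = subprocess.check_output(
        ["/opt/MegaRAID/MegaCli/MegaCli64", "-PDList", "-aALL", "-NoLog"]).decode()
    return "root=ssd" if is_ssd(out, slot) else "root=default"
```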
Then I see this kind of error for most, if not all, drives:
Apr 21 18:00:47 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate
/dev/sdt1'(err) '2016-04-21 18:00:47.115322 7fc408ff9700 0 -- :/885104093 >>
__MON_IP__:6789/0 pipe(0x7fc400008280 sd=6 :0 s=1 pgs=0 cs=0 l=1
c=0x7fc400012670).fault'
Apr 21 18:00:50 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate
/dev/sdt1'(err) '2016-04-21 18:00:50.115543 7fc408ef8700 0 -- :/885104093 >>
__MON_IP__:6789/0 pipe(0x7fc400000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1
c=0x7fc40000e1d0).fault'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate
/dev/sdt1'(out) 'failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf
--name=osd.113 --keyring=/var/lib/ceph/osd/ceph-113/keyring osd crush
create-or-move -- 113 1.81 host=ceph4 root=default''
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate
/dev/sdt1'(err) 'ceph-disk: Error: ceph osd start failed: Command
'['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.113']'
returned non-zero exit status 1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate
/dev/sdt1' [1257] exit with return code 1
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: adding watch on '/dev/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: created db file
'/run/udev/data/b65:49' for
'/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:6/2:2:6:0/block/sdt/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: passed unknown number of bytes
to netlink monitor 0x7f4cec2f3240
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: seq 2553 processed with 0
Note that at that point in the boot there is, I think, still no network, as
the interfaces are only brought up later according to the network journal:
Apr 21 18:02:16 ceph4._snip_ network[2904]: Bringing up interface p2p1: [ OK
]
Apr 21 18:02:19 ceph4._snip_ network[2904]: Bringing up interface p2p2: [ OK
]
=> too bad for the OSD startups... I should add that I also disabled
NetworkManager and I'm using static network configuration files... but I don't
understand why the ceph init script would be called before the network is up?
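My current suspicion is that it isn't the init script at all, but the ceph
udev rules firing as each disk is discovered, long before network.target.
From memory, 95-ceph-osd.rules contains something roughly like this
(paraphrased, not verbatim — the GUID is the Ceph OSD partition type; please
check your own copy of the file):

```
# /lib/udev/rules.d/95-ceph-osd.rules (paraphrased from memory)
# Activate ceph-tagged partitions as soon as the kernel announces them,
# regardless of whether the network is up yet.
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  RUN+="/usr/sbin/ceph-disk-activate /dev/$name"
```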
But even if I had network, I'm having another issue: I'm wondering whether I'm
hitting deadlocks somewhere...
Apr 21 18:01:10 ceph4._snip_ systemd-udevd[779]: worker [792]
/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2
is taking a long time
(...)
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) 'SG_IO: bad/missing sense data, sb[]: 70 00
05 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) ' 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) ''
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(out) '=== osd.107 === '
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) '2016-04-21 18:01:54.707669 7f95801ac700 0 --
:/2141879112 >> __MON_IP__:6789/0 pipe(0x7f957c05f710 sd=4 :0 s=1 pgs=0 cs=0
l=1 c=0x7f957c05bb40).fault'
(...)
Apr 21 18:02:12 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) '2016-04-21 18:02:12.709053 7f95801ac700 0 --
:/2141879112 >> __MON_IP__:6789/0 pipe(0x7f9570008280 sd=4 :0 s=1 pgs=0 cs=0
l=1 c=0x7f95700056a0).fault'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) 'create-or-move updated item name 'osd.107'
weight 1.81 at location {host=ceph4,root=default} to crush map'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(out) 'Starting Ceph osd.107 on ceph4...'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2'(err) 'Running as unit
ceph-osd.107.1461254514.449704730.service.'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk
activate-journal /dev/sdn2' [1138] exit with return code 0
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: adding watch on '/dev/sdn2'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: created db file
'/run/udev/data/b8:210' for
'/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2'
If I look at this specific OSD's journal, I see:
Apr 21 18:02:16 ceph4._snip_ systemd[3137]: Executing: /bin/bash -c 'ulimit -n
32768; /usr/bin/ceph-osd -i 107 --pid-file /var/run/ceph/osd.107.pid -c
/etc/ceph/ceph.conf --cluster ceph -f'
Apr 21 18:02:16 ceph4._snip_ bash[3137]: 2016-04-21 18:02:16.147602
7f109f916880 -1 ** ERROR: unable to open OSD superblock on
/var/lib/ceph/osd/ceph-107: (2) No such file or directory
Apr 21 18:02:16 ceph4._snip_ systemd[1]: Child 3137 belongs to
ceph-osd.107.1461254514.449704730.service
I'm assuming this just means that the partition was not mounted because
ceph-disk failed, and so the ceph OSD daemon died...?
After the boot... no OSD is up.
And if I run "ceph-disk activate-all" manually after the node has booted (and
given me ssh access, which indeed takes a long time)... everything comes up.
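In the meantime I'm considering papering over it with a oneshot unit along
these lines (untested, just an idea — it simply repeats after the network is
up what I'm currently doing by hand):

```
# /etc/systemd/system/ceph-disk-activate-all.service  (hypothetical workaround)
[Unit]
Description=Activate all Ceph OSDs once the network is up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ceph-disk activate-all

[Install]
WantedBy=multi-user.target
```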
Any ideas?
Thanks
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com