[ceph-users] Help! All ceph mons crashed.
Hi,

I am currently facing a horrible situation. All my mons are crashing on startup.

Here's a dump of mon.a.log. The last few ops are below. It seems to crash trying to remove a snap? Any ideas?

- WP

   -10> 2014-03-06 17:04:38.838490 7fb2a541a700 1 -- 192.168.116.24:6789/0 --> osd.9 192.168.116.27:6955/11604 -- mon_subscribe_ack(300s) v1 -- ?+0 0x32d0540
    -9> 2014-03-06 17:04:38.838511 7fb2a541a700 1 -- 192.168.116.24:6789/0 <== osd.1 192.168.116.24:6812/30221 6 mon_subscribe({monmap=14+,osd_pg_creates=0}) v2 50+0+0 (3009623588 0 0) 0x32d71c0 con 0x3224200
    -8> 2014-03-06 17:04:38.838527 7fb2a541a700 1 -- 192.168.116.24:6789/0 --> osd.1 192.168.116.24:6812/30221 -- mon_subscribe_ack(300s) v1 -- ?+0 0x32d3640
    -7> 2014-03-06 17:04:38.838545 7fb2a541a700 1 -- 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286886 forward(pg_stats(0 pgs tid 790 v 0) v1 caps allow rwx) to leader v1 398+0+0 (2470115819 0 0) 0x32f8c80 con 0x2f91760
    -6> 2014-03-06 17:04:38.838570 7fb2a541a700 1 mon.a@0(leader).paxos(paxos active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838571 lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
    -5> 2014-03-06 17:04:38.838595 7fb2a541a700 1 mon.a@0(leader).paxos(paxos active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838597 lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
    -4> 2014-03-06 17:04:38.838626 7fb2a541a700 1 -- 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286887 forward(osd_pgtemp(e26089 {6.fe=[]} v26089) v1 caps allow rwx) to leader v1 297+0+0 (2442013554 0 0) 0x32f8780 con 0x2f91760
    -3> 2014-03-06 17:04:38.838662 7fb2a541a700 1 mon.a@0(leader).paxos(paxos active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838665 lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
    -2> 2014-03-06 17:04:38.838696 7fb2a541a700 1 -- 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286888 forward(pool_op(delete unmanaged snap pool 10 auid 0 tid 27 name v0) v4 caps allow r) to leader v1 313+0+0 (3176715156 0 0) 0x32f8500 con 0x2f91760
    -1> 2014-03-06 17:04:38.838715 7fb2a541a700 1 mon.a@0(leader).paxos(paxos active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838717 lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
     0> 2014-03-06 17:04:38.840833 7fb2a541a700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_unmanaged_snap(snapid_t)' thread 7fb2a541a700 time 2014-03-06 17:04:38.838745
 osd/osd_types.cc: 799: FAILED assert(is_unmanaged_snaps_mode())

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: /usr/bin/ceph-mon() [0x6c96e9]
 2: (OSDMonitor::prepare_pool_op(MPoolOp*)+0x970) [0x5c3ad0]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1ab) [0x5c3d8b]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0xa1a) [0x5940ea]
 5: (Monitor::dispatch(MonSession*, Message*, bool)+0xdb) [0x56320b]
 6: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
 7: (Monitor::handle_forward(MForward*)+0x9c2) [0x565092]
 8: (Monitor::dispatch(MonSession*, Message*, bool)+0x400) [0x563530]
 9: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
 10: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]
 11: (DispatchQueue::entry()+0x582) [0x7de6c2]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d994d]
 13: (()+0x79d1) [0x7fb2ac05e9d1]
 14: (clone()+0x6d) [0x7fb2aad95b6d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
Re: [ceph-users] Help! All ceph mons crashed.
Ok, I think I got bitten by http://tracker.ceph.com/issues/7210, or rather, the cppool command in http://www.sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/

I did use "rados cppool " on a pool with snapshots (the OpenStack Glance pool). A user reported that Ceph crashed when he deleted an image in OpenStack.

I'm now wondering if I can ignore the operation, or the OpenStack Glance pool, and get the mons to start up again. Any help will be greatly appreciated!

- WP

On Thu, Mar 6, 2014 at 5:33 PM, YIP Wai Peng wrote:
> Hi,
>
> I am currently facing a horrible situation. All my mons are crashing on
> startup.
>
> Here's a dump of mon.a.log. The last few ops are below. It seems to crash
> trying to remove a snap? Any ideas?
>
> - WP
>
> [...]
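(For reference, the pg-splitting workflow from that blog post boils down to roughly the sketch below; the pool names and pg_num are placeholders, and this is a reconstruction rather than the exact commands used. The copy recreates the objects but not the pool's self-managed snapshot state, which is what later trips the remove_unmanaged_snap assert when a snap delete arrives.)

    # Rough sketch of the "change pg_num via cppool" procedure (placeholder names/values).
    # Stop all writers to the source pool first.
    ceph osd pool create images-new 512                               # new pool with the desired pg_num
    rados cppool images images-new                                    # copies objects, but NOT snapshot state
    ceph osd pool delete images images --yes-i-really-really-mean-it  # drop the old pool
    ceph osd pool rename images-new images                            # put the new pool in its place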
[ceph-users] [Solved]: Help! All ceph mons crashed.
I've managed to get joao's assistance in tracking down the issue. I'll be updating bug #7210. Thanks joao and all!

- WP

On Thu, Mar 6, 2014 at 6:25 PM, YIP Wai Peng wrote:
> Ok, I think I got bitten by http://tracker.ceph.com/issues/7210, or
> rather, the cppool command in
> http://www.sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/
>
> I did use "rados cppool " on a pool with snapshots (the OpenStack Glance
> pool). A user reported that Ceph crashed when he deleted an image in
> OpenStack.
>
> I'm now wondering if I can ignore the operation, or the OpenStack Glance
> pool, and get the mons to start up again. Any help will be greatly
> appreciated!
>
> - WP
>
> [...]
Re: [ceph-users] Ceph stops responding
I had this error yesterday. I had run out of storage at /var/lib/ceph/mon/ on the local file system on the monitor.

Kind regards,
Jerker Nyberg

On Wed, 5 Mar 2014, Georgios Dimitrakakis wrote:

Can someone help me with this error:

2014-03-05 14:54:27.253711 7f654fd3d700 0 mon.client1@0(leader).data_health(96) update_stats avail 3% total 51606140 used 47174264 avail 1810436
2014-03-05 14:54:27.253916 7f654fd3d700 -1 mon.client1@0(leader).data_health(96) reached critical levels of available space on data store -- shutdown!

Why is it showing only 3% available when there is plenty of storage???

Best,
G.

On Wed, 5 Mar 2014 17:51:28 +0530, Srinivasa Rao Ragolu wrote:

The ideal setup is node1 for the mon, node2 for OSD1 and node3 for OSD2 (the nodes can be VMs too). MDS is not required if you are not using file system storage with Ceph. Please follow parts 1, 2 and 3 of this blog for detailed steps: http://karan-mj.blogspot.in/2013/12/what-is-ceph-ceph-is-open-source.html Follow each and every instruction on this blog.

Thanks,
Srinivas.

On Wed, Mar 5, 2014 at 3:44 PM, Georgios Dimitrakakis wrote:

My setup consists of two nodes. The first node (master) is running:

- mds
- mon
- osd.0

and the second node (CLIENT) is running:

- osd.1

Therefore I've restarted the ceph services on both nodes.

Leaving "ceph -w" running for as long as it can, after a few seconds the error that is produced is this:

2014-03-05 12:08:17.715699 7fba13fff700 0 monclient: hunting for new mon
2014-03-05 12:08:17.716108 7fba102f8700 0 -- 192.168.0.10:0/1008298 >> X.Y.Z.X:6789/0 pipe(0x7fba08008e50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba080090b0).fault

(where X.Y.Z.X is the public IP of the CLIENT node). And it keeps going on...

"ceph health" after a few minutes shows the following:

2014-03-05 12:12:58.355677 7effc52fb700 0 monclient(hunting): authenticate timed out after 300
2014-03-05 12:12:58.355717 7effc52fb700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

Any ideas now??

Best,
G.

First try to start the OSDs by restarting the ceph service on the ceph nodes. If that works you should be able to see the ceph-osd processes in the process list, and you do not need to add any public or private network in ceph.conf. If none of the OSDs run then you need to reconfigure them from the monitor node. Please check whether the ceph-mon process is running on the monitor node or not; ceph-mds should not run. Also check that the /etc/hosts file has valid IP addresses for the cluster nodes. Finally, check that ceph.client.admin.keyring and ceph.bootstrap-osd.keyring match on all the cluster nodes.

Best of luck.
Srinivas.

On Wed, Mar 5, 2014 at 3:04 PM, Georgios Dimitrakakis wrote:

Hi!

I have installed Ceph and created two OSDs and was very happy with that, but apparently not everything was correct. Today, after a system reboot, the cluster comes up and for a few moments it seems to be OK (using the "ceph health" command), but after a few seconds the "ceph health" command doesn't produce any output at all. It just stays there without anything on the screen... "ceph -w" does the same as well.

If I restart the ceph services ("service ceph restart"), it works again for a few seconds, but after a few more it stays frozen.

Initially I thought that this was a firewall problem, but apparently it isn't. Then I thought that it had to do with public_network / cluster_network not being defined in ceph.conf, and changed that. No matter what I do, the cluster works for a few seconds after the service restart and then it stops responding...

Any help much appreciated!!!

Best,
G.
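(For anyone hitting the same shutdown: it can be confirmed by checking free space on the monitor's data directory against the monitor's critical threshold. A minimal sketch, assuming a default installation layout and the mon name from the log above; option names are worth double-checking for your release.)

    # Free space on the filesystem holding the mon data dir (default path assumed)
    df -h /var/lib/ceph/mon/ceph-client1

    # The mon shuts itself down once available space falls below its critical threshold;
    # the relevant options should show up via the admin socket (assumed names: mon_data_avail_*):
    ceph daemon mon.client1 config show | grep mon_data_avail

    # Freeing space (rotating/compressing old logs, compacting the mon store, or growing
    # the filesystem) and restarting the mon gets it back up.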
Re: [ceph-users] Ceph stops responding
Good spot!!! The same problem here!!!

Best,
G.

On Thu, 6 Mar 2014 12:28:26 +0100 (CET), Jerker Nyberg wrote:
> I had this error yesterday. I had run out of storage at
> /var/lib/ceph/mon/ on the local file system on the monitor.
>
> Kind regards,
> Jerker Nyberg
>
> [...]
Re: [ceph-users] XFS tuning on OSD
On 03/06/2014 01:51 AM, Robert van Leeuwen wrote:

Hi,

We experience something similar with our OpenStack Swift setup. You can change the sysctl "vm.vfs_cache_pressure" to make sure more inodes are being kept in cache. (Do not set this to 0, because you will trigger the OOM killer at some point ;)

I've been setting it to around 10, which helps in some cases (up to about 20% from what I've seen). I actually see the most benefit with mid-sized IOs around 128K in size. I suspect there is a curve where if the IOs are big you aren't doing that many lookups, and if the IOs are small you don't evict inodes/dentries due to buffered data. Somewhere in the middle is where it hurts more.

We also decided to go for nodes with more memory and smaller disks. You can read about our experiences here: http://engineering.spilgames.com/openstack-swift-lots-small-files/

Cheers,
Robert

From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] on behalf of Guang Yang [yguan...@yahoo.com]

Hello all,

Recently I am working on Ceph performance analysis on our cluster. Our OSD hardware looks like:

  11 SATA disks, 4TB each, 7200 RPM
  48GB RAM

When breaking down the latency, we found that half of the latency (average latency is around 60 milliseconds via radosgw) comes from file lookup and open (there could be a couple of disk seeks there). When looking at the file system cache (slabtop), we found that around 5M dentries / inodes are cached; however, the host has around 110 million files (and directories) in total.

I am wondering if there is any good experience within the community tuning for the same workload, e.g. change the inode size? use the mkfs.xfs -n size=64k option [1]?
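(A minimal sketch of the vm.vfs_cache_pressure change described above, using the value mentioned in the thread; it is a trade-off, not a recommendation, so test against your own workload.)

    # Keep dentries/inodes cached more aggressively (default is 100; lower keeps more).
    # Never set it to 0, since that can eventually trigger the OOM killer.
    sysctl -w vm.vfs_cache_pressure=10

    # Persist across reboots
    echo 'vm.vfs_cache_pressure = 10' >> /etc/sysctl.conf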
Re: [ceph-users] mon servers
On Wed, Mar 5, 2014 at 12:49 PM, Jonathan Gowar wrote:
> On Wed, 2014-03-05 at 16:35 +, Joao Eduardo Luis wrote:
>> On 03/05/2014 02:30 PM, Jonathan Gowar wrote:
>> > In an attempt to add a mon server, I appear to have completely broken a
>> > mon service to the cluster:-
>>
>> Did you start the mon you added? How did you add the new monitor?
>
> From the admin node:-
> http://pastebin.com/AYKgevyF

Ah, you added a monitor with ceph-deploy, but that is not something that is supported (yet).

See: http://tracker.ceph.com/issues/6638

This should be released in the upcoming ceph-deploy version. But what it means is that you kind of deployed monitors that have no idea how to communicate with the ones that were deployed before.

>> Are your other monitors running? How many are they?
>
> There is 1 mon server running, on ceph-1:-
>
>   /usr/bin/ceph-mon -i ceph-1 --pid-file /var/run/ceph/mon.ceph-1.pid -c /etc/ceph/ceph.conf
>
>> For each of your monitors, assuming you have started all of them,
>> please share the results from
>>
>>   ceph daemon mon.N mon_status
>
> ceph-1 is the only server currently able to run.
>
> # ceph daemon mon.ceph-1 mon_status
> { "name": "ceph-1",
>   "rank": 0,
>   "state": "probing",
>   "election_epoch": 0,
>   "quorum": [],
>   "outside_quorum": [
>         "ceph-1"],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "6c0cbf2a-8b30-4b03-b4f5-6d56386b1c14",
>       "modified": "2014-03-05 12:32:58.200957",
>       "created": "0.00",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-1",
>               "addr": "10.11.4.52:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-3",
>               "addr": "10.11.4.54:6789\/0"}]}}
Re: [ceph-users] XFS tuning on OSD
On 05/03/2014 15:34, Guang Yang wrote:
> Hello all,

Hello,

> Recently I am working on Ceph performance analysis on our cluster. Our OSD
> hardware looks like:
>
>   11 SATA disks, 4TB each, 7200 RPM
>   48GB RAM
>
> When breaking down the latency, we found that half of the latency (average
> latency is around 60 milliseconds via radosgw) comes from file lookup and
> open (there could be a couple of disk seeks there). When looking at the file
> system cache (slabtop), we found that around 5M dentries / inodes are cached;
> however, the host has around 110 million files (and directories) in total.
>
> I am wondering if there is any good experience within the community tuning
> for the same workload, e.g. change the inode size? use the mkfs.xfs -n
> size=64k option [1]?

Beware, this particular option can trigger weird behaviour; see ceph bug #6301 and http://oss.sgi.com/archives/xfs/2013-12/msg00087.html

Looking at the logs in the git kernel repo, AFAICS the patch has only been integrated in 3.14-rc1 and has not been backported (commit b3f03bac8132207a20286d5602eda64500c19724).

Cheers,
Re: [ceph-users] mon servers
On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
> > From the admin node:-
> > http://pastebin.com/AYKgevyF
>
> Ah, you added a monitor with ceph-deploy, but that is not something that
> is supported (yet)
>
> See: http://tracker.ceph.com/issues/6638
>
> This should be released in the upcoming ceph-deploy version.
>
> But what it means is that you kind of deployed monitors that have no
> idea how to communicate with the ones that were deployed before.

Fortunately the cluster is not production, so I actually was able to laugh at this :)

What's the best way to resolve this then?

Regards,
Jon
Re: [ceph-users] mon servers
I'm confused...

The bug tracker says this was resolved ten days ago. Also, I actually used ceph-deploy on 2/12/2014 to add two monitors to my cluster, and it worked, and the documentation says it can be done.

However, I believe that I added the new mons to the 'mon_initial_members' line in ceph.conf before I added them. Maybe this is the reason it worked?

Maybe I'm misunderstanding the issue?

Brad

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jonathan Gowar
Sent: Thursday, March 06, 2014 9:27 AM
To: Alfredo Deza
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mon servers

On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
> > From the admin node:-
> > http://pastebin.com/AYKgevyF
>
> Ah, you added a monitor with ceph-deploy, but that is not something that
> is supported (yet)
>
> [...]

Fortunately the cluster is not production, so I actually was able to laugh at this :)

What's the best way to resolve this then?

Regards,
Jon
[ceph-users] Qemu iotune values for RBD
Hi all,

We're about to go live with some qemu rate limiting to RBD, and I wanted to cross-check our values with this list, in case someone can chime in with their experience or known best practices. The only reasonable, non test-suite, values I found on the web are:

  iops_wr 200
  iops_rd 400
  bps_wr 4000
  bps_rd 8000

and those seem (to me) to offer a "pretty good" service level, with more iops than a typical disk yet lower throughput (which is good considering our single gigabit NICs on the hypervisors).

Our main goal for the rate limiting is to protect the cluster from abusive users running fio, etc., while not overly restricting our varied legitimate applications. Any opinions here?

Cheers, Dan
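(For context, those names correspond to QEMU's per-drive I/O throttling options, where the bps_* values are bytes per second. A hedged sketch of how they could be passed on the command line follows; the pool/image and the other -drive fields are placeholders, and the numbers are simply the ones quoted above. Guests managed through libvirt can express the same limits via the <iotune> element or virsh blkdeviotune.)

    # Placeholder RBD pool/image and drive options; throttle values taken from the figures above.
    qemu-system-x86_64 ... \
      -drive "file=rbd:volumes/vm-disk-1,format=raw,if=virtio,cache=writeback,iops_rd=400,iops_wr=200,bps_rd=8000,bps_wr=4000"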
Re: [ceph-users] Qemu iotune values for RBD
On 03/06/2014 08:38 PM, Dan van der Ster wrote:
> Hi all,
>
> We're about to go live with some qemu rate limiting to RBD, and I wanted to
> cross-check our values with this list, in case someone can chime in with
> their experience or known best practices. The only reasonable, non
> test-suite, values I found on the web are:
>
>   iops_wr 200
>   iops_rd 400
>   bps_wr 4000
>   bps_rd 8000
>
> and those seem (to me) to offer a "pretty good" service level, with more
> iops than a typical disk yet lower throughput (which is good considering
> our single gigabit NICs on the hypervisors).
>
> Our main goal for the rate limiting is to protect the cluster from abusive
> users running fio, etc., while not overly restricting our varied legitimate
> applications. Any opinions here?

I normally only limit the writes, since those are the most expensive in a Ceph cluster due to replication. With reads you can't really kill the disks, since at some point all the objects will probably be in the page cache of the OSDs.

I don't see any good reason to limit reads, but if you do, I'd set it to something like 2.5k reads and 200MB/sec or so, just to give the VM room to boost with reads when needed. You'll probably see that your cluster does a lot of writes and not so many reads.

Wido

> Cheers, Dan

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Re: [ceph-users] mon servers
On Thu, Mar 6, 2014 at 2:06 PM, McNamara, Bradley wrote:
> I'm confused...
>
> The bug tracker says this was resolved ten days ago.

The release for that feature is not out yet.

> Also, I actually used ceph-deploy on 2/12/2014 to add two monitors to my
> cluster, and it worked, and the documentation says it can be done.

The docs probably refer to the upcoming, yet-to-be-released feature. Those features are always marked with a "new in version N.N.N".

> However, I believe that I added the new mons to the 'mon_initial_members'
> line in ceph.conf before I added them. Maybe this is the reason it worked?

That doesn't sound right. The steps to add a monitor to an existing cluster are very different.

> Maybe I'm misunderstanding the issue?
>
> Brad
>
> [...]
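(For reference, until that ceph-deploy release is out, adding a monitor to an existing cluster by hand is roughly the sketch below, summarising the documented procedure; the mon ID "newmon" and its address are placeholders, so check the add/remove monitor docs for your release.)

    # On the new monitor host
    mkdir -p /var/lib/ceph/mon/ceph-newmon

    # Fetch the existing mon keyring and the current monmap from the running cluster
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap

    # Build the new monitor's data store from them
    ceph-mon -i newmon --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring

    # Add it to the monmap, then start the daemon
    ceph mon add newmon 10.11.4.55:6789
    ceph-mon -i newmon --public-addr 10.11.4.55:6789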
[ceph-users] Writing behavior in CEPH for VM using RBD
Hello,

I've been watching with great eagerness the design and features of Ceph, especially compared to the current distributed file systems I use.

One of the pains with VM workloads is when writes stall for more than a few seconds: virtual machines that think they are communicating with a real live block device generally error out their file systems. In the case of ext? they remount as read-only; with other file and operating systems the behavior for that scenario is... erratic at best.

It looks like the default write timeout for an OSD is 30 seconds. With the write consistency behavior that Ceph has, does that mean a write could be stalled by the client for up to 30 seconds in the event of an OSD failing to write, for whatever reason? If that is the case, is there a way around such a long timeout in block device terms, short of 1-second checks?
[ceph-users] Utilizing host DAS in XEN or XCP
I am new to Ceph and am looking at trying to utilize some existing hardware to perform two tasks per node.

We have 2 servers which can hold 12 and 16 drives respectively, and probably 4 servers which take 4 drives. Ideally, if possible, we would like to install XCP on each of these servers and use Ceph to cluster the DAS (direct attached storage), 44Gb of native storage in total. We would install a single SSD in each server for XCP and dom0, then set up 4TB RAID0 arrays on each server, hoping to run a Ceph OSD for each 4TB array.

Is this scenario possible? And if so, any pointers on the best way to install Ceph onto each XCP host?

Thanks in advance.

Paul