[ceph-users] Help! All ceph mons crashed.

2014-03-06 Thread YIP Wai Peng
Hi,

I am currently facing a horrible situation. All my mons are crashing on
startup.

Here's a dump of mon.a.log. The last few ops are below. It seems to crash
trying to remove a snap? Any ideas?

- WP


   -10> 2014-03-06 17:04:38.838490 7fb2a541a700  1 -- 192.168.116.24:6789/0 -->
osd.9 192.168.116.27:6955/11604 -- mon_subscribe_ack(300s) v1 -- ?+0
0x32d0540
-9> 2014-03-06 17:04:38.838511 7fb2a541a700  1 -- 192.168.116.24:6789/0 <==
osd.1 192.168.116.24:6812/30221 6 
mon_subscribe({monmap=14+,osd_pg_creates=0}) v2  50+0+0 (3009623588 0
0) 0x32d71c0 con 0x3224200
-8> 2014-03-06 17:04:38.838527 7fb2a541a700  1 -- 192.168.116.24:6789/0 -->
osd.1 192.168.116.24:6812/30221 -- mon_subscribe_ack(300s) v1 -- ?+0
0x32d3640
-7> 2014-03-06 17:04:38.838545 7fb2a541a700  1 -- 192.168.116.24:6789/0 <==
mon.2 192.168.116.26:6789/0 1868286886  forward(pg_stats(0 pgs tid 790
v 0) v1 caps allow rwx) to leader v1  398+0+0 (2470115819 0 0)
0x32f8c80 con 0x2f91760
-6> 2014-03-06 17:04:38.838570 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838571
lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
-5> 2014-03-06 17:04:38.838595 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838597
lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
-4> 2014-03-06 17:04:38.838626 7fb2a541a700  1 -- 192.168.116.24:6789/0 <==
mon.2 192.168.116.26:6789/0 1868286887  forward(osd_pgtemp(e26089
{6.fe=[]} v26089) v1 caps allow rwx) to leader v1  297+0+0 (2442013554
0 0) 0x32f8780 con 0x2f91760
-3> 2014-03-06 17:04:38.838662 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838665
lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
-2> 2014-03-06 17:04:38.838696 7fb2a541a700  1 -- 192.168.116.24:6789/0 <==
mon.2 192.168.116.26:6789/0 1868286888  forward(pool_op(delete
unmanaged snap pool 10 auid 0 tid 27 name  v0) v4 caps allow r) to leader
v1  313+0+0 (3176715156 0 0) 0x32f8500 con 0x2f91760
-1> 2014-03-06 17:04:38.838715 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838717
lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
 0> 2014-03-06 17:04:38.840833 7fb2a541a700 -1 osd/osd_types.cc: In
function 'void pg_pool_t::remove_unmanaged_snap(snapid_t)' thread
7fb2a541a700 time 2014-03-06 17:04:38.838745
osd/osd_types.cc: 799: FAILED assert(is_unmanaged_snaps_mode())

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: /usr/bin/ceph-mon() [0x6c96e9]
 2: (OSDMonitor::prepare_pool_op(MPoolOp*)+0x970) [0x5c3ad0]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1ab) [0x5c3d8b]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0xa1a) [0x5940ea]
 5: (Monitor::dispatch(MonSession*, Message*, bool)+0xdb) [0x56320b]
 6: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
 7: (Monitor::handle_forward(MForward*)+0x9c2) [0x565092]
 8: (Monitor::dispatch(MonSession*, Message*, bool)+0x400) [0x563530]
 9: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
 10: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]
 11: (DispatchQueue::entry()+0x582) [0x7de6c2]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d994d]
 13: (()+0x79d1) [0x7fb2ac05e9d1]
 14: (clone()+0x6d) [0x7fb2aad95b6d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help! All ceph mons crashed.

2014-03-06 Thread YIP Wai Peng
Ok, I think I got bitten by http://tracker.ceph.com/issues/7210, or rather,
the cppool command in
http://www.sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/

I did use "rados cppool  " in a pool with snapshots
(openstack glance). A user feedback that ceph crashed when he deleted an
image in openstack.
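
For reference, the pool-copy workaround from that blog post boils down to
something like the sketch below (pool names and pg count here are hypothetical;
rados cppool copies the objects but not the pool's snapshot state, which is
what appears to trip the assert above):

    ceph osd pool create images.new 128
    rados cppool images images.new
    ceph osd pool delete images images --yes-i-really-really-mean-it
    ceph osd pool rename images.new images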

I'm now wondering if I can ignore the operation, or the openstack glance
pool, and get the mons to start up again. Any help will be greatly
appreciated!

- WP


On Thu, Mar 6, 2014 at 5:33 PM, YIP Wai Peng  wrote:

> Hi,
>
> I am currently facing a horrible situation. All my mons are crashing on
> startup.
>
> Here's a dump of mon.a.log. The last few ops are below. It seems to crash
> trying to remove a snap? Any ideas?
>
> - WP
>
> 
>-10> 2014-03-06 17:04:38.838490 7fb2a541a700  1 --
> 192.168.116.24:6789/0 --> osd.9 192.168.116.27:6955/11604 --
> mon_subscribe_ack(300s) v1 -- ?+0 0x32d0540
> -9> 2014-03-06 17:04:38.838511 7fb2a541a700  1 --
> 192.168.116.24:6789/0 <== osd.1 192.168.116.24:6812/30221 6 
> mon_subscribe({monmap=14+,osd_pg_creates=0}) v2  50+0+0 (3009623588 0
> 0) 0x32d71c0 con 0x3224200
> -8> 2014-03-06 17:04:38.838527 7fb2a541a700  1 --
> 192.168.116.24:6789/0 --> osd.1 192.168.116.24:6812/30221 --
> mon_subscribe_ack(300s) v1 -- ?+0 0x32d3640
> -7> 2014-03-06 17:04:38.838545 7fb2a541a700  1 --
> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286886 
> forward(pg_stats(0 pgs tid 790 v 0) v1 caps allow rwx) to leader v1 
> 398+0+0 (2470115819 0 0) 0x32f8c80 con 0x2f91760
> -6> 2014-03-06 17:04:38.838570 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838571
> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
> -5> 2014-03-06 17:04:38.838595 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838597
> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
> -4> 2014-03-06 17:04:38.838626 7fb2a541a700  1 --
> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286887 
> forward(osd_pgtemp(e26089 {6.fe=[]} v26089) v1 caps allow rwx) to leader v1
>  297+0+0 (2442013554 0 0) 0x32f8780 con 0x2f91760
> -3> 2014-03-06 17:04:38.838662 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838665
> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
> -2> 2014-03-06 17:04:38.838696 7fb2a541a700  1 --
> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286888 
> forward(pool_op(delete unmanaged snap pool 10 auid 0 tid 27 name  v0) v4
> caps allow r) to leader v1  313+0+0 (3176715156 0 0) 0x32f8500 con
> 0x2f91760
> -1> 2014-03-06 17:04:38.838715 7fb2a541a700  1 mon.a@0(leader).paxos(paxos
> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838717
> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
>  0> 2014-03-06 17:04:38.840833 7fb2a541a700 -1 osd/osd_types.cc: In
> function 'void pg_pool_t::remove_unmanaged_snap(snapid_t)' thread
> 7fb2a541a700 time 2014-03-06 17:04:38.838745
> osd/osd_types.cc: 799: FAILED assert(is_unmanaged_snaps_mode())
>
>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>  1: /usr/bin/ceph-mon() [0x6c96e9]
>  2: (OSDMonitor::prepare_pool_op(MPoolOp*)+0x970) [0x5c3ad0]
>  3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1ab) [0x5c3d8b]
>  4: (PaxosService::dispatch(PaxosServiceMessage*)+0xa1a) [0x5940ea]
>  5: (Monitor::dispatch(MonSession*, Message*, bool)+0xdb) [0x56320b]
>  6: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
>  7: (Monitor::handle_forward(MForward*)+0x9c2) [0x565092]
>  8: (Monitor::dispatch(MonSession*, Message*, bool)+0x400) [0x563530]
>  9: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
>  10: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]
>  11: (DispatchQueue::entry()+0x582) [0x7de6c2]
>  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d994d]
>  13: (()+0x79d1) [0x7fb2ac05e9d1]
>  14: (clone()+0x6d) [0x7fb2aad95b6d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
> 
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Solved]: Help! All ceph mons crashed.

2014-03-06 Thread YIP Wai Peng
I've managed to get joao's assistance in tracking down the issue. I'll be
updating bug 7210.

Thanks joao and all!

- WP

On Thu, Mar 6, 2014 at 6:25 PM, YIP Wai Peng  wrote:

> Ok, I think I got bitten by http://tracker.ceph.com/issues/7210, or
> rather, the cppool command in
> http://www.sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/
>
> I did use "rados cppool  " on a pool with snapshots (the
> openstack glance pool). A user reported that ceph crashed when he deleted an
> image in openstack.
>
> I'm now wondering if I can ignore the operation, or the openstack glance
> pool, and get the mons to start up again. Any help will be greatly
> appreciated!
>
> - WP
>
>
> On Thu, Mar 6, 2014 at 5:33 PM, YIP Wai Peng wrote:
>
>> Hi,
>>
>> I am currently facing a horrible situation. All my mons are crashing on
>> startup.
>>
>> Here's a dump of mon.a.log. The last few ops are below. It seems to crash
>> trying to remove a snap? Any ideas?
>>
>> - WP
>>
>> 
>>-10> 2014-03-06 17:04:38.838490 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 --> osd.9 192.168.116.27:6955/11604 --
>> mon_subscribe_ack(300s) v1 -- ?+0 0x32d0540
>> -9> 2014-03-06 17:04:38.838511 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 <== osd.1 192.168.116.24:6812/30221 6 
>> mon_subscribe({monmap=14+,osd_pg_creates=0}) v2  50+0+0 (3009623588 0
>> 0) 0x32d71c0 con 0x3224200
>> -8> 2014-03-06 17:04:38.838527 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 --> osd.1 192.168.116.24:6812/30221 --
>> mon_subscribe_ack(300s) v1 -- ?+0 0x32d3640
>> -7> 2014-03-06 17:04:38.838545 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286886 
>> forward(pg_stats(0 pgs tid 790 v 0) v1 caps allow rwx) to leader v1 
>> 398+0+0 (2470115819 0 0) 0x32f8c80 con 0x2f91760
>> -6> 2014-03-06 17:04:38.838570 7fb2a541a700  1 
>> mon.a@0(leader).paxos(paxos
>> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838571
>> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
>>  -5> 2014-03-06 17:04:38.838595 7fb2a541a700  1 
>> mon.a@0(leader).paxos(paxos
>> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838597
>> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
>> -4> 2014-03-06 17:04:38.838626 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286887 
>> forward(osd_pgtemp(e26089 {6.fe=[]} v26089) v1 caps allow rwx) to leader v1
>>  297+0+0 (2442013554 0 0) 0x32f8780 con 0x2f91760
>> -3> 2014-03-06 17:04:38.838662 7fb2a541a700  1 
>> mon.a@0(leader).paxos(paxos
>> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838665
>> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
>>  -2> 2014-03-06 17:04:38.838696 7fb2a541a700  1 --
>> 192.168.116.24:6789/0 <== mon.2 192.168.116.26:6789/0 1868286888 
>> forward(pool_op(delete unmanaged snap pool 10 auid 0 tid 27 name  v0) v4
>> caps allow r) to leader v1  313+0+0 (3176715156 0 0) 0x32f8500 con
>> 0x2f91760
>> -1> 2014-03-06 17:04:38.838715 7fb2a541a700  1 
>> mon.a@0(leader).paxos(paxos
>> active c 7361285..7361906) is_readable now=2014-03-06 17:04:38.838717
>> lease_expire=2014-03-06 17:04:43.820707 has v0 lc 7361906
>>   0> 2014-03-06 17:04:38.840833 7fb2a541a700 -1 osd/osd_types.cc: In
>> function 'void pg_pool_t::remove_unmanaged_snap(snapid_t)' thread
>> 7fb2a541a700 time 2014-03-06 17:04:38.838745
>> osd/osd_types.cc: 799: FAILED assert(is_unmanaged_snaps_mode())
>>
>>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>>  1: /usr/bin/ceph-mon() [0x6c96e9]
>>  2: (OSDMonitor::prepare_pool_op(MPoolOp*)+0x970) [0x5c3ad0]
>>  3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1ab) [0x5c3d8b]
>>  4: (PaxosService::dispatch(PaxosServiceMessage*)+0xa1a) [0x5940ea]
>>  5: (Monitor::dispatch(MonSession*, Message*, bool)+0xdb) [0x56320b]
>>  6: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
>>  7: (Monitor::handle_forward(MForward*)+0x9c2) [0x565092]
>>  8: (Monitor::dispatch(MonSession*, Message*, bool)+0x400) [0x563530]
>>  9: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
>>  10: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]
>>  11: (DispatchQueue::entry()+0x582) [0x7de6c2]
>>  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d994d]
>>  13: (()+0x79d1) [0x7fb2ac05e9d1]
>>  14: (clone()+0x6d) [0x7fb2aad95b6d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>> 
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] {Disarmed} Re: Ceph stops responding

2014-03-06 Thread Jerker Nyberg


I had this error yesterday. I had run out of storage at 
/var/lib/ceph/mon/ on the local file system of the monitor.
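
A quick sketch of how to check and tune this (the path shown is the default
mon data directory, and 5% is the upstream default for the shutdown threshold):

    df -h /var/lib/ceph/mon     # the mon shuts itself down when this filesystem is nearly full
    # the threshold is controlled by this monitor option (ceph.conf, [mon] section):
    #     mon data avail crit = 5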


Kind regards,
Jerker Nyberg


On Wed, 5 Mar 2014, Georgios Dimitrakakis wrote:


Can someone help me with this error:

2014-03-05 14:54:27.253711 7f654fd3d700  0
mon.client1@0(leader).data_health(96) update_stats avail 3% total
51606140 used 47174264 avail 1810436
2014-03-05 14:54:27.253916 7f654fd3d700 -1
mon.client1@0(leader).data_health(96) reached critical levels of
available space on data store -- shutdown!


Why is it showing only 3% available when there is plenty of storage???


Best,


G.

On Wed, 5 Mar 2014 17:51:28 +0530, Srinivasa Rao Ragolu wrote:

Ideal setup is node1 for mon, node2 for OSD1 and node3 for OSD2
(nodes can be VMs also).

MDS is not required if you are not using file system storage with
ceph.

Please follow the blog for parts 1, 2 and 3 for detailed steps:

http://karan-mj.blogspot.in/2013/12/what-is-ceph-ceph-is-open-source.html

Follow each and every instruction on this blog.

Thanks, Srinivas.

On Wed, Mar 5, 2014 at 3:44 PM, Georgios Dimitrakakis  wrote:


My setup consists of two nodes.

The first node (master) is running:

-mds
-mon
-osd.0

and the second node (CLIENT) is running:

-osd.1

Therefore I've restarted ceph services on both nodes.

Leaving the "ceph -w" running for as long as it can after a few
seconds the error that is produced is this:

2014-03-05 12:08:17.715699 7fba13fff700  0 monclient: hunting for
new mon
2014-03-05 12:08:17.716108 7fba102f8700  0 -- 192.168.0.10:0/1008298 >>
X.Y.Z.X:6789/0 pipe(0x7fba08008e50 sd=4 :0 s=1 pgs=0 cs=0 l=1
c=0x7fba080090b0).fault

(where X.Y.Z.X is the public IP of the CLIENT node).

And it keeps going on...

"ceph-health" after a few minutes shows the following

2014-03-05 12:12:58.355677 7effc52fb700  0 monclient(hunting):
authenticate timed out after 300
2014-03-05 12:12:58.355717 7effc52fb700  0 librados: client.admin
authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

Any ideas now??

Best,

G.


First try to start the OSD nodes by restarting the ceph service on the ceph
nodes. If it works fine then you should be able to see the ceph-osd process
running in the process list, and you do not need to add any public or private
network in ceph.conf. If none of the OSDs run then you need to
reconfigure them from the monitor node.

Please check whether the ceph-mon process is running on the monitor node;
ceph-mds should not run.

Also check that the /etc/hosts file has valid IP addresses for the cluster nodes.

Finally check that ceph.client.admin.keyring and ceph.bootstrap-osd.keyring
match on all the cluster nodes.

Best of luck.
Srinivas.

On Wed, Mar 5, 2014 at 3:04 PM, Georgios Dimitrakakis  wrote:


Hi!

I have installed ceph and created two osds and was very happy with
that, but apparently not everything was correct.

Today after a system reboot the cluster comes up and for a few
moments it seems that it's ok (using the "ceph health" command), but
after a few seconds the "ceph health" command doesn't produce any
output at all.

It just stays there without anything on the screen...

ceph -w is doing the same as well...

If I restart the ceph services ("service ceph restart") again, for a
few seconds it is working, but after a few more it stays frozen.

Initially I thought that this was a firewall problem, but apparently
it isn't.

Then I thought that this had to do with the

public_network

cluster_network

not being defined in ceph.conf, and changed that.

No matter what I do, the cluster works for a few seconds after
the service restart and then it stops responding...

Any help much appreciated!!!

Best,

G.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--






--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] {Disarmed} Re: Ceph stops responding

2014-03-06 Thread Georgios Dimitrakakis

Good spot!!!

The same problem here!!!

Best,

G.

On Thu, 6 Mar 2014 12:28:26 +0100 (CET), Jerker Nyberg wrote:

I had this error yesterday. I had run out of storage at
/var/lib/ceph/mon/ on the local file system of the monitor.

Kind regards,
Jerker Nyberg


On Wed, 5 Mar 2014, Georgios Dimitrakakis wrote:


Can someone help me with this error:

2014-03-05 14:54:27.253711 7f654fd3d700  0
mon.client1@0(leader).data_health(96) update_stats avail 3% total
51606140 used 47174264 avail 1810436
2014-03-05 14:54:27.253916 7f654fd3d700 -1
mon.client1@0(leader).data_health(96) reached critical levels of
available space on data store -- shutdown!


Why is it showing only 3% available when there is plenty of 
storage???



Best,


G.

On Wed, 5 Mar 2014 17:51:28 +0530, Srinivasa Rao Ragolu wrote:

Ideal setup is node1 for mon, node2 for OSD1 and node3 for OSD2
(nodes can be VMs also).
MDS is not required if you are not using file system storage with
ceph.

Please follow the blog for parts 1, 2 and 3 for detailed steps:

http://karan-mj.blogspot.in/2013/12/what-is-ceph-ceph-is-open-source.html

Follow each and every instruction on this blog.
Thanks, Srinivas.
On Wed, Mar 5, 2014 at 3:44 PM, Georgios Dimitrakakis  wrote:


My setup consists of two nodes.
The first node (master) is running:
-mds
-mon
-osd.0
and the second node (CLIENT) is running:
-osd.1
Therefore I've restarted ceph services on both nodes.
Leaving "ceph -w" running for as long as it can, after a few
seconds the error that is produced is this:
2014-03-05 12:08:17.715699 7fba13fff700  0 monclient: hunting for
new mon
2014-03-05 12:08:17.716108 7fba102f8700  0 -- 192.168.0.10:0/1008298 >>
X.Y.Z.X:6789/0 pipe(0x7fba08008e50 sd=4 :0 s=1 pgs=0 cs=0 l=1
c=0x7fba080090b0).fault
(where X.Y.Z.X is the public IP of the CLIENT node).
And it keeps going on...
"ceph health" after a few minutes shows the following:
2014-03-05 12:12:58.355677 7effc52fb700  0 monclient(hunting):
authenticate timed out after 300
2014-03-05 12:12:58.355717 7effc52fb700  0 librados: client.admin
authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
Any ideas now??
Best,
G.


First try to start the OSD nodes by restarting the ceph service on the ceph
nodes. If it works fine then you should be able to see the ceph-osd process
running in the process list, and you do not need to add any public or private
network in ceph.conf. If none of the OSDs run then you need to
reconfigure them from the monitor node.
Please check whether the ceph-mon process is running on the monitor node;
ceph-mds should not run.
Also check that the /etc/hosts file has valid IP addresses for the cluster nodes.
Finally check that ceph.client.admin.keyring and ceph.bootstrap-osd.keyring
match on all the cluster nodes.
Best of luck.
Srinivas.
On Wed, Mar 5, 2014 at 3:04 PM, Georgios Dimitrakakis  wrote:


Hi!
I have installed ceph and created two osds and was very happy with
that, but apparently not everything was correct.
Today after a system reboot the cluster comes up and for a few
moments it seems that it's ok (using the "ceph health" command), but
after a few seconds the "ceph health" command doesn't produce any
output at all.
It just stays there without anything on the screen...
ceph -w is doing the same as well...
If I restart the ceph services ("service ceph restart") again, for a
few seconds it is working, but after a few more it stays frozen.
Initially I thought that this was a firewall problem, but apparently
it isn't.
Then I thought that this had to do with the
public_network
cluster_network
not being defined in ceph.conf, and changed that.
No matter what I do, the cluster works for a few seconds after
the service restart and then it stops responding...
Any help much appreciated!!!
Best,
G.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--








--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS tunning on OSD

2014-03-06 Thread Mark Nelson

On 03/06/2014 01:51 AM, Robert van Leeuwen wrote:

Hi,

We experience something similar with our OpenStack Swift setup.
You can change the sysctl "vm.vfs_cache_pressure" to make sure more inodes are
kept in cache.
(Do not set this to 0, because you will trigger the OOM killer at some point ;)



I've been setting it to around 10 which helps in some cases (up to about 
20% from what I've seen).  I actually see the most benefit with 
mid-sized IOs around 128K in size.  I suspect there is a curve where if 
the IOs are big you aren't doing that many lookups, and if the IOs are 
small you don't evict inodes/dentries due to buffered data.  Somewhere 
in the middle is where it hurts more.
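
For reference, applying that is a one-liner (10 is the value mentioned above;
the sysctl.d file name is just a hypothetical way to persist it):

    sysctl -w vm.vfs_cache_pressure=10
    echo "vm.vfs_cache_pressure = 10" >> /etc/sysctl.d/90-vfs-cache.conf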



We also decided to go for nodes with more memory and smaller disks.
You can read about our experiences here:
http://engineering.spilgames.com/openstack-swift-lots-small-files/

Cheers,
Robert


From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] on 
behalf of Guang Yang [yguan...@yahoo.com]
Hello all,
Recently I am working on Ceph performance analysis on our cluster, our OSD 
hardware looks like:
11 SATA disks, 4TB for each, 7200RPM
48GB RAM

When breaking down the latency, we found that half of the latency (average latency 
is around 60 milliseconds via radosgw) comes from file lookup and open
(there could be a couple of disk seeks there). When looking at the file system 
cache (slabtop), we found
that around 5M dentries / inodes are cached; however, the host has around 110 
million files (and directories) in total.

I am wondering if there is any good experience within the community tuning for the 
same workload, e.g. changing the inode size, or using the mkfs.xfs -n size=64k 
option [1]?
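
For what it's worth, those two tunings would look roughly like this at mkfs
time (illustration only -- /dev/sdX1 is a placeholder for the OSD data
partition, and note the caveat about -n size=64k raised later in this digest):

    mkfs.xfs -f -i size=2048 -n size=64k /dev/sdX1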


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon servers

2014-03-06 Thread Alfredo Deza
On Wed, Mar 5, 2014 at 12:49 PM, Jonathan Gowar  wrote:
> On Wed, 2014-03-05 at 16:35 +, Joao Eduardo Luis wrote:
>> On 03/05/2014 02:30 PM, Jonathan Gowar wrote:
>> > In an attempt to add a mon server, I appear to have completely broken a
>> > mon service to the cluster:-
>>
>> Did you start the mon you added?  How did you add the new monitor?
>
> From the admin node:-
> http://pastebin.com/AYKgevyF

Ah, you added a monitor with ceph-deploy, but that is not something that
is supported (yet).

See: http://tracker.ceph.com/issues/6638

Support for this should be released in the upcoming ceph-deploy version.

But what it means is that you have, in effect, deployed monitors that have no
idea how to communicate with the ones that were deployed before.
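
For reference, a rough sketch of manually backing the half-added monitor out
again (monitor names are taken from the monmap further down; stop the
surviving mon before editing its map, and adapt paths to your setup):

    service ceph stop mon.ceph-1
    ceph-mon -i ceph-1 --extract-monmap /tmp/monmap
    monmaptool --rm ceph-3 /tmp/monmap
    ceph-mon -i ceph-1 --inject-monmap /tmp/monmap
    service ceph start mon.ceph-1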


>
>
>
>
>> Are your other monitors running?  How many are they?
>
> There is 1 mon server running, on ceph-1:-
>
> /usr/bin/ceph-mon -i ceph-1 --pid-file /var/run/ceph/mon.ceph-1.pid
> -c /etc/ceph/ceph.conf
>
>> For each of your monitors, assuming you have started all of them,
>> please share the results from
>>
>> ceph daemon mon.N mon_status
>
> ceph-1 is the only server currently able to run.
>
> # ceph daemon mon.ceph-1 mon_status
> { "name": "ceph-1",
>   "rank": 0,
>   "state": "probing",
>   "election_epoch": 0,
>   "quorum": [],
>   "outside_quorum": [
> "ceph-1"],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>   "fsid": "6c0cbf2a-8b30-4b03-b4f5-6d56386b1c14",
>   "modified": "2014-03-05 12:32:58.200957",
>   "created": "0.00",
>   "mons": [
> { "rank": 0,
>   "name": "ceph-1",
>   "addr": "10.11.4.52:6789\/0"},
> { "rank": 1,
>   "name": "ceph-3",
>   "addr": "10.11.4.54:6789\/0"}]}}
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS tunning on OSD

2014-03-06 Thread Yann Dupont - Veille Techno

On 05/03/2014 15:34, Guang Yang wrote:

Hello all,

Hello,
Recently I am working on Ceph performance analysis on our cluster, our 
OSD hardware looks like:

  11 SATA disks, 4TB for each, 7200RPM
   48GB RAM

When breaking down the latency, we found that half of the latency 
(average latency is around 60 milliseconds via radosgw) comes from 
file lookup and open (there could be a couple of disk seeks there). 
When looking at the file system cache (slabtop), we found that around 
5M dentries / inodes are cached; however, the host has around 110 
million files (and directories) in total.


I am wondering if there is any good experience within the community 
tuning for the same workload, e.g. changing the inode size, or using the 
mkfs.xfs -n size=64k option [1]?




Beware, this particular option can trigger weird behaviour; see
Ceph bug #6301 and
http://oss.sgi.com/archives/xfs/2013-12/msg00087.html

Looking at the logs in the git kernel repo, AFAICS the patch has only been 
integrated in 3.14-rc1, and has not been backported (commit 
b3f03bac8132207a20286d5602eda64500c19724).


Cheers,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon servers

2014-03-06 Thread Jonathan Gowar
On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
> > From the admin node:-
> > http://pastebin.com/AYKgevyF
> 
> Ah you added a monitor with ceph-deploy but that is not something that
> is supported (yet)
> 
> See: http://tracker.ceph.com/issues/6638
> 
> This should be released in the upcoming ceph-deploy version.
> 
> But what it means is that you kind of deployed monitors that have no
> idea how to communicate
> with the ones that were deployed before.

Fortunately the cluster is not production, so I actually was able to
laugh at this :)

What's the best way to resolve this then?

Regards,
Jon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon servers

2014-03-06 Thread McNamara, Bradley
I'm confused...

The bug tracker says this was resolved ten days ago.  Also, I actually used 
ceph-deploy on 2/12/2014 to add two monitors to my cluster, and it worked, and 
the documentation says it can be done.  However, I believe that I added the new 
mons to the 'mon_initial_members' line in ceph.conf before I added them.  
Maybe this is the reason it worked?

Maybe I'm misunderstanding the issue?

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jonathan Gowar
Sent: Thursday, March 06, 2014 9:27 AM
To: Alfredo Deza
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mon servers

On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
> > From the admin node:-
> > http://pastebin.com/AYKgevyF
> 
> Ah you added a monitor with ceph-deploy but that is not something that 
> is supported (yet)
> 
> See: http://tracker.ceph.com/issues/6638
> 
> This should be released in the upcoming ceph-deploy version.
> 
> But what it means is that you kind of deployed monitors that have no 
> idea how to communicate with the ones that were deployed before.

Fortunately the cluster is not production, so I actually was able to laugh at 
this :)

What's the best way to resolve this then?

Regards,
Jon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Qemu iotune values for RBD

2014-03-06 Thread Dan van der Ster
Hi all,

We're about to go live with some qemu  rate limiting to RBD, and I wanted
to crosscheck our values with this list, in case someone can chime in with
their experience or known best practices.

The only reasonable, non test-suite,  values I found on the web are:

iops_wr 200
iops_rd 400
bps_wr 4000
bps_rd 8000

and those seem (to me) to offer a "pretty good" service level, with more
iops than a typical disk yet lower throughput (which is good considering
our single gigabit NICs on the hypervisors).
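
For concreteness, with libvirt/KVM those four limits get applied per disk
roughly like this (a sketch only; the domain name and device "vda" are
placeholders, and the bps_* values are interpreted as bytes per second):

    virsh blkdeviotune <domain> vda \
        --read-iops-sec 400 --write-iops-sec 200 \
        --read-bytes-sec 8000 --write-bytes-sec 4000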

Our main goal for the rate limiting is to protect the cluster from abusive
users running fio, etc., while not overly restricting our varied legitimate
applications.

Any opinions here?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Qemu iotune values for RBD

2014-03-06 Thread Wido den Hollander

On 03/06/2014 08:38 PM, Dan van der Ster wrote:

Hi all,

We're about to go live with some qemu  rate limiting to RBD, and I
wanted to crosscheck our values with this list, in case someone can
chime in with their experience or known best practices.

The only reasonable, non test-suite,  values I found on the web are:

iops_wr 200
iops_rd 400
bps_wr 4000
bps_rd 8000

and those seem (to me) to offer a "pretty good" service level, with more
iops than a typical disk yet lower throughput (which is good considering
our single gigabit NICs on the hypervisors).

Our main goal for the rate limiting is to protect the cluster from
abusive users running fio, etc., while not overly restricting our varied
legitimate applications.

Any opinions here?



I normally only limit the writes since those are the most expensive in a 
Ceph cluster due to replication. With reads you can't really kill the 
disks since at some point all the objects will probably be in the page 
cache of the OSDs.


I don't see any good reason to limit reads, but if you do, I'd set it to 
something like 2.5k reads and 200MB/sec or so, just to give the VM room to 
burst with reads when needed.


You'll probably see that your cluster does a lot of writes and not so 
many reads.


Wido


Cheers, Dan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon servers

2014-03-06 Thread Alfredo Deza
On Thu, Mar 6, 2014 at 2:06 PM, McNamara, Bradley
 wrote:
> I'm confused...
>
> The bug tracker says this was resolved ten days ago.

The release for that feature is not out yet.

> Also, I actually used ceph-deploy on 2/12/2014 to add two monitors to
> my cluster, and it worked, and the documentation says it can be done.

Docs probably refer to the upcoming, yet-to-be-released feature. Those
features are always marked with a "new in version N.N.N" note.


> However, I believe that I added the new mons to the ceph.conf
> 'mon_initial_members' line before I added them.  Maybe this is the
> reason it worked?

That doesn't sound right. The steps to add a monitor to an existing
cluster are very different.
>
> Maybe I'm misunderstanding the issue?
>
> Brad
>
> -Original Message-
> From: ceph-users-boun...@lists.ceph.com 
> [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jonathan Gowar
> Sent: Thursday, March 06, 2014 9:27 AM
> To: Alfredo Deza
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mon servers
>
> On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
>> > From the admin node:-
>> > http://pastebin.com/AYKgevyF
>>
>> Ah you added a monitor with ceph-deploy but that is not something that
>> is supported (yet)
>>
>> See: http://tracker.ceph.com/issues/6638
>>
>> This should be released in the upcoming ceph-deploy version.
>>
>> But what it means is that you kind of deployed monitors that have no
>> idea how to communicate with the ones that were deployed before.
>
> Fortunately the cluster is not production, so I actually was able to laugh at 
> this :)
>
> What's the best way to resolve this then?
>
> Regards,
> Jon
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Writing behavior in CEPH for VM using RBD

2014-03-06 Thread David Bierce
Ello —

I’ve been watching with great eagerness at the design and features of ceph 
especially compared to the current distributed file systems I use.  One of the 
pains with VM work loads is when writes stall for more than a few seconds, 
virtual machines that think they are communicating with a real live block 
device generally error out their file systems, in the case of ext? they remount 
as read only, with file and operating systems the behaviors for that scenario 
is…erratic at best.

It looks like the default write timeout for an OSD is 30 seconds.  With the 
write consistency behavior that ceph has, does that mean a write could be 
stalled by the client for up to 30 seconds in the event of an OSD failing to 
write, for whatever reason?  If that is the case, is there a way around such a 
long timeout in block device terms, short of 1-second checks?
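
For context, these are the ceph.conf options that usually bound how long I/O
can hang on a failed OSD (a sketch; values shown are the upstream defaults,
and the 30-second figure above most likely matches the slow-request complaint
threshold, which only logs a warning rather than failing the write):

    [osd]
        osd heartbeat interval = 6    # OSDs ping their peers this often (seconds)
        osd heartbeat grace    = 20   # peers missing this long are reported down
        osd op complaint time  = 30   # ops older than this are logged as slow requests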
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Utilizing host DAS in XEN or XCP

2014-03-06 Thread Paul Mitchener
I am new to Ceph and am looking at trying to utilize some existing hardware
to perform 2 tasks per node. We have 2 servers which can hold 12 and 16
drives respectively, and probably 4 servers which take 4 drives.


Ideally, if possible, we would like to install XCP on each of these servers
and use Ceph to cluster the DAS (direct attached storage), 44Gb of native
storage in total.

 

We would install a single SSD in each server to install XCP & dom0, then
set up 4TB RAID0 arrays on each server, hoping to run a Ceph OSD for each
4TB array.
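
For what it's worth, if the OSDs end up running in dom0 (or in a small storage
VM on each host), turning each array into an OSD would look roughly like the
sketch below from an admin node (hostnames and md device names are
placeholders):

    ceph-deploy osd create xcp-node1:/dev/md0
    ceph-deploy osd create xcp-node1:/dev/md1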

 

Is this scenario possible? And if so, any pointers on the best way to
install Ceph onto each XCP host?

 

Thanks in advance.
Paul

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com