Re: [ceph-users] Flapping osd / continuously reported as failed

2013-07-25 Thread Mostowiec Dominik
Hi
We found something else.
After osd.72 flapped, one PG ('3.54d') took a long time to recover.

--
ceph health detail
HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
pg 3.54d is active+recovering, acting [72,108,23]
recovery 1/39821745 degraded (0.000%)
--

The last down/up flap of osd.72 was at 00:45.
In the logs we found:
2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok
After that time everything is ok.

Is it possible that scrubbing was the reason this osd flapped?

We have default scrubbing settings (ceph version 0.56.6).
If scrubbing is the trouble-maker, can we make it a little lighter by
changing the config?
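
For example, would something along these lines help (just a sketch, not a
tested config; the option names should be double-checked against the 0.56
documentation)?

[osd]
  osd max scrubs = 1                ; at most one concurrent scrub per OSD
  osd scrub load threshold = 0.5    ; skip scheduled scrubs when load is above this
  osd scrub min interval = 86400    ; scrub each PG at most once a day
  osd scrub max interval = 604800   ; but force a scrub at least once a week
  osd deep scrub interval = 604800  ; deep scrubs (the expensive ones) weekly

Or would raising 'osd heartbeat grace' be a better way to keep a briefly
overloaded osd from being marked down?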

--
Regards
Dominik

-Original Message-
From: Studziński Krzysztof 
Sent: Wednesday, July 24, 2013 9:48 AM
To: Gregory Farnum; Yehuda Sadeh
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec Dominik
Subject: RE: [ceph-users] Flapping osd / continuously reported as failed

> -Original Message-
> From: Studziński Krzysztof
> Sent: Wednesday, July 24, 2013 1:18 AM
> To: 'Gregory Farnum'; Yehuda Sadeh
> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec 
> Dominik
> Subject: RE: [ceph-users] Flapping osd / continuously reported as 
> failed
> 
> > -Original Message-
> > From: Gregory Farnum [mailto:g...@inktank.com]
> > Sent: Wednesday, July 24, 2013 12:28 AM
> > To: Studziński Krzysztof; Yehuda Sadeh
> > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec 
> > Dominik
> > Subject: Re: [ceph-users] Flapping osd / continuously reported as 
> > failed
> >
> > On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof 
> >  wrote:
> > >> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof 
> > >>  wrote:
> > >> > Hi,
> > >> > We've got some problem with our cluster - it continuously 
> > >> > reports
> failed
> > >> one osd and after auto-rebooting everything seems to work fine 
> > >> for
> some
> > >> time (few minutes). CPU util of this osd is max 8%, iostat is 
> > >> very low. We
> > tried
> > >> to "ceph osd out" such flapping osd, but after recovering this 
> > >> behavior returned on different osd. This osd has also much more 
> > >> read operations
> > than
> > >> others (see file osd_reads.png linked at the bottom of the email; 
> > >> at
> about
> > >> 16:00 we switched off osd.57 and osd.72 started to misbehave. 
> > >> Osd.108 works while recovering).
> > >> >
> > >> > Extract from ceph.log:
> > >> >
> > >> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : 
> > >> > [WRN]
> > map
> > >> e41730 wrongly marked me down
> > >> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.9 
> > >> 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.78 
> > >> 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.10
> 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 
> > >> 2013-07-23
> > >> 22:50:43.611939 >= grace 20.00)
> > >> > 2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow
> requests,
> > 3
> > >> included below; oldest blocked for > 30.688891 secs
> > >> > 2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.688891 seconds old, received at 2013-07-23 22:49:59.678453:
> > >> osd_op(client.44290048.0:125899 .dir.4168.2 [call
> rgw.bucket_prepare_op]
> > >> 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.179044 seconds old, received at 2013-07-23 22:50:00.188300:
> > >> osd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.171968 seconds old, received at 2013-07-23 22:50:00.195376:
> > >> osd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : 
> > >> > [WRN]
> > map
> > >> e41738 wrongly marked me down
> > >> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.20 
> > >> 10.177.64.4:6913/4101
> > >> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:678

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread peter
Any news on this? I'm not sure if you guys received the link to the log
and monitor files. One monitor and one osd are still crashing with the error
below.


On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage,

I just had a 0.61.6 monitor crash and one osd crash. The mon and all osds
restarted just fine after the update, but it decided to crash after 15
minutes or so. See a snippet of the logfile below. I have sent you a
link to the logfiles and monitor store. It seems the bug hasn't been
fully fixed, or something else is going on. I have to note though that
I had one monitor with a clock skew warning for a few minutes (this
happened because of a reboot; it was fixed by ntp). So beware when
upgrading.

Cheers,

mon:

--- begin dump of recent events ---
 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal
(Aborted) **
 in thread 7f262392e780

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: /usr/bin/ceph-mon() [0x597cfa]
 2: (()+0xfcb0) [0x7f2622fc8cb0]
 3: (gsignal()+0x35) [0x7f2621b9e425]
 4: (abort()+0x17b) [0x7f2621ba1b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d]
 6: (()+0xb5846) [0x7f26224ee846]
 7: (()+0xb5873) [0x7f26224ee873]
 8: (()+0xb596e) [0x7f26224ee96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 15: (main()+0x1c19) [0x4835c9]
 16: (__libc_start_main()+0xed) [0x7f2621b8976d]
 17: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780  1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 6: (main()+0x1c19) [0x4835c9]
 7: (__libc_start_main()+0xed) [0x7fb08b8d576d]
 8: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -25> 2013-07-24 09:42:57.933545 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_dump hook 0x1a13010
   -24> 2013-07-24 09:42:57.933581 7fb08d67a780  5 asok(0x1a1e000)
register_command 1 hook 0x1a13010
   -23> 2013-07-24 09:42:57.933584 7fb08d67a780  5 asok(0x1a1e000)
register_command perf dump hook 0x1a13010
   -22> 2013-07-24 09:42:57.933592 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_schema hook 0x1a13010
   -21> 2013-07-24 09:42:57.933595 7fb08d67a780  5 asok(0x1a1e000)
register_command 2 hook 0x1a13010
   -20> 2013-07-24 09:42:57.933597 7fb08d67a780  5 asok(0x1a1e000)
register_command perf schema hook 0x1a13010
   -19> 2013-07-24 09:42:57.933601 7fb08d67a780  5 asok(0x1a1e000)
register_command config show hook 0x1a13010
   -18> 2013-07-24 09:42:57.933604 7fb08d67a780  5 asok(0x1a1e000)
register_command config set hook 0x1a13010
   -17> 2013-07-24 09:42:57.933606 7fb08d67a780  5 asok(0x1a1e000)
register_command log flush hook 0x1a13010
   -16> 2013-07-24 09:42:57.933609 7fb08d67a780  5 asok(0x1a1e000)
register_command log dump hook 0x1a13010
   -15> 2013-07-24 09:42:57.933612 7fb08d67a780  5 asok(0x1a1e000)
register_command log reopen hook 0x1a13010
   -14> 2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version
0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process c

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread Wido den Hollander

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:

Any news on this? I'm not sure if you guys received the link to the log
and monitor files. One monitor and osd is still crashing with the error
below.


I think you are seeing this issue: http://tracker.ceph.com/issues/5737

You can try with new packages from here: 
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/


That should resolve it.
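
Roughly, installing from such a branch means adding the gitbuilder repo and
upgrading, something along these lines (only a sketch; double-check the
distro codename and the exact path against the gitbuilder index page):

echo deb http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish precise main > /etc/apt/sources.list.d/ceph-gitbuilder.list
apt-get update && apt-get install ceph

and then restart the affected mon/osd daemons.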

Wido



On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage,

I just had a 0.61.6 monitor crash and one osd. The mon and all osds
restarted just fine after the update but it decided to crash after 15
minutes orso. See a snippet of the logfile below. I have you sent a
link to the logfiles and monitor store. It seems the bug hasn't been
fully fixed or something else is going on. I have to note though that
I had one monitor with a clock skew warning for a few minutes (this
happened because of a reboot it was fixed by ntp). So beware when
upgrading.

Cheers,

mon:

--- begin dump of recent events ---
 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal
(Aborted) **
 in thread 7f262392e780

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: /usr/bin/ceph-mon() [0x597cfa]
 2: (()+0xfcb0) [0x7f2622fc8cb0]
 3: (gsignal()+0x35) [0x7f2621b9e425]
 4: (abort()+0x17b) [0x7f2621ba1b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d]
 6: (()+0xb5846) [0x7f26224ee846]
 7: (()+0xb5873) [0x7f26224ee873]
 8: (()+0xb596e) [0x7f26224ee96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 15: (main()+0x1c19) [0x4835c9]
 16: (__libc_start_main()+0xed) [0x7f2621b8976d]
 17: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780  1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 6: (main()+0x1c19) [0x4835c9]
 7: (__libc_start_main()+0xed) [0x7fb08b8d576d]
 8: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -25> 2013-07-24 09:42:57.933545 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_dump hook 0x1a13010
   -24> 2013-07-24 09:42:57.933581 7fb08d67a780  5 asok(0x1a1e000)
register_command 1 hook 0x1a13010
   -23> 2013-07-24 09:42:57.933584 7fb08d67a780  5 asok(0x1a1e000)
register_command perf dump hook 0x1a13010
   -22> 2013-07-24 09:42:57.933592 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_schema hook 0x1a13010
   -21> 2013-07-24 09:42:57.933595 7fb08d67a780  5 asok(0x1a1e000)
register_command 2 hook 0x1a13010
   -20> 2013-07-24 09:42:57.933597 7fb08d67a780  5 asok(0x1a1e000)
register_command perf schema hook 0x1a13010
   -19> 2013-07-24 09:42:57.933601 7fb08d67a780  5 asok(0x1a1e000)
register_command config show hook 0x1a13010
   -18> 2013-07-24 09:42:57.933604 7fb08d67a780  5 asok(0x1a1e000)
register_command config set hook 0x1a13010
   -17> 2013-07-24 09:42:57.933606 7fb08d67a780  5 asok(0x1a1e000)
register_command log flush hook 0x1a13010
   -16> 2013-07-24 09:42:57.933609 7fb08d67a780  5 asok(0x1a1e00

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread peter

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:
Any news on this? I'm not sure if you guys received the link to the 
log
and monitor files. One monitor and osd is still crashing with the 
error

below.


I think you are seeing this issue: http://tracker.ceph.com/issues/5737

You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/

That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied the 
above package and the problem was solved. Then 0.61.6 was released with 
a fix for this issue. I installed 0.61.6 and the issue is back on one of 
my monitors and I have one osd crashing. So, it seems the bug is still 
there in 0.61.6 or it is a new bug. It seems the guys from Inktank 
haven't picked this up yet.


Regards,





On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage,

I just had a 0.61.6 monitor crash and one osd. The mon and all osds
restarted just fine after the update but it decided to crash after 
15

minutes orso. See a snippet of the logfile below. I have you sent a
link to the logfiles and monitor store. It seems the bug hasn't been
fully fixed or something else is going on. I have to note though 
that

I had one monitor with a clock skew warning for a few minutes (this
happened because of a reboot it was fixed by ntp). So beware when
upgrading.

Cheers,

mon:

--- begin dump of recent events ---
 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal
(Aborted) **
 in thread 7f262392e780

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: /usr/bin/ceph-mon() [0x597cfa]
 2: (()+0xfcb0) [0x7f2622fc8cb0]
 3: (gsignal()+0x35) [0x7f2621b9e425]
 4: (abort()+0x17b) [0x7f2621ba1b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) 
[0x7f26224f069d]

 6: (()+0xb5846) [0x7f26224ee846]
 7: (()+0xb5873) [0x7f26224ee873]
 8: (()+0xb596e) [0x7f26224ee96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 15: (main()+0x1c19) [0x4835c9]
 16: (__libc_start_main()+0xed) [0x7f2621b8976d]
 17: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780  1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 6: (main()+0x1c19) [0x4835c9]
 7: (__libc_start_main()+0xed) [0x7fb08b8d576d]
 8: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -25> 2013-07-24 09:42:57.933545 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_dump hook 0x1a13010
   -24> 2013-07-24 09:42:57.933581 7fb08d67a780  5 asok(0x1a1e000)
register_command 1 hook 0x1a13010
   -23> 2013-07-24 09:42:57.933584 7fb08d67a780  5 asok(0x1a1e000)
register_command perf dump hook 0x1a13010
   -22> 2013-07-24 09:42:57.933592 7fb08d67a780  5 asok(0x1a1e000)
register_command perfcounters_schema hook 0x1a13010
   -21> 2013-07-24 09:42:57.933595 7fb08d67a780  5 asok(0x1a1e000)
register_command 2 hook 0x1a13010
   -20> 2013-07-24 09:42:57.933597 7fb08d

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread Wido den Hollander

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:

Any news on this? I'm not sure if you guys received the link to the log
and monitor files. One monitor and osd is still crashing with the error
below.


I think you are seeing this issue: http://tracker.ceph.com/issues/5737

You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/


That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied the
above package and the problem was solved. Then 0.61.6 was released with
a fix for this issue. I installed 0.61.6 and the issue is back on one of
my monitors and I have one osd crashing. So, it seems the bug is still
there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev list:

"This is fixed in the cuttlefish branch as of earlier this afternoon. 
I've spent most of the day expanding the automated test suite to include 
upgrade combinations to trigger this and *finally* figured out that this 
particular problem seems to surface on clusters that upgraded from 
bobtail-> cuttlefish but not clusters created on cuttlefish.


If you've run into this issue, please use the cuttlefish branch build 
for now.  We will have a release out in the next day or so that includes 
this and a few other pending fixes.


I'm sorry we missed this one!  The upgrade test matrix I've been working 
on today should catch this type of issue in the future."


Wido


Regards,





On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage,

I just had a 0.61.6 monitor crash and one osd. The mon and all osds
restarted just fine after the update but it decided to crash after 15
minutes orso. See a snippet of the logfile below. I have you sent a
link to the logfiles and monitor store. It seems the bug hasn't been
fully fixed or something else is going on. I have to note though that
I had one monitor with a clock skew warning for a few minutes (this
happened because of a reboot it was fixed by ntp). So beware when
upgrading.

Cheers,

mon:

--- begin dump of recent events ---
 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal
(Aborted) **
 in thread 7f262392e780

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: /usr/bin/ceph-mon() [0x597cfa]
 2: (()+0xfcb0) [0x7f2622fc8cb0]
 3: (gsignal()+0x35) [0x7f2621b9e425]
 4: (abort()+0x17b) [0x7f2621ba1b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d]
 6: (()+0xb5846) [0x7f26224ee846]
 7: (()+0xb5873) [0x7f26224ee873]
 8: (()+0xb596e) [0x7f26224ee96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 15: (main()+0x1c19) [0x4835c9]
 16: (__libc_start_main()+0xed) [0x7f2621b8976d]
 17: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780  1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 6: (main()+0x1c19) [0x4835c9]
 7: (__l

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread peter

On 2013-07-25 12:08, Wido den Hollander wrote:

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:
Any news on this? I'm not sure if you guys received the link to the 
log
and monitor files. One monitor and osd is still crashing with the 
error

below.


I think you are seeing this issue: 
http://tracker.ceph.com/issues/5737


You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/


That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied the
above package and the problem was solved. Then 0.61.6 was released 
with
a fix for this issue. I installed 0.61.6 and the issue is back on one 
of
my monitors and I have one osd crashing. So, it seems the bug is 
still

there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev list:

"This is fixed in the cuttlefish branch as of earlier this afternoon.
I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured out
that this particular problem seems to surface on clusters that
upgraded from bobtail-> cuttlefish but not clusters created on
cuttlefish.

If you've run into this issue, please use the cuttlefish branch build
for now.  We will have a release out in the next day or so that
includes this and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been
working on today should catch this type of issue in the future."

Wido


Regards,


We created this cluster on cuttlefish and not on bobtail so it doesn't 
apply. I'm not sure if it is clear what I am trying to say or that I'm 
missing something here but I still see this issue either way :-)


I will check out the dev list also but perhaps someone from Inktank can 
at least look at the files I provided.


Peter







On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage,

I just had a 0.61.6 monitor crash and one osd. The mon and all 
osds
restarted just fine after the update but it decided to crash after 
15
minutes orso. See a snippet of the logfile below. I have you sent 
a
link to the logfiles and monitor store. It seems the bug hasn't 
been
fully fixed or something else is going on. I have to note though 
that
I had one monitor with a clock skew warning for a few minutes 
(this

happened because of a reboot it was fixed by ntp). So beware when
upgrading.

Cheers,

mon:

--- begin dump of recent events ---
 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught 
signal

(Aborted) **
 in thread 7f262392e780

 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
 1: /usr/bin/ceph-mon() [0x597cfa]
 2: (()+0xfcb0) [0x7f2622fc8cb0]
 3: (gsignal()+0x35) [0x7f2621b9e425]
 4: (abort()+0x17b) [0x7f2621ba1b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) 
[0x7f26224f069d]

 6: (()+0xb5846) [0x7f26224ee846]
 7: (()+0xb5873) [0x7f26224ee873]
 8: (()+0xb596e) [0x7f26224ee96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
 13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
 14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
 15: (main()+0x1c19) [0x4835c9]
 16: (__libc_start_main()+0xed) [0x7f2621b8976d]
 17: /usr/bin/ceph-mon() [0x485eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780  0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780  1 mon.ceph3@-1(probing) 
e1

preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' 
thread

7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAI

[ceph-users] A lot of pools?

2013-07-25 Thread Dzianis Kahanovich
I am thinking of making a pool per user (primarily for cephfs; for security, quotas, etc.),
hundreds of them or even more. But I remember 2 facts:
1) info in the manual about a slowdown with many pools;
2) something in a later changelog about hashed pool IDs (?).

How do things stand now with large numbers of pools?
And how can serious overhead be avoided?

-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread Joao Eduardo Luis

On 07/25/2013 11:20 AM, pe...@2force.nl wrote:

On 2013-07-25 12:08, Wido den Hollander wrote:

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:

Any news on this? I'm not sure if you guys received the link to the
log
and monitor files. One monitor and osd is still crashing with the
error
below.


I think you are seeing this issue: http://tracker.ceph.com/issues/5737

You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/



That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied the
above package and the problem was solved. Then 0.61.6 was released with
a fix for this issue. I installed 0.61.6 and the issue is back on one of
my monitors and I have one osd crashing. So, it seems the bug is still
there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev list:

"This is fixed in the cuttlefish branch as of earlier this afternoon.
I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured out
that this particular problem seems to surface on clusters that
upgraded from bobtail-> cuttlefish but not clusters created on
cuttlefish.

If you've run into this issue, please use the cuttlefish branch build
for now.  We will have a release out in the next day or so that
includes this and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been
working on today should catch this type of issue in the future."

Wido


Regards,


We created this cluster on cuttlefish and not on bobtail so it doesn't
apply. I'm not sure if it is clear what I am trying to say or that I'm
missing something here but I still see this issue either way :-)

I will check out the dev list also but perhaps someone from Inktank can
at least look at the files I provided.


Peter,

We did take a look at your files (thanks a lot btw!), and as of last 
night's patches (which are now on the cuttlefish branch), your store 
worked just fine.


As Sage mentioned on ceph-devel, one of the issues would only happen on 
a bobtail -> cuttlefish cluster.  That is not your issue though.  I 
believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one 
reported on #5737.


Your issue however was caused by a bug in a patch meant to fix #5704.
It caused an on-disk key to be erroneously updated with a value for a
version that did not yet exist at the time update_from_paxos() was
called.  In a nutshell, one of the latest patches (see
115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another
patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it.


A point-release should come out soon, but in the mean time the 
cuttlefish branch should be safe to use.


If you run into any other issues, please let us know.

  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread peter

On 2013-07-25 15:21, Joao Eduardo Luis wrote:

On 07/25/2013 11:20 AM, pe...@2force.nl wrote:

On 2013-07-25 12:08, Wido den Hollander wrote:

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:
Any news on this? I'm not sure if you guys received the link to 
the

log
and monitor files. One monitor and osd is still crashing with the
error
below.


I think you are seeing this issue: 
http://tracker.ceph.com/issues/5737


You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/



That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied 
the
above package and the problem was solved. Then 0.61.6 was released 
with
a fix for this issue. I installed 0.61.6 and the issue is back on 
one of
my monitors and I have one osd crashing. So, it seems the bug is 
still

there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev 
list:


"This is fixed in the cuttlefish branch as of earlier this 
afternoon.

I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured 
out

that this particular problem seems to surface on clusters that
upgraded from bobtail-> cuttlefish but not clusters created on
cuttlefish.

If you've run into this issue, please use the cuttlefish branch 
build

for now.  We will have a release out in the next day or so that
includes this and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been
working on today should catch this type of issue in the future."

Wido


Regards,


We created this cluster on cuttlefish and not on bobtail so it 
doesn't
apply. I'm not sure if it is clear what I am trying to say or that 
I'm

missing something here but I still see this issue either way :-)

I will check out the dev list also but perhaps someone from Inktank 
can

at least look at the files I provided.


Peter,

We did take a look at your files (thanks a lot btw!), and as of last
night's patches (which are now on the cuttlefish branch), your store
worked just fine.

As Sage mentioned on ceph-devel, one of the issues would only happen
on a bobtail -> cuttlefish cluster.  That is not your issue though.  I
believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one
reported on #5737.

Your issue however was caused by a bug on a patch meant to fix #5704.
It made an on-disk key to be updated erroneously with a value for a
version that did not yet existed at the time update_from_paxos() was
called.  In a nutshell, one of the latest patches (see
115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another
patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it.

A point-release should come out soon, but in the mean time the
cuttlefish branch should be safe to use.

If you run into any other issues, please let us know.

  -Joao


Hi Joao,

I installed the packages from that branch but I still see the same 
crashes:


root@ceph3:~/ceph# ceph-mon -v
ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)

root@ceph3:~/ceph# ceph-osd -v
ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)


Both the monitor and one of the three osds (on that host) still crash on
startup. I must be doing something wrong if it works for you...


OSD:

--- begin dump of recent events ---
 0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal 
(Aborted) **

 in thread 7f8172241700

 ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)

 1: /usr/bin/ceph-osd() [0x79430a]
 2: (()+0xfcb0) [0x7f81833e1cb0]
 3: (gsignal()+0x35) [0x7f81814af425]
 4: (abort()+0x17b) [0x7f81814b2b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d]
 6: (()+0xb5846) [0x7f8181dff846]
 7: (()+0xb5873) [0x7f8181dff873]
 8: (()+0xb596e) [0x7f8181dff96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1df) [0x84618f]

 10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48]
 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, 
std::allocator > >*)+0x11d) [0x63d77d]
 12: (OSD::process_peering_events(std::list > 
const&, ThreadPool::TPHandle&)+0x244) [0x63ded4]
 13: (OSD::PeeringWQ::_process(std::list > 
const&, ThreadPool::TPHandle&)+0x12) [0x678c52]

 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83b5c6]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x83d3f0]
 16: (()+0x7e9a) [0x7f81833d9e9a]
 17: (clone()+0x6d) [0x7f818156cccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locke

[ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster

2013-07-25 Thread Josh Holland
Hi List,

I've been having issues getting mons deployed following the
ceph-deploy instructions here[0]. My steps were:

 $ ceph-deploy new host{1..3}
 $ vi ceph.conf # Add in public network/cluster network details, as
well as change the mon IPs to those on the correct interface
 $ ceph-deploy install host{1..3}
 $ ceph-deploy mon create host{1..3}

The next step would be to run "ceph-deploy gatherkeys host1", but this
fails, as the /var/lib/ceph/bootstrap-{osd,mds} directories are both
empty. Checking the logs in /var/log/ceph uncovers an assertion
failure as in the referenced bugs[1][2], which ought to have been
fixed in version 0.61.5. I have version 0.61.6 running on Ubuntu 13.04
hosts, so I'm at a loss for why this is happening. I've tried with and
without the "public network" variable being set, but it fails in the
same way either way round.

Any help much appreciated,
Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread Joao Eduardo Luis

On 07/25/2013 02:39 PM, pe...@2force.nl wrote:

On 2013-07-25 15:21, Joao Eduardo Luis wrote:

On 07/25/2013 11:20 AM, pe...@2force.nl wrote:

On 2013-07-25 12:08, Wido den Hollander wrote:

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:

Any news on this? I'm not sure if you guys received the link to the
log
and monitor files. One monitor and osd is still crashing with the
error
below.


I think you are seeing this issue:
http://tracker.ceph.com/issues/5737

You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/




That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied the
above package and the problem was solved. Then 0.61.6 was released
with
a fix for this issue. I installed 0.61.6 and the issue is back on
one of
my monitors and I have one osd crashing. So, it seems the bug is still
there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev list:

"This is fixed in the cuttlefish branch as of earlier this afternoon.
I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured out
that this particular problem seems to surface on clusters that
upgraded from bobtail-> cuttlefish but not clusters created on
cuttlefish.

If you've run into this issue, please use the cuttlefish branch build
for now.  We will have a release out in the next day or so that
includes this and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been
working on today should catch this type of issue in the future."

Wido


Regards,


We created this cluster on cuttlefish and not on bobtail so it doesn't
apply. I'm not sure if it is clear what I am trying to say or that I'm
missing something here but I still see this issue either way :-)

I will check out the dev list also but perhaps someone from Inktank can
at least look at the files I provided.


Peter,

We did take a look at your files (thanks a lot btw!), and as of last
night's patches (which are now on the cuttlefish branch), your store
worked just fine.

As Sage mentioned on ceph-devel, one of the issues would only happen
on a bobtail -> cuttlefish cluster.  That is not your issue though.  I
believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one
reported on #5737.

Your issue however was caused by a bug on a patch meant to fix #5704.
It made an on-disk key to be updated erroneously with a value for a
version that did not yet existed at the time update_from_paxos() was
called.  In a nutshell, one of the latest patches (see
115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another
patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it.

A point-release should come out soon, but in the mean time the
cuttlefish branch should be safe to use.

If you run into any other issues, please let us know.

  -Joao


Hi Joao,

I installed the packages from that branch but I still see the same crashes:

root@ceph3:~/ceph# ceph-mon -v
ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3)
root@ceph3:~/ceph# ceph-osd -v
ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3)

Both monitor and one of three osds (on that host) still crash on
startup. I must be doing something wrong if it works for you...


Yep.  Your monitors are on the wrong branch.

28720b0b4d55ef98f3b7d0855b18339e75f759e3 is wip-5737-cuttlefish's head. 
 That branch lacks an essential patch.  You should be running on the 
cuttlefish branch instead (24a56a9637afd8c64b71d264359c78a25d52be02).
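
(In practice that just means pointing the apt source at .../ref/cuttlefish/
instead of .../ref/wip-5737-cuttlefish/ on gitbuilder, upgrading, and checking
that the reported build matches, e.g.:

# ceph-mon -v
ceph version 0.61.6-15-g24a56a9 (24a56a9637afd8c64b71d264359c78a25d52be02)

the exact version string will move forward as the branch gets new commits.)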


  -Joao



OSD:

--- begin dump of recent events ---
  0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal
(Aborted) **
  in thread 7f8172241700

  ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3)
  1: /usr/bin/ceph-osd() [0x79430a]
  2: (()+0xfcb0) [0x7f81833e1cb0]
  3: (gsignal()+0x35) [0x7f81814af425]
  4: (abort()+0x17b) [0x7f81814b2b8b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d]
  6: (()+0xb5846) [0x7f8181dff846]
  7: (()+0xb5873) [0x7f8181dff873]
  8: (()+0xb596e) [0x7f8181dff96e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x84618f]
  10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48]
  11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set,
std::less >,
std::allocator > >*)+0x11d) [0x63d77d]
  12: (OSD::process_peering_events(std::list >
const&, ThreadPool::TPHandle&)+0x244) [0x63ded4]
  13: (OSD::PeeringWQ::_process(std::list >
const&, ThreadPool::TPHandle&)+0x12) [0x678c52]
  14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83b5c6]
  15: (ThreadPool::WorkThread::entry()+0x10)

Re: [ceph-users] v0.61.6 Cuttlefish update released

2013-07-25 Thread peter

On 2013-07-25 15:55, Joao Eduardo Luis wrote:

On 07/25/2013 02:39 PM, pe...@2force.nl wrote:

On 2013-07-25 15:21, Joao Eduardo Luis wrote:

On 07/25/2013 11:20 AM, pe...@2force.nl wrote:

On 2013-07-25 12:08, Wido den Hollander wrote:

On 07/25/2013 12:01 PM, pe...@2force.nl wrote:

On 2013-07-25 11:52, Wido den Hollander wrote:

On 07/25/2013 11:46 AM, pe...@2force.nl wrote:
Any news on this? I'm not sure if you guys received the link to 
the

log
and monitor files. One monitor and osd is still crashing with 
the

error
below.


I think you are seeing this issue:
http://tracker.ceph.com/issues/5737

You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/




That should resolve it.

Wido


Hi Wido,

This is the same issue I reported earlier with 0.61.5. I applied 
the
above package and the problem was solved. Then 0.61.6 was 
released

with
a fix for this issue. I installed 0.61.6 and the issue is back on
one of
my monitors and I have one osd crashing. So, it seems the bug is 
still
there in 0.61.6 or it is a new bug. It seems the guys from 
Inktank

haven't picked this up yet.



It has been picked up, Sage mentioned this yesterday on the dev 
list:


"This is fixed in the cuttlefish branch as of earlier this 
afternoon.

I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured 
out

that this particular problem seems to surface on clusters that
upgraded from bobtail-> cuttlefish but not clusters created on
cuttlefish.

If you've run into this issue, please use the cuttlefish branch 
build

for now.  We will have a release out in the next day or so that
includes this and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been
working on today should catch this type of issue in the future."

Wido


Regards,


We created this cluster on cuttlefish and not on bobtail so it 
doesn't
apply. I'm not sure if it is clear what I am trying to say or that 
I'm

missing something here but I still see this issue either way :-)

I will check out the dev list also but perhaps someone from Inktank 
can

at least look at the files I provided.


Peter,

We did take a look at your files (thanks a lot btw!), and as of last
night's patches (which are now on the cuttlefish branch), your store
worked just fine.

As Sage mentioned on ceph-devel, one of the issues would only happen
on a bobtail -> cuttlefish cluster.  That is not your issue though.  
I
believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the 
one

reported on #5737.

Your issue however was caused by a bug on a patch meant to fix 
#5704.

It made an on-disk key to be updated erroneously with a value for a
version that did not yet existed at the time update_from_paxos() was
called.  In a nutshell, one of the latest patches (see
115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and 
another
patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around 
it.


A point-release should come out soon, but in the mean time the
cuttlefish branch should be safe to use.

If you run into any other issues, please let us know.

  -Joao


Hi Joao,

I installed the packages from that branch but I still see the same 
crashes:


root@ceph3:~/ceph# ceph-mon -v
ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)

root@ceph3:~/ceph# ceph-osd -v
ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)


Both monitor and one of three osds (on that host) still crash on
startup. I must be doing something wrong if it works for you...


Yep.  Your monitors are on the wrong branch.

28720b0b4d55ef98f3b7d0855b18339e75f759e3 is wip-5737-cuttlefish's
head.  That branch lacks an essential patch.  You should be running on
the cuttlefish branch instead
(24a56a9637afd8c64b71d264359c78a25d52be02).

  -Joao





Ah yes, I see now. Ok, this worked for the mon, it is running again. 
The osd is still crashing, though. Any ideas on that?


root@ceph3:~/ceph# ceph-osd -v
ceph version 0.61.6-15-g24a56a9 
(24a56a9637afd8c64b71d264359c78a25d52be02)

root@ceph3:~/ceph# ceph-mon -v
ceph version 0.61.6-15-g24a56a9 
(24a56a9637afd8c64b71d264359c78a25d52be02)






OSD:

--- begin dump of recent events ---
  0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal
(Aborted) **
  in thread 7f8172241700

  ceph version 0.61.6-1-g28720b0 
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)

  1: /usr/bin/ceph-osd() [0x79430a]
  2: (()+0xfcb0) [0x7f81833e1cb0]
  3: (gsignal()+0x35) [0x7f81814af425]
  4: (abort()+0x17b) [0x7f81814b2b8b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) 
[0x7f8181e0169d]

  6: (()+0xb5846) [0x7f8181dff846]
  7: (()+0xb5873) [0x7f8181dff873]
  8: (()+0xb596e) [0x7f8181dff96e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x84618f]
  10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48]
  11: (OSD::a

Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster

2013-07-25 Thread Josh Holland
Links I forgot to include the first time:
[0] http://ceph.com/docs/master/rados/deployment/ceph-deploy-install/
[1] http://tracker.ceph.com/issues/5195
[2] http://tracker.ceph.com/issues/5205

Apologies for the noise,
Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] testing ceph - very slow write performances

2013-07-25 Thread Sébastien RICCIO

Hi ceph-users,

I'm currently evaluating ceph for a project and I'm getting quite low
write performance, so please, if you have time, read this post and
give me some advice :)


My test setup, using some spare hardware we have lying around in our datacenter:

Three ceph server nodes (each one running a monitor and two OSDs)
and one client node.


Hardware of a node: (supermicro stuff)

Intel(R) Xeon(R) CPU X3440  @ 2.53GHz (total of 8 logical cores)
2 x Western Digital Caviar Black 1 TB (WD1003FBYX-01Y7B0)
32 GB RAM DDR3
2 x Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

Hardware of the client: (A dell Blade M610)
Dual Intel(R) Xeon(R) CPU E5620  @ 2.40GHz (total of 16 logical cores)
64 GB RAM DDR3
4 x Ethernet controller: Broadcom Corporation NetXtreme II BCM5709S 
Gigabit Ethernet (rev 20)
2 x Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 
10-Gigabit PCIe


OS of the server nodes:

Ubuntu 12.04.2 LTS
Kernel 3.10.0-031000-generic #201306301935 SMP Sun Jun 30 23:36:16 UTC 
2013 x86_64 x86_64 x86_64 GNU/Linux


OS of the client node:
CentOS release 6.4
Kernel 3.10.1-1.el6xen.x86_64 #1 SMP Sun Jul 14 11:05:42 EST 2013 x86_64 
x86_64 x86_64 GNU/Linux


How I set up the OS (server nodes):

I know this isn't ideal, but as there are only two disks in the machine I've
partitioned the disks and used them for both the OS and the OSDs;
for a test run it shouldn't be that bad...


Disks layout:

partition 1: mdadm raid 1 member for the OS (30 GB)
partition 2: mdadm raid 1 member for some swap space (shouldn't be used anyway...)
partition 3: reserved for an xfs partition for the OSDs

Ceph installation:
Tried both cuttlefish (0.56) and testing (0.66).
Deployed using ceph-deploy from an admin node running on a xenserver 6.2 VM.

#ceph-deploy new ceph01 ceph02 ceph03
(edited some ceph.conf stuff)
#ceph-deploy install --stable cuttlefish ceph01 ceph02 ceph03
#ceph-deploy mon create ceph01 ceph02 ceph03
#ceph-deploy gatherkeys ceph01
#ceph-deploy osd create ceph01:/dev/sda3 ceph01:/dev/sdb3 
ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3
#ceph-deploy osd activate ceph01:/dev/sda3 ceph01:/dev/sdb3 
ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3


ceph-admin:~/cephstore$ ceph status
   health HEALTH_OK
   monmap e1: 3 mons at {ceph01=10.111.80.1:6789/0,ceph02=10.111.80.2:6789/0,ceph03=10.111.80.3:6789/0}, election epoch 6, quorum 0,1,2 ceph01,ceph02,ceph03
   osdmap e26: 6 osds: 6 up, 6 in
   pgmap v258: 192 pgs: 192 active+clean; 1000 MB data, 62212 MB used, 5346 GB / 5407 GB avail
   mdsmap e1: 0/0/1 up

Now let's do some performance testing from the client, accessing a rbd 
on the cluster.


#rbd create test --size 2
#rbd map test

raw write test (ouch something is wrong here)
#dd if=/dev/zero of=/dev/rbd1 bs=1024k count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 146.051 s, 7.2 MB/s

raw read test (this seems quite ok for a gbit network)
#dd if=/dev/rbd1 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6368 s, 76.9 MB/s


Trying to find the bottleneck

network throughput testing between client and nodes (not 100% efficiency,
but not that bad):


[  3] local 10.111.80.1 port 37497 connected with 10.111.10.105 port 5001
[  3]  0.0-10.0 sec   812 MBytes   681 Mbits/sec

[  3] local 10.111.80.2 port 55912 connected with 10.111.10.105 port 5001
[  3]  0.0-10.0 sec   802 MBytes   673 Mbits/sec

[  3] local 10.111.80.3 port 45188 connected with 10.111.10.105 port 5001
[  3]  0.0-10.1 sec   707 MBytes   589 Mbits/sec

[  3] local 10.111.10.105 port 43103 connected with 10.111.80.1 port 5001
[  3]  0.0-10.2 sec   730 MBytes   601 Mbits/sec

[  3] local 10.111.10.105 port 44656 connected with 10.111.80.2 port 5001
[  3]  0.0-10.0 sec   871 MBytes   730 Mbits/sec

[  3] local 10.111.10.105 port 40455 connected with 10.111.80.3 port 5001
[  3]  0.0-10.0 sec  1005 MBytes   843 Mbits/sec


Disk throughput on the ceph nodes

/var/lib/ceph/osd/ceph-0$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 7.96581 s, 132 MB/s

/var/lib/ceph/osd/ceph-1$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 7.91835 s, 132 MB/s

/var/lib/ceph/osd/ceph-2$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 7.55287 s, 139 MB/s

/var/lib/ceph/osd/ceph-3$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 7.67281 s, 137 MB/s

/var/lib/ceph/osd/ceph-4$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 8.13862 s, 129 MB/s

/var/lib/ceph/osd/ceph-5$ sudo dd if=/dev/zero of=test bs=1024k 
count=1000 oflag=direct

1048576000 bytes (1.0 GB) copied, 7.72034 s, 136 MB/s

Actually I don't know what else to check.
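
One more thing I could still try (a sketch; I'd have to check that these
rados bench options exist in my version) is benchmarking RADOS directly, to
separate the raw cluster write path from the rbd/kernel client path:

# 30 seconds of 4 MB writes with 16 concurrent ops, keeping the objects
rados bench -p rbd 30 write --no-cleanup
# then read the same objects back sequentially
rados bench -p rbd 30 seq

Also, part of the gap is probably expected: with oflag=direct each 1 MB write
is synchronous, replicated, and journaled on the same spindle as the data, so
the client will see far less than the raw 130 MB/s of a single disk.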

So let me ask if that 

Re: [ceph-users] A lot of pools?

2013-07-25 Thread Gregory Farnum
On Thursday, July 25, 2013, Dzianis Kahanovich wrote:

> I think to make pool-per-user (primary for cephfs; for security, quota,
> etc),
> hundreds (or even more) of them. But I remember 2 facts:
> 1) info in manual about slowdown on many pools;


Yep, this is still a problem; pool-per-user isn't going to work well unless
your users are covering truly prodigious amounts of space.

There is a new feature in raw RADOS that lets you specify a separate
"namespace" and set access capabilities based on those; it is being worked
up through the rest of the stack now.
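
(As a rough illustration of where that is heading, and not something to rely
on today, the idea is that a client's cap could eventually be confined to a
namespace inside a shared pool, along the lines of:

ceph auth get-or-create client.alice mon 'allow r' osd 'allow rw pool=data namespace=alice'

so you get per-user isolation without paying the per-pool overhead.)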


> 2) something in later changelog about hashed pool IDs (?).


This doesn't impact things one way or the other.
-Greg



>
> How about now and numbers of pools?
> And how to avoid serious overheads?
>
> --
> WBR, Dzianis Kahanovich AKA Denis Kaganovich,
> http://mahatma.bspu.unibel.by/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster

2013-07-25 Thread Sage Weil
On Thu, 25 Jul 2013, Josh Holland wrote:
> Hi List,
> 
> I've been having issues getting mons deployed following the
> ceph-deploy instructions here[0]. My steps were:
> 
>  $ ceph-deploy new host{1..3}
>  $ vi ceph.conf # Add in public network/cluster network details, as
> well as change the mon IPs to those on the correct interface
>  $ ceph-deploy install host{1..3}
>  $ ceph-deploy mon create host{1..3}
> 
> The next step would be to run "ceph-deploy gatherkeys host1", but this
> fails, as the /var/lib/ceph/bootstrap-{osd,mds} directories are both
> empty. Checking the logs in /var/log/ceph uncovers an assertion
> failure as in the referenced bugs[1][2], which ought to have been
> fixed in version 0.61.5. I have version 0.61.6 running on Ubuntu 13.04
> hosts, so I'm at a loss for why this is happening. I've tried with and
> without the "public network" variable being set, but it fails in the
> same way either way round.

I suspect the difference here is that the DNS names you are specifying in 
ceph-deploy new do not match.  Are you adjusting the 'mon host' line in 
ceph.conf?  Note that you can specify an FQDN to ceph-deploy new and it 
will take the first name to be the hostname, or you can specify 
'ceph-deploy new name:fqdn_or_ip'.
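
For example, a concrete invocation of the second form (the host names and 
addresses below are only placeholders) would be:

 ceph-deploy new host1:10.0.1.1 host2:10.0.1.2 host3:10.0.1.3

presumably with the part before the colon used as the mon's name and the part 
after it as the address it is reached on.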

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Xen-API] The vdi is not available

2013-07-25 Thread Sébastien RICCIO
mount.nfs 10.254.253.9:/xen/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 -o 
soft,timeo=133,retrans=2147483647,tcp,noac


this gives mount -o doesn't exist

Moya Solutions, Inc.
am...@moyasolutions.com
0 | 646-918-5238 x 102
F | 646-390-1806

- Original Message -
From: "Sébastien RICCIO" 
To: "Andres E. Moya" , xen-...@lists.xen.org
Sent: Thursday, July 25, 2013 1:08:02 PM
Subject: Re: [Xen-API] The vdi is not available

I don't get why it's not mounting with the uuid subdir. It should.

On our pool:

Jul 25 10:13:39 xen-blade10 SM: [30890] ['mount.nfs', 
'10.50.50.11:/storage/nfs1/cc744878-9d79-37df-98cb-cd88eebdab61', 
'/var/run/sr-mount/cc744878-9d79-37df-98cb-cd88eebdab61', '-o', 
'soft,timeo=133,retrans=2147483647,tcp,actimeo=0']


as a temporary dirty fix you could try:

umount  /var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206
mount.nfs  10.254.253.9:/xen/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 -o 
soft,timeo=133,retrans=2147483647,tcp,noac


to manually remount it correctly


On 25.07.2013 18:48, Andres E. Moya wrote:

I restarted and tried to unplug and got the same message, here is the grep


[root@nj-xen-04 ~]# grep mount.nfs /var/log/SMlog
[31636] 2013-07-24 16:43:54.140961  ['mount.nfs', 
'10.254.253.9:/secondary', 
'/var/run/sr-mount/f21def12-74a2-8fab-1e1c-f41968e889bb', '-o', 
'soft,timeo=133,retrans=2147483647,tcp,noac']
[9277] 2013-07-25 12:36:42.416286   ['mount.nfs', '10.254.253.9:/iso', 
'/var/run/sr-mount/fbfbf5b3-a37a-288a-86aa-d8d168173f98', '-o', 
'soft,timeo=133,retrans=2147483647,tcp,noac']
[9393] 2013-07-25 12:36:43.241531   ['mount.nfs', '10.254.253.9:/xen', 
'/var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206', '-o', 
'soft,timeo=133,retrans=2147483647,tcp,noac']


- Original Message -
From: "Sébastien RICCIO" 
To: "Andres E. Moya" , xen-...@lists.xen.org
Sent: Thursday, July 25, 2013 12:29:24 PM
Subject: Re: [Xen-API] The vdi is not available

Okay, in this case try rebooting the server, and check whether that fixed
the mount.

If not you should "grep mount.nfs /var/log/SMlog" and look what command
line XS use to mount your storage.


On 25.07.2013 18:22, Andres E. Moya wrote:

there are no tasks/ returns empty

Moya Solutions, Inc.
am...@moyasolutions.com
0 | 646-918-5238 x 102
F | 646-390-1806

- Original Message -
From: "Sébastien RICCIO" 
To: "Andres E. Moya" , xen-...@lists.xen.org
Sent: Thursday, July 25, 2013 12:20:05 PM
Subject: Re: [Xen-API] The vdi is not available

xe task-list uuid=9c7b7690-a301-41ef-b7d5-d4abd8b70fbc

If it returns something

xe task-cancel uuid=9c7b7690-a301-41ef-b7d5-d4abd8b70fbc

then try again to unplug the pbd

OR

if nothing is running on the server, consider trying a reboot

Sorry this is hard to debug remotely.

On 25.07.2013 18:10, Andres E. Moya wrote:

xe pbd-unplug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7
This operation cannot be performed because this VDI is in use by some other 
operation
vdi: 96c158d3-2b31-41d1-8287-aa9fb6d5eb6c (Windows Server 2003 0)
operation: 9c7b7690-a301-41ef-b7d5-d4abd8b70fbc (Windows 7 (64-bit) (1) 0)
: 405f6cce-d750-47e1-aec3-c8f8f3ae6290 (Plesk Management 0)
: dad9b85a-ee2f-4b48-94f0-79db8dfd78dd (mx5 0)
: 13b558f8-0c3f-4df9-8766-d8e1306b25d5 (Windows Server 2008 R2 (64-bit) 
(1) 0)

this was done on the server that has nothing running on it

Moya Solutions, Inc.
am...@moyasolutions.com
0 | 646-918-5238 x 102
F | 646-390-1806

- Original Message -
From: "Sébastien RICCIO" 
To: "Andres E. Moya" 
Cc: xen-...@lists.xen.org
Sent: Thursday, July 25, 2013 12:02:12 PM
Subject: Re: [Xen-API] The vdi is not available

This looks correct. You should maybe try to unplug / replug the storage
on the server where it's wrong.

for example if it's on nj-xen-03:

pbd-unplug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7
then
pbd-plug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7

and check if it's then mounted  the right way.


On 25.07.2013 17:36, Andres E. Moya wrote:

[root@nj-xen-01 ~]# xe pbd-list sr-uuid=9f9aa794-86c0-9c36-a99d-1e5fdc14a206
uuid ( RO)  : c53d12f6-c3a6-0ae2-75fb-c67c761b2716
 host-uuid ( RO): b8ca0c69-6023-48c5-9b61-bd5871093f4e
   sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206
 device-config (MRO): serverpath: /xen; options: ; server: 
10.254.253.9
currently-attached ( RO): true


uuid ( RO)  : a0739a97-408b-afed-7ac2-fe76ffec3ee7
 host-uuid ( RO): a464b853-47d7-4756-b9ab-49cb00c5aebb
   sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206
 device-config (MRO): serverpath: /xen; options: ; server: 
10.254.253.9
currently-attached ( RO): true


uuid ( RO)  : 6f2c0e7d-fdda-e406-c2e1-d4ef81552b17
 host-uuid ( RO): dab9cd1a-7ca8-4441-a78f-445580d851d2
   sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206
 device-config (MRO): serverpath: /xen; options: ; server: 
10.

Re: [ceph-users] Flapping osd / continuously reported as failed

2013-07-25 Thread Gregory Farnum
On Thu, Jul 25, 2013 at 12:47 AM, Mostowiec Dominik
 wrote:
> Hi
> We found something else.
> After osd.72 flapp, one PG '3.54d' was recovering long time.
>
> --
> ceph health details
> HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
> pg 3.54d is active+recovering, acting [72,108,23]
> recovery 1/39821745 degraded (0.000%)
> --
>
> Last flap down/up osd.72 was 00:45.
> In logs we found:
> 2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok
> After this time is ok.
>
> It is possible that reason of flapping this osd was scrubbing?
>
> We have default scrubbing settings (ceph version 0.56.6).
> If scrubbig is the trouble-maker, can we make it a little more light by 
> changing config?

It's possible, as deep scrub in particular will add a bit of load (it
goes through and compares the object contents). Are you no longer having
any flapping issues, and did you try to find when the scrub started to
see if it matched up with your troubles?

I'd be hesitant to turn it off as scrubbing can uncover corrupt
objects etc, but you can configure it with the settings at
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing.
(Always check the surprisingly-helpful docs when you need to do some
config or operations work!)
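
As a rough sketch of the kind of tuning meant here (the option names are the 
ones on that doc page; the values are only illustrative, not recommendations):

[osd]
    osd max scrubs = 1
    osd scrub load threshold = 0.5
    osd scrub min interval = 86400
    osd scrub max interval = 604800
    osd deep scrub interval = 604800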
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.67-rc2 dumpling release candidate

2013-07-25 Thread Guido Winkelmann
Am Mittwoch, 24. Juli 2013, 22:45:55 schrieb Sage Weil:
> Go forth and test!

I just upgraded a 0.61.7 cluster to 0.67-rc2. I restarted the mons first, and 
as expected, they did not join a quorum with the 0.61.7 mons, but after all of 
the mons were restarted, there was no problem any more.

One of my three osds would not come back online after the upgrade, but that is 
probably just this btrfs bug again:

https://bugzilla.kernel.org/show_bug.cgi?id=60603

I will restart the machine tomorrow and see if it comes back.

There was one qemu client running at the time of the update doing active IO, 
and apart from a temporary dip in performance, it was not affected.

The whole thing was on Fedora 18, using Kernel 3.9.10-200.fc18.x86_64 and the 
repository under http://eu.ceph.com/rpm-testing/fc18/x86_64/, and btrfs as the 
OSD filesystem.

Guido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph monitors stuck in a loop after install with ceph-deploy

2013-07-25 Thread Sage Weil
On Wed, 24 Jul 2013, pe...@2force.nl wrote:
> On 2013-07-24 07:19, Sage Weil wrote:
> > On Wed, 24 Jul 2013, S?bastien RICCIO wrote:
> > > 
> > > Hi! While trying to install ceph using ceph-deploy the monitors nodes are
> > > stuck waiting on this process:
> > > /usr/bin/python /usr/sbin/ceph-create-keys -i a (or b or c)
> > > 
> > > I tried to run mannually the command and it loops on this:
> > > connect to /var/run/ceph/ceph-mon.a.asok failed with (2) No such file or
> > > directory
> > > INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
> > > But the existing sock on the nodes are /var/run/ceph/ceph-mon.ceph01.asok
> > > 
> > > Is that a bug in ceph-deploy or maybe my config file is wrong ?
> > 
> > It's the config file.  You no longer need to (or should) enumerate the
> > daemons in the config file; the sysvinit/upstart scripts find them in
> > /var/lib/ceph/{osd,mon,mds}/*.  See below:
> > 
> 
> Hi Sage,
> 
> Does this also apply if you didn't use ceph-deploy (and used the same
> directories for mon, osd etc)? Just curious if there are still any
> dependencies or if you still need to list those on clients for instance.

If you are using ceph-deploy, we touch a file 'sysvinit' or 'upstart' in 
/var/lib/ceph/osd/*/ that indicates which init system is responsible for 
that daemon.  If it is not present, the scan of those directories on 
startup will ignore it.

In the mkcephfs case, those files aren't present, and you need to instead 
explicitly enumerate the daemons in ceph.conf with [osd.N] sections and 
host = foo lines.  That will make the sysvinit script start/stop the 
daemons.

So,
 sysvinit: */sysvinit file or listed in ceph.conf. 
 upstart: */upstart file.
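
For instance (the paths and names below are only illustrative), an existing 
daemon can be handed over to the init-script scan by dropping the marker file 
in place, or kept in ceph.conf the mkcephfs way:

 # let the sysvinit script pick the daemon up on scan
 touch /var/lib/ceph/osd/ceph-0/sysvinit

 # or enumerate it explicitly in ceph.conf
 [osd.0]
         host = myhost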

Hope that helps!
sage

> 
> Cheers,
> 
> Peter
> 
> 
> > > Version:  ceph -v
> > > ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
> > > 
> > > Note that using "ceph" command line utility on the nodes is working. So it
> > > looks that it know the good paths...
> > > 
> > > Config file:
> > > 
> > > [global]
> > > fsid = a1394dff-94da-4ef4-a123-55d85e839ffb
> > > mon_initial_members = ceph01, ceph02, ceph03
> > > mon_host = 10.111.80.1,10.111.80.2,10.111.80.3
> > > auth_supported = cephx
> > > osd_journal_size = 1
> > > filestore_xattr_use_omap = true
> > > auth_cluster_required = none
> > > auth_service_required = none
> > > auth_client_required = none
> > > 
> > > [client]
> > > rbd_cache = true
> > > rbd_cache_size = 536870912
> > > rbd_cache_max_dirty = 134217728
> > > rbd_cache_target_dirty = 33554432
> > > rbd_cache_max_dirty_age = 5
> > > 
> > > [osd]
> > > osd_data = /var/lib/ceph/osd/ceph-$id
> > > osd_journal = /var/lib/ceph/osd/ceph-$id/journal
> > > osd_journal_size = 1
> > > osd_mkfs_type = xfs
> > > osd_mkfs_options_xfs = "-f -i size=2048"
> > > osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k"
> > > keyring = /var/lib/ceph/osd/ceph-$id/keyring.osd.$id
> > > osd_op_threads = 24
> > > osd_disk_threads = 24
> > > osd_recovery_max_active = 1
> > > journal_dio = true
> > > journal_aio = true
> > > filestore_max_sync_interval = 100
> > > filestore_min_sync_interval = 50
> > > filestore_queue_max_ops = 2000
> > > filestore_queue_max_bytes = 536870912
> > > filestore_queue_committing_max_ops = 2000
> > > filestore_queue_committing_max_bytes = 536870912
> > > osd_max_backfills = 1
> > 
> > Just drop everything from here...
> > 
> > > 
> > > [osd.0]
> > > host = ceph01
> > > 
> > > [osd.1]
> > > host = ceph01
> > > 
> > > [osd.2]
> > > host = ceph02
> > > 
> > > [osd.3]
> > > host = ceph02
> > > 
> > > [osd.4]
> > > host = ceph03
> > > 
> > > [osd.5]
> > > host = ceph03
> > > 
> > > [mon.a]
> > > host = ceph01
> > > 
> > > [mon.b]
> > > host = ceph02
> > > 
> > > [mon.c]
> > > host = ceph03
> > 
> > ...to here!
> > 
> > sage
> > 
> > 
> > 
> > > 
> > > Cheers,
> > > S?bastien
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster

2013-07-25 Thread Josh Holland
Hi Sage,

On 25 July 2013 17:21, Sage Weil  wrote:
> I suspect the difference here is that the dns names you are specifying in
> ceph-deploy new do not match.

Aha, this could well be the problem. The current DNS names resolve to
the address bound to an interface that is intended to be used mostly
for things like monitoring and SSH, not the actual storage. There is a
separate subnet for the hypervisors to talk to the cluster on (i.e.
what Ceph considers the "public network"), and one for the OSDs to
talk about OSD stuff on ("cluster network").

> Are you adjusting the 'mon host' line in
> ceph.conf?  Note that you can specify a fqdn to ceph-deploy new and it
> will take the first name to be the hostname, or you can specify
> 'ceph-deploy new name:fqdn_or_ip'.

I had been changing the "mon host" line to the current "public
network" IP addresses; re-running "ceph-deploy new" with the "public
network" IPs generates an identical config file (as far as I can tell)
and still fails with the same assertion failure.

Thanks,
Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] testing ceph - very slow write performances

2013-07-25 Thread Chris Hoy Poy
yes, those drives are horrible, and you have them partitioned etc.

- don't use MDADM for Ceph OSDs, in my experience it *does* impair performance, 
it just doesn't play nice with OSDs.
-- Ceph does its own block replication - though be careful, a size of "2" is 
not necessarily as "safe" as raid10 (lose any 2 drives vs. lose 2 specific 
drives)
- For each write, it's going to write to Ceph's journal, then that OSD is going 
to ensure that each write is synced to other journals (depending on how many 
copies you have etc) - BEFORE it returns (latency!)

If it is just a test run: try dedicating a drive to the OSD, and a drive to 
the OS. To see the impact of not having SSD journals, or the latency on second 
writes, try setting the replication size to 1 (not great/ideal - but it gives 
you an idea of how much that extra sync write for the replicated writes costs 
in performance etc). 
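
(For the replication-size-1 experiment, something like the following is the 
usual way to do it, assuming the default 'rbd' pool is the one being tested:)

ceph osd pool set rbd size 1
ceph osd pool get rbd size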

Ceph really really shines when it has solid state for its write journalling. 

The black caviar drives are not fantastic for latency either, that can have a 
significant impact (particularly for the journal!). 
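
(And if a spare SSD does become available later, ceph-deploy can place the 
journal on a separate device at OSD creation time; the device names here are 
hypothetical:)

ceph-deploy osd create ceph01:/dev/sda3:/dev/sdc1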

\\chris


- Original Message -
From: "Sébastien RICCIO" 
To: ceph-users@lists.ceph.com
Sent: Thursday, 25 July, 2013 11:27:48 PM
Subject: [ceph-users] testing ceph - very slow write performances

Hi ceph-users,

I'm currently evaluating ceph for a project and I'm getting quite low 
write performance, so if you have time, please read this post and 
give me some advice :)

My test setup uses some free hardware we have lying around in our datacenter:

Three ceph server nodes, each running a monitor and two OSDs, 
plus one client node

Hardware of a node: (supermicro stuff)

Intel(R) Xeon(R) CPU X3440  @ 2.53GHz (total of 8 logical cores)
2 x Western Digital Caviar Black 1 TB (WD1003FBYX-01Y7B0)
32 GB RAM DDR3
2 x Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

Hardware of the client: (A dell Blade M610)
Dual Intel(R) Xeon(R) CPU E5620  @ 2.40GHz (total of 16 logical cores)
64 GB RAM DDR3
4 x Ethernet controller: Broadcom Corporation NetXtreme II BCM5709S 
Gigabit Ethernet (rev 20)
2 x Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 
10-Gigabit PCIe

OS of the server nodes:

Ubuntu 12.04.2 LTS
Kernel 3.10.0-031000-generic #201306301935 SMP Sun Jun 30 23:36:16 UTC 
2013 x86_64 x86_64 x86_64 GNU/Linux

OS of the client node:
CentOS release 6.4
Kernel 3.10.1-1.el6xen.x86_64 #1 SMP Sun Jul 14 11:05:42 EST 2013 x86_64 
x86_64 x86_64 GNU/Linux

How I set up the OS (server nodes):

I know this isn't good, but as there are only two disks in the machine I've 
partitioned them and used them both for the OS and the OSDs; for a test run 
it shouldn't be that bad...

Disk layout:

partition 1: mdadm raid 1 member for the OS (30gb)
partition 2: mdadm raid 1 member for some swapspace (shouldn't be used 
anyway...)
partition 3: reserved for xfs partition for OSDs

Ceph installation:
Tried both cuttlefish (0.61) and the testing release (0.66).
Deployed using ceph-deploy from an admin node running on a xenserver 6.2 VM.

#ceph-deploy new ceph01 ceph02 ceph03
(edited some ceph.conf stuff)
#ceph-deploy install --stable cuttlefish ceph01 ceph02 ceph03
#ceph-deploy mon create ceph01 ceph02 ceph03
#ceph-deploy gatherkeys ceph01
#ceph-deploy osd create ceph01:/dev/sda3 ceph01:/dev/sdb3 
ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3
#ceph-deploy osd activate ceph01:/dev/sda3 ceph01:/dev/sdb3 
ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3

ceph-admin:~/cephstore$ ceph status
   health HEALTH_OK
   monmap e1: 3 mons at {ceph01=10.111.80.1:6789/0,ceph02=10.111.80.2:6789/0,ceph03=10.111.80.3:6789/0}, election epoch 6, quorum 0,1,2 ceph01,ceph02,ceph03
   osdmap e26: 6 osds: 6 up, 6 in
   pgmap v258: 192 pgs: 192 active+clean; 1000 MB data, 62212 MB used, 5346 GB / 5407 GB avail
   mdsmap e1: 0/0/1 up

Now let's do some performance testing from the client, accessing an rbd 
on the cluster.

#rbd create test --size 2
#rbd map test

raw write test (ouch something is wrong here)
#dd if=/dev/zero of=/dev/rbd1 bs=1024k count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 146.051 s, 7.2 MB/s

raw read test (this seems quite ok for a gbit network)
#dd if=/dev/rbd1 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6368 s, 76.9 MB/s


Trying to find the bottleneck

network testing between client and nodes (not 100% efficient, but not 
that bad)

[  3] local 10.111.80.1 port 37497 connected with 10.111.10.105 port 5001
[  3]  0.0-10.0 sec   812 MBytes   681 Mbits/sec

[  3] local 10.111.80.2 port 55912 connected with 10.111.10.105 port 5001
[  3]  0.0-10.0 sec   802 MBytes   673 Mbits/sec

[  3] local 10.111.80.3 port 45188 connected with 10.111.10.105 port 5001
[  3]  0.0-10.1 sec   707 MBytes   589 Mbits/sec

[  3] local 10.111.10.105 port 43103 conn

Re: [ceph-users] Kernel's rbd in 3.10.1

2013-07-25 Thread Josh Durgin

On 07/24/2013 09:37 PM, Mikaël Cluseau wrote:

Hi,

I have a bug in the 3.10 kernel under Debian, whether it is a self-compiled
linux-stable from git (built with make-kpkg) or the sid package.

I'm using format-2 images (ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35)) to make snapshots and clones
of a database for development purposes. So I have a replay of the
database's logs on a ceph volume and I take snapshots at fixed points
in time: mount -> recover database until a given time -> umount ->
snapshot -> go back to 1.
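
For reference, one iteration of that cycle corresponds roughly to the following
(the pool, image, snapshot and mount point names are made up):

rbd map rbd/dbvol
mount /dev/rbd1 /mnt/db
# ... replay the database logs up to the chosen point in time ...
umount /mnt/db
rbd unmap /dev/rbd1
rbd snap create rbd/dbvol@restore-point-N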

In both cases, it works for a while (mount/umount cycles) and after some
time it gives me the following error on mount :

Jul 25 15:20:46 **host** kernel: [14623.808604] [ cut here
]
Jul 25 15:20:46 **host** kernel: [14623.808622] kernel BUG at
/build/linux-dT6LW0/linux-3.10.1/net/ceph/osd_client.c:2103!
Jul 25 15:20:46 **host** kernel: [14623.808641] invalid opcode: 
[#1] SMP
Jul 25 15:20:46 **host** kernel: [14623.808657] Modules linked in: cbc
rbd libceph nfsd auth_rpcgss oid_registry nfs_acl nfs lockd sunrpc
sha256_generic hmac nls_utf8 cifs dns_resolver fscache bridge stp llc
xfs loop coretemp kvm_intel kvm crc32c_intel psmouse serio_raw snd_pcm
snd_page_alloc snd_timer snd soundcore iTCO_wdt iTCO_vendor_support
i2c_i801 i7core_edac microcode pcspkr lpc_ich mfd_core joydev ioatdma
evdev edac_core acpi_cpufreq mperf button processor thermal_sys ext4
crc16 jbd2 mbcache btrfs xor zlib_deflate raid6_pq crc32c libcrc32c
raid1 ohci_hcd hid_generic usbhid hid sr_mod sg cdrom sd_mod crc_t10dif
dm_mod md_mod ata_generic ata_piix libata uhci_hcd ehci_pci ehci_hcd
scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Jul 25 15:20:46 **host** kernel: [14623.809005] CPU: 6 PID: 9583 Comm:
mount Not tainted 3.10-1-amd64 #1 Debian 3.10.1-1
Jul 25 15:20:46 **host** kernel: [14623.809024] Hardware name:
Supermicro X8DTU/X8DTU, BIOS 2.1b   12/30/2011
Jul 25 15:20:46 **host** kernel: [14623.809041] task: 88082dfa2840
ti: 88080e2c2000 task.ti: 88080e2c2000
Jul 25 15:20:46 **host** kernel: [14623.809059] RIP:
0010:[]  []
ceph_osdc_build_request+0x370/0x3e9 [libceph]
Jul 25 15:20:46 **host** kernel: [14623.809092] RSP:
0018:88080e2c39b8  EFLAGS: 00010216
Jul 25 15:20:46 **host** kernel: [14623.809120] RAX: 88082e589a80
RBX: 88082e589b72 RCX: 0007
Jul 25 15:20:46 **host** kernel: [14623.809151] RDX: 88082e589b6f
RSI: 88082afd9078 RDI: 88082b308258
Jul 25 15:20:46 **host** kernel: [14623.809182] RBP: 1000
R08: 88082e10a400 R09: 88082afd9000
Jul 25 15:20:46 **host** kernel: [14623.809213] R10: 8806bfb1cd60
R11: 88082d153c01 R12: 88080e88e988
Jul 25 15:20:46 **host** kernel: [14623.809243] R13: 0001
R14: 88080eb874d8 R15: 88080eb875b8
Jul 25 15:20:46 **host** kernel: [14623.809275] FS:
7f2c893b77e0() GS:88083fc4() knlGS:
Jul 25 15:20:46 **host** kernel: [14623.809322] CS:  0010 DS:  ES:
 CR0: 8005003b
Jul 25 15:20:46 **host** kernel: [14623.809350] CR2: ff600400
CR3: 0006bfbc6000 CR4: 07e0
Jul 25 15:20:46 **host** kernel: [14623.809381] DR0: 
DR1:  DR2: 
Jul 25 15:20:46 **host** kernel: [14623.809413] DR3: 
DR6: 0ff0 DR7: 0400
Jul 25 15:20:46 **host** kernel: [14623.809442] Stack:
Jul 25 15:20:46 **host** kernel: [14623.814598]  2201
88080e2c3a30 1000 88042edf2240
Jul 25 15:20:46 **host** kernel: [14623.814656]  0024a05cbb01
 88082e84f348 88080e2c3a58
Jul 25 15:20:46 **host** kernel: [14623.814710]  88080eb874d8
88080e9aa290 88027abc6000 1000
Jul 25 15:20:46 **host** kernel: [14623.814765] Call Trace:
Jul 25 15:20:46 **host** kernel: [14623.814793]  [] ?
rbd_osd_req_format_write+0x81/0x8c [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814827]  [] ?
rbd_img_request_fill+0x679/0x74f [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814865]  [] ?
should_resched+0x5/0x23
Jul 25 15:20:46 **host** kernel: [14623.814896]  [] ?
rbd_request_fn+0x180/0x226 [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814929]  [] ?
__blk_run_queue_uncond+0x1e/0x26
Jul 25 15:20:46 **host** kernel: [14623.814960]  [] ?
blk_queue_bio+0x299/0x2e8
Jul 25 15:20:46 **host** kernel: [14623.814990]  [] ?
generic_make_request+0x96/0xd5
Jul 25 15:20:46 **host** kernel: [14623.815021]  [] ?
submit_bio+0x10a/0x13b
Jul 25 15:20:46 **host** kernel: [14623.815053]  [] ?
bio_alloc_bioset+0xd0/0x172
Jul 25 15:20:46 **host** kernel: [14623.815083]  [] ?
_submit_bh+0x1b7/0x1d4
Jul 25 15:20:46 **host** kernel: [14623.815117]  [] ?
__sync_dirty_buffer+0x4e/0x7b
Jul 25 15:20:46 **host** kernel: [14623.815164]  [] ?
ext4_commit_super+0x192/0x1db [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815206]  [] ?
ext4_setup_super+0xff/0x146 [ext4]
Jul 25 15:20:46 *

Re: [ceph-users] Mounting RBD or CephFS on Ceph-Node?

2013-07-25 Thread Josh Durgin

On 07/23/2013 06:09 AM, Oliver Schulz wrote:

Dear Ceph Experts,

I remember reading that at least in the past I wasn't recommended
to mount Ceph storage on a Ceph cluster node. Given a recent kernel
(3.8/3.9) and sufficient CPU and memory resources on the nodes,
would it now be safe to

* Mount RBD oder CephFS on a Ceph cluster node?


This will probably always be unsafe for kernel clients [1] [2].


* Run a VM that is based on RBD storage (libvirt?) and/or mounts
   CephFS on a Ceph node?


Using libvirt/qemu+librbd or ceph-fuse is fine, since they are
userspace. Using a kernel client inside a VM would work too.
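
For example, on a cluster node the userspace paths would look roughly like this
(the monitor address, pool and image names are placeholders):

# CephFS through the userspace client
ceph-fuse -m 10.0.0.1:6789 /mnt/cephfs

# RBD through qemu/librbd instead of the kernel client
qemu-img info rbd:rbd/myimage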

Josh

[1] http://wiki.ceph.com/03FAQs/01General_FAQ#How_Can_I_Give_Ceph_a_Try.3F
[2] http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/1673
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] add crush rule in one command

2013-07-25 Thread Rongze Zhu
Hi folks,

Recently, I have been using puppet to deploy Ceph and integrate Ceph with OpenStack. We
put compute and storage together in the same cluster, so nova-compute and
OSDs will be on each server. We will create a local pool for each server,
and each pool will only use the disks of its own server. Local pools will be used by
Nova for the root disk and ephemeral disk.

In order to use the local pools, I need to add some rules for them
to ensure that they use only local disks. There is currently only one way to add
a rule in ceph:

   1. ceph osd getcrushmap -o crush-map
   2. crushtool -d crush-map -o crush-map.txt (then edit the rules in the text file)
   3. crushtool -c crush-map.txt -o new-crush-map
   4. ceph osd setcrushmap -i new-crush-map

 If multiple servers set the crush map simultaneously (puppet agents will do
that), there is the possibility of consistency problems. So a command
for adding a rule would be very convenient, such as:

*ceph osd crush add rule -i new-rule-file*

Could I add the command into Ceph?
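
For context, the kind of per-server rule being added through that round-trip
looks roughly like this in the decompiled crush map text (the ruleset number
and host bucket name are only examples):

rule ceph01-local {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take ceph01
        step chooseleaf firstn 0 type osd
        step emit
}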

Cheers,


-- 

Rongze Zhu - 朱荣泽
Email:  zrz...@gmail.com
Blog:http://way4ever.com
Weibo: http://weibo.com/metaxen
Github: https://github.com/zhurongze
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is this HEALTH_WARN indicating?

2013-07-25 Thread Greg Chavez
Any idea how we tweak this?  If I want to keep my ceph node root
volume at 85% used, that's my business, man.

Thanks.

--Greg

On Mon, Jul 8, 2013 at 4:27 PM, Mike Bryant  wrote:
> Run "ceph health detail" and it should give you more information.
> (I'd guess an osd or mon has a full hard disk)
>
> Cheers
> Mike
>
> On 8 July 2013 21:16, Jordi Llonch  wrote:
>> Hello,
>>
>> I am testing ceph using ubuntu raring with ceph version 0.61.4
>> (1669132fcfc27d0c0b5e5bb93ade59d147e23404) on 3 virtualbox nodes.
>>
>> What is this HEALTH_WARN indicating?
>>
>> # ceph -s
>>health HEALTH_WARN
>>monmap e3: 3 mons at
>> {node1=192.168.56.191:6789/0,node2=192.168.56.192:6789/0,node3=192.168.56.193:6789/0},
>> election epoch 52, quorum 0,1,2 node1,node2,node3
>>osdmap e84: 3 osds: 3 up, 3 in
>> pgmap v3209: 192 pgs: 192 active+clean; 460 MB data, 1112 MB used, 135
>> GB / 136 GB avail
>>mdsmap e37: 1/1/1 up {0=node3=up:active}, 1 up:standby
>>
>>
>> Thanks,
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Mike Bryant | Systems Administrator | Ocado Technology
> mike.bry...@ocado.com | 01707 382148 | www.ocadotechnology.com
>
> --
> Notice:  This email is confidential and may contain copyright material of
> Ocado Limited (the "Company"). Opinions and views expressed in this message
> may not necessarily reflect the opinions and views of the Company.
>
> If you are not the intended recipient, please notify us immediately and
> delete all copies of this message. Please note that it is your
> responsibility to scan this message for viruses.
>
> Company reg. no. 3875000.
>
> Ocado Limited
> Titan Court
> 3 Bishops Square
> Hatfield Business Park
> Hatfield
> Herts
> AL10 9NE
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrade from 0.61.4 to 0.61.6 mon failed. Upgrade to 0.61.7 mon still failed.

2013-07-25 Thread Keith Phua
Hi all,

2 days ago, I upgraded one of my mons from 0.61.4 to 0.61.6. The mon failed to 
start.  I checked the mailing list and found reports of mons failing after 
upgrading to 0.61.6.  So I waited for the next release and upgraded the failed 
mon from 0.61.6 to 0.61.7.  My mon still fails to start up.

Here is the mon log:

root@atlas3-c1:/var/log/ceph# tail -100 /var/log/ceph/ceph-mon.atlas3-c1.log
2013-07-26 10:45:56.782321 7fa7df837700  0 cephx: verify_reply coudln't decrypt 
with error: error decoding block for decryption
2013-07-26 10:45:56.782329 7fa7df837700  0 -- 172.18.185.73:6789/0 >> 
172.18.185.79:6789/0 pipe(0x1c91c80 sd=33 :53442 s=1 pgs=0 cs=0 l=0).failed 
verifying authorize reply
2013-07-26 10:45:58.781375 7fa7e123c700  4 mon.atlas3-c1@0(probing) e4 
probe_timeout 0x1c574b0
2013-07-26 10:45:58.781386 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 bootstrap
2013-07-26 10:45:58.781389 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
unregister_cluster_logger - not registered
2013-07-26 10:45:58.781392 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:45:58.781395 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
reset_sync
2013-07-26 10:45:58.781398 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset
2013-07-26 10:45:58.781400 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:45:58.781402 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
timecheck_finish
2013-07-26 10:45:58.781404 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
scrub_reset
2013-07-26 10:45:58.781411 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:45:58.781414 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
reset_probe_timeout 0x1c57440 after 2 seconds
2013-07-26 10:45:58.781424 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 probing 
other monitors
2013-07-26 10:45:58.781833 7fa7df938700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:45:58.781853 7fa7e696c700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:45:58.782037 7fa7dfa39700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:45:58.782165 7fa7df837700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:45:58.782171 7fa7df938700  0 cephx: verify_reply coudln't decrypt 
with error: error decoding block for decryption
2013-07-26 10:45:58.782171 7fa7e696c700  0 cephx: verify_reply coudln't decrypt 
with error: error decoding block for decryption
2013-07-26 10:45:58.782177 7fa7df938700  0 -- 172.18.185.73:6789/0 >> 
172.18.185.78:6789/0 pipe(0x1c91280 sd=33 :40770 s=1 pgs=0 cs=0 l=0).failed 
verifying authorize reply
2013-07-26 10:45:58.782179 7fa7e696c700  0 -- 172.18.185.73:6789/0 >> 
172.18.185.74:6789/0 pipe(0x1c91a00 sd=30 :48828 s=1 pgs=0 cs=0 l=0).failed 
verifying authorize reply
2013-07-26 10:45:58.782399 7fa7dfa39700  0 cephx: verify_reply coudln't decrypt 
with error: error decoding block for decryption
2013-07-26 10:45:58.782418 7fa7dfa39700  0 -- 172.18.185.73:6789/0 >> 
172.18.185.77:6789/0 pipe(0x1c91780 sd=32 :44505 s=1 pgs=0 cs=0 l=0).failed 
verifying authorize reply
2013-07-26 10:45:58.782447 7fa7df837700  0 cephx: verify_reply coudln't decrypt 
with error: error decoding block for decryption
2013-07-26 10:45:58.782455 7fa7df837700  0 -- 172.18.185.73:6789/0 >> 
172.18.185.79:6789/0 pipe(0x1c91c80 sd=31 :53445 s=1 pgs=0 cs=0 l=0).failed 
verifying authorize reply
2013-07-26 10:46:00.733745 7fa7e123c700 11 mon.atlas3-c1@0(probing) e4 tick
2013-07-26 10:46:00.781471 7fa7e123c700  4 mon.atlas3-c1@0(probing) e4 
probe_timeout 0x1c57440
2013-07-26 10:46:00.781479 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 bootstrap
2013-07-26 10:46:00.781481 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
unregister_cluster_logger - not registered
2013-07-26 10:46:00.781483 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:46:00.781486 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
reset_sync
2013-07-26 10:46:00.781488 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset
2013-07-26 10:46:00.781490 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:46:00.781492 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
timecheck_finish
2013-07-26 10:46:00.781495 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
scrub_reset
2013-07-26 10:46:00.781500 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
cancel_probe_timeout (none scheduled)
2013-07-26 10:46:00.781502 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 
reset_probe_timeout 0x1c57590 after 2 seconds
2013-07-26 10:46:00.781511 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 probing 
other monitors
2013-07-26 10:46:00.781984 7fa7dfa39700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:46:00.782005 7fa7e696c700 10 mon.atlas3-c1@0(probing) e4 
ms_get_authorizer for mon
2013-07-26 10:46:00.782204 7fa7df938700 10 mon.atlas3-c1@0(probing) e4 
ms_get_author

Re: [ceph-users] add crush rule in one command

2013-07-25 Thread Gregory Farnum
On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu  wrote:
> Hi folks,
>
> Recently, I use puppet to deploy Ceph and integrate Ceph with OpenStack. We
> put computeand storage together in the same cluster. So nova-compute and
> OSDs will be in each server. We will create a local pool for each server,
> and the pool only use the disks of each server. Local pools will be used by
> Nova for root disk and ephemeral disk.

Hmm, this is constraining Ceph quite a lot; I hope you've thought
about what this means in terms of data availability and even
utilization of your storage. :)

> In order to use the local pools, I need add some rules for the local pools
> to ensure the local pools using only local disks. There is only way to add
> rule in ceph:
>
> ceph osd getcrushmap -o crush-map
> crushtool -c crush-map.txt -o new-crush-map
> ceph osd setcrushmap -i new-crush-map
>
> If multiple servers simultaneously set crush map(puppet agent will do that),
> there is the possibility of consistency problems. So if there is an command
> for adding rule, which will be very convenient. Such as:
>
> ceph osd crush add rule -i new-rule-file
>
> Could I add the command into Ceph?

We love contributions to Ceph, and this is an obvious hole in our
atomic CLI-based CRUSH manipulation which a fix would be welcome for.
Please be aware that there was a significant overhaul to the way these
commands are processed internally between Cuttlefish and
Dumpling-to-be that you'll need to deal with if you want to cross that
boundary. I also recommend looking carefully at how we do the
individual pool changes and how we handle whole-map injection to make
sure the interface you use and the places you do data extraction makes
sense. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is this HEALTH_WARN indicating?

2013-07-25 Thread Gregory Farnum
On Thu, Jul 25, 2013 at 7:42 PM, Greg Chavez  wrote:
> Any idea how we tweak this?  If I want to keep my ceph node root
> volume at 85% used, that's my business, man.

There are config options you can set. On the monitors they are "mon
osd full ratio" and "mon osd nearfull ratio"; on the OSDs you may
(not) want to change "osd failsafe full ratio" and "osd failsafe
nearfull ratio".

However, you should be *extremely careful* modifying these values.
Linux local filesystems don't much like to get this full to begin
with, and if you fill up an OSD enough that the local FS starts
failing to perform writes your cluster will become extremely unhappy.
The OSD works hard to prevent doing permanent damage, but its
prevention mechanisms tend to involve stopping all work. You should
also consider what happens if the cluster is that full and you lose a
node. Recovering from situations where clusters get past these points
tends to involve manually moving data and babysitting things for a
while; the values are as low as they are in order to provide a safety
net in case you actually do hit them.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add crush rule in one command

2013-07-25 Thread Rongze Zhu
On Fri, Jul 26, 2013 at 1:22 PM, Gregory Farnum  wrote:

> On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu 
> wrote:
> > Hi folks,
> >
> > Recently, I use puppet to deploy Ceph and integrate Ceph with OpenStack.
> We
> > put computeand storage together in the same cluster. So nova-compute and
> > OSDs will be in each server. We will create a local pool for each server,
> > and the pool only use the disks of each server. Local pools will be used
> by
> > Nova for root disk and ephemeral disk.
>
> Hmm, this is constraining Ceph quite a lot; I hope you've thought
> about what this means in terms of data availability and even
> utilization of your storage. :)
>

We will also create a global pool for Cinder; the IOPS of the global pool will be
better than the local pools.
The benefit of local pools is reducing the network traffic between servers
and improving the management of storage. We use one same Ceph Gluster for
Nova, Cinder and Glance, and create different pools (and different rules) for
them. Maybe it needs more testing :)


>
> > In order to use the local pools, I need add some rules for the local
> pools
> > to ensure the local pools using only local disks. There is only way to
> add
> > rule in ceph:
> >
> > ceph osd getcrushmap -o crush-map
> > crushtool -c crush-map.txt -o new-crush-map
> > ceph osd setcrushmap -i new-crush-map
> >
> > If multiple servers simultaneously set crush map(puppet agent will do
> that),
> > there is the possibility of consistency problems. So if there is an
> command
> > for adding rule, which will be very convenient. Such as:
> >
> > ceph osd crush add rule -i new-rule-file
> >
> > Could I add the command into Ceph?
>
> We love contributions to Ceph, and this is an obvious hole in our
> atomic CLI-based CRUSH manipulation which a fix would be welcome for.
> Please be aware that there was a significant overhaul to the way these
> commands are processed internally between Cuttlefish and
> Dumpling-to-be that you'll need to deal with if you want to cross that
> boundary. I also recommend looking carefully at how we do the
> individual pool changes and how we handle whole-map injection to make
> sure the interface you use and the places you do data extraction makes
> sense. :)
>

Thank you for your quick reply, it is very useful for me :)


> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>



-- 

Rongze Zhu - 朱荣泽
Email:  zrz...@gmail.com
Blog:http://way4ever.com
Weibo: http://weibo.com/metaxen
Github: https://github.com/zhurongze
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add crush rule in one command

2013-07-25 Thread Rongze Zhu
On Fri, Jul 26, 2013 at 2:27 PM, Rongze Zhu  wrote:

>
>
>
> On Fri, Jul 26, 2013 at 1:22 PM, Gregory Farnum  wrote:
>
>> On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu 
>> wrote:
>> > Hi folks,
>> >
>> > Recently, I use puppet to deploy Ceph and integrate Ceph with
>> OpenStack. We
>> > put computeand storage together in the same cluster. So nova-compute and
>> > OSDs will be in each server. We will create a local pool for each
>> server,
>> > and the pool only use the disks of each server. Local pools will be
>> used by
>> > Nova for root disk and ephemeral disk.
>>
>> Hmm, this is constraining Ceph quite a lot; I hope you've thought
>> about what this means in terms of data availability and even
>> utilization of your storage. :)
>>
>
> We also will create global pool for Cinder, the IOPS of global pool will
> be betther than local pool.
> The benefit of local pool is reducing the network traffic between servers
> and Improving the management of storage. We use one same Ceph Gluster for
> Nova,Cinder,Glance, and create different pools(and diffenrent rules) for
> them. Maybe it need more testing :)
>

s/Gluster/Cluster/g


>
>
>>
>> > In order to use the local pools, I need add some rules for the local
>> pools
>> > to ensure the local pools using only local disks. There is only way to
>> add
>> > rule in ceph:
>> >
>> > ceph osd getcrushmap -o crush-map
>> > crushtool -c crush-map.txt -o new-crush-map
>> > ceph osd setcrushmap -i new-crush-map
>> >
>> > If multiple servers simultaneously set crush map(puppet agent will do
>> that),
>> > there is the possibility of consistency problems. So if there is an
>> command
>> > for adding rule, which will be very convenient. Such as:
>> >
>> > ceph osd crush add rule -i new-rule-file
>> >
>> > Could I add the command into Ceph?
>>
>> We love contributions to Ceph, and this is an obvious hole in our
>> atomic CLI-based CRUSH manipulation which a fix would be welcome for.
>> Please be aware that there was a significant overhaul to the way these
>> commands are processed internally between Cuttlefish and
>> Dumpling-to-be that you'll need to deal with if you want to cross that
>> boundary. I also recommend looking carefully at how we do the
>> individual pool changes and how we handle whole-map injection to make
>> sure the interface you use and the places you do data extraction makes
>> sense. :)
>>
>
> Thank you for your quick reply, it is very useful for me :)
>
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>
>
>
> --
>
> Rongze Zhu - 朱荣泽
> Email:  zrz...@gmail.com
> Blog:http://way4ever.com
> Weibo: http://weibo.com/metaxen
> Github: https://github.com/zhurongze
>



-- 

Rongze Zhu - 朱荣泽
Email:  zrz...@gmail.com
Blog:http://way4ever.com
Weibo: http://weibo.com/metaxen
Github: https://github.com/zhurongze
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com