Re: [ceph-users] Flapping osd / continuously reported as failed
Hi,
We found something else. After osd.72 flapped, one PG ('3.54d') was recovering for a long time.

--
ceph health detail
HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
pg 3.54d is active+recovering, acting [72,108,23]
recovery 1/39821745 degraded (0.000%)
--

The last down/up flap of osd.72 was at 00:45. In the logs we found:
2013-07-24 00:45:02.736740 7f8ac1e04700 0 log [INF] : 3.54d deep-scrub ok
After this time everything is OK.

Is it possible that the reason this osd was flapping was scrubbing?
We have default scrubbing settings (ceph version 0.56.6). If scrubbing is the trouble-maker, can we make it a bit lighter by changing the config?

--
Regards
Dominik

-Original Message- From: Studziński Krzysztof Sent: Wednesday, July 24, 2013 9:48 AM To: Gregory Farnum; Yehuda Sadeh Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec Dominik Subject: RE: [ceph-users] Flapping osd / continuously reported as failed > -Original Message- > From: Studziński Krzysztof > Sent: Wednesday, July 24, 2013 1:18 AM > To: 'Gregory Farnum'; Yehuda Sadeh > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec > Dominik > Subject: RE: [ceph-users] Flapping osd / continuously reported as > failed > > > -Original Message- > > From: Gregory Farnum [mailto:g...@inktank.com] > > Sent: Wednesday, July 24, 2013 12:28 AM > > To: Studziński Krzysztof; Yehuda Sadeh > > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec > > Dominik > > Subject: Re: [ceph-users] Flapping osd / continuously reported as > > failed > > > > On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof > > wrote: > > >> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof > > >> wrote: > > >> > Hi, > > >> > We've got some problem with our cluster - it continuously > > >> > reports > failed > > >> one osd and after auto-rebooting everything seems to work fine > > >> for > some > > >> time (few minutes). CPU util of this osd is max 8%, iostat is > > >> very low. We > > tried > > >> to "ceph osd out" such flapping osd, but after recovering this > > >> behavior returned on different osd. This osd has also much more > > >> read operations > > than > > >> others (see file osd_reads.png linked at the bottom of the email; > > >> at > about > > >> 16:00 we switched off osd.57 and osd.72 started to misbehave. > > >> Osd.108 works while recovering).
> > >> > > > >> > Extract from ceph.log: > > >> > > > >> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : > > >> > [INF] > > osd.72 > > >> 10.177.64.8:6803/22584 boot > > >> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : > > >> > [WRN] > > map > > >> e41730 wrongly marked me down > > >> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : > > >> > [DBG] > > osd.72 > > >> 10.177.64.8:6803/22584 reported failed by osd.9 > > >> 10.177.64.4:6946/5124 > > >> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : > > >> > [DBG] > > osd.72 > > >> 10.177.64.8:6803/22584 reported failed by osd.78 > > >> 10.177.64.5:6854/5604 > > >> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : > > >> > [DBG] > > osd.72 > > >> 10.177.64.8:6803/22584 reported failed by osd.10 > 10.177.64.4:6814/26192 > > >> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : > > >> > [INF] > > osd.72 > > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after > > >> 2013-07-23 > > >> 22:50:43.611939 >= grace 20.00) > > >> > 2013-07-23 22:50:30.367398 7f8adb837700 0 log [WRN] : 3 slow > requests, > > 3 > > >> included below; oldest blocked for > 30.688891 secs > > >> > 2013-07-23 22:50:30.367408 7f8adb837700 0 log [WRN] : slow > > >> > request > > >> 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: > > >> sd_op(client.44290048.0:125899 .dir.4168.2 [call > rgw.bucket_prepare_op] > > >> 3.9447554d) v4 currently no flag points reached > > >> > 2013-07-23 22:50:30.367412 7f8adb837700 0 log [WRN] : slow > > >> > request > > >> 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: > > >> sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] > > 3.9447554d) > > >> v4 currently no flag points reached > > >> > 2013-07-23 22:50:30.367415 7f8adb837700 0 log [WRN] : slow > > >> > request > > >> 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: > > >> sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] > > 3.9447554d) > > >> v4 currently no flag points reached > > >> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : > > >> > [INF] > > osd.72 > > >> 10.177.64.8:6803/22584 boot > > >> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : > > >> > [WRN] > > map > > >> e41738 wrongly marked me down > > >> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : > > >> > [DBG] > > osd.72 > > >> 10.177.64.8:6803/22584 reported failed by osd.20 > > >> 10.177.64.4:6913/4101 > > >> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:678
Re: [ceph-users] v0.61.6 Cuttlefish update released
Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and one osd are still crashing with the error below.

On 2013-07-24 09:57, pe...@2force.nl wrote:

Hi Sage, I just had a 0.61.6 monitor crash and one osd. The mon and all osds restarted just fine after the update, but it decided to crash after 15 minutes or so. See a snippet of the logfile below. I have sent you a link to the logfiles and monitor store. It seems the bug hasn't been fully fixed, or something else is going on. I have to note though that I had one monitor with a clock skew warning for a few minutes (this happened because of a reboot; it was fixed by ntp). So beware when upgrading.

Cheers,

mon:

--- begin dump of recent events --- 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal (Aborted) ** in thread 7f262392e780 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: /usr/bin/ceph-mon() [0x597cfa] 2: (()+0xfcb0) [0x7f2622fc8cb0] 3: (gsignal()+0x35) [0x7f2621b9e425] 4: (abort()+0x17b) [0x7f2621ba1b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d] 6: (()+0xb5846) [0x7f26224ee846] 7: (()+0xb5873) [0x7f26224ee873] 8: (()+0xb596e) [0x7f26224ee96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64ffaf] 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 13: (Monitor::init_paxos()+0xf5) [0x48e7d5] 14: (Monitor::preinit()+0x6ac) [0x4a4e6c] 15: (main()+0x1c19) [0x4835c9] 16: (__libc_start_main()+0xed) [0x7f2621b8976d] 17: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.ceph3.log --- end dump of recent events ---

2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid 19878 2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1 preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fb08d67a780 time 2013-07-24 09:42:57.964379 mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0) ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 4: (Monitor::init_paxos()+0xf5) [0x48e7d5] 5: (Monitor::preinit()+0x6ac) [0x4a4e6c] 6: (main()+0x1c19) [0x4835c9] 7: (__libc_start_main()+0xed) [0x7fb08b8d576d] 8: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
--- begin dump of recent events --- -25> 2013-07-24 09:42:57.933545 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_dump hook 0x1a13010 -24> 2013-07-24 09:42:57.933581 7fb08d67a780 5 asok(0x1a1e000) register_command 1 hook 0x1a13010 -23> 2013-07-24 09:42:57.933584 7fb08d67a780 5 asok(0x1a1e000) register_command perf dump hook 0x1a13010 -22> 2013-07-24 09:42:57.933592 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_schema hook 0x1a13010 -21> 2013-07-24 09:42:57.933595 7fb08d67a780 5 asok(0x1a1e000) register_command 2 hook 0x1a13010 -20> 2013-07-24 09:42:57.933597 7fb08d67a780 5 asok(0x1a1e000) register_command perf schema hook 0x1a13010 -19> 2013-07-24 09:42:57.933601 7fb08d67a780 5 asok(0x1a1e000) register_command config show hook 0x1a13010 -18> 2013-07-24 09:42:57.933604 7fb08d67a780 5 asok(0x1a1e000) register_command config set hook 0x1a13010 -17> 2013-07-24 09:42:57.933606 7fb08d67a780 5 asok(0x1a1e000) register_command log flush hook 0x1a13010 -16> 2013-07-24 09:42:57.933609 7fb08d67a780 5 asok(0x1a1e000) register_command log dump hook 0x1a13010 -15> 2013-07-24 09:42:57.933612 7fb08d67a780 5 asok(0x1a1e000) register_command log reopen hook 0x1a13010 -14> 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process c
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido On 2013-07-24 09:57, pe...@2force.nl wrote: Hi Sage, I just had a 0.61.6 monitor crash and one osd. The mon and all osds restarted just fine after the update but it decided to crash after 15 minutes orso. See a snippet of the logfile below. I have you sent a link to the logfiles and monitor store. It seems the bug hasn't been fully fixed or something else is going on. I have to note though that I had one monitor with a clock skew warning for a few minutes (this happened because of a reboot it was fixed by ntp). So beware when upgrading. Cheers, mon: --- begin dump of recent events --- 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal (Aborted) ** in thread 7f262392e780 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: /usr/bin/ceph-mon() [0x597cfa] 2: (()+0xfcb0) [0x7f2622fc8cb0] 3: (gsignal()+0x35) [0x7f2621b9e425] 4: (abort()+0x17b) [0x7f2621ba1b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d] 6: (()+0xb5846) [0x7f26224ee846] 7: (()+0xb5873) [0x7f26224ee873] 8: (()+0xb596e) [0x7f26224ee96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64ffaf] 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 13: (Monitor::init_paxos()+0xf5) [0x48e7d5] 14: (Monitor::preinit()+0x6ac) [0x4a4e6c] 15: (main()+0x1c19) [0x4835c9] 16: (__libc_start_main()+0xed) [0x7f2621b8976d] 17: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.ceph3.log --- end dump of recent events --- 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid 19878 2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1 preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fb08d67a780 time 2013-07-24 09:42:57.964379 mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0) ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 4: (Monitor::init_paxos()+0xf5) [0x48e7d5] 5: (Monitor::preinit()+0x6ac) [0x4a4e6c] 6: (main()+0x1c19) [0x4835c9] 7: (__libc_start_main()+0xed) [0x7fb08b8d576d] 8: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -25> 2013-07-24 09:42:57.933545 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_dump hook 0x1a13010 -24> 2013-07-24 09:42:57.933581 7fb08d67a780 5 asok(0x1a1e000) register_command 1 hook 0x1a13010 -23> 2013-07-24 09:42:57.933584 7fb08d67a780 5 asok(0x1a1e000) register_command perf dump hook 0x1a13010 -22> 2013-07-24 09:42:57.933592 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_schema hook 0x1a13010 -21> 2013-07-24 09:42:57.933595 7fb08d67a780 5 asok(0x1a1e000) register_command 2 hook 0x1a13010 -20> 2013-07-24 09:42:57.933597 7fb08d67a780 5 asok(0x1a1e000) register_command perf schema hook 0x1a13010 -19> 2013-07-24 09:42:57.933601 7fb08d67a780 5 asok(0x1a1e000) register_command config show hook 0x1a13010 -18> 2013-07-24 09:42:57.933604 7fb08d67a780 5 asok(0x1a1e000) register_command config set hook 0x1a13010 -17> 2013-07-24 09:42:57.933606 7fb08d67a780 5 asok(0x1a1e000) register_command log flush hook 0x1a13010 -16> 2013-07-24 09:42:57.933609 7fb08d67a780 5 asok(0x1a1e00
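For anyone who wants to try the wip-branch packages Wido points to above before the point release lands, the gitbuilder repository can be added as a normal apt source. This is only a sketch: the "precise main" repository layout and the package names below are assumptions inferred from the URL, so verify them against the gitbuilder page for your distribution before using them.

echo deb http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish precise main | sudo tee /etc/apt/sources.list.d/ceph-wip-5737.list
sudo apt-get update
sudo apt-get install --only-upgrade ceph ceph-common
ceph-mon -v    # confirm the installed build before restarting the mons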
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. Regards, On 2013-07-24 09:57, pe...@2force.nl wrote: Hi Sage, I just had a 0.61.6 monitor crash and one osd. The mon and all osds restarted just fine after the update but it decided to crash after 15 minutes orso. See a snippet of the logfile below. I have you sent a link to the logfiles and monitor store. It seems the bug hasn't been fully fixed or something else is going on. I have to note though that I had one monitor with a clock skew warning for a few minutes (this happened because of a reboot it was fixed by ntp). So beware when upgrading. Cheers, mon: --- begin dump of recent events --- 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal (Aborted) ** in thread 7f262392e780 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: /usr/bin/ceph-mon() [0x597cfa] 2: (()+0xfcb0) [0x7f2622fc8cb0] 3: (gsignal()+0x35) [0x7f2621b9e425] 4: (abort()+0x17b) [0x7f2621ba1b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d] 6: (()+0xb5846) [0x7f26224ee846] 7: (()+0xb5873) [0x7f26224ee873] 8: (()+0xb596e) [0x7f26224ee96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64ffaf] 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 13: (Monitor::init_paxos()+0xf5) [0x48e7d5] 14: (Monitor::preinit()+0x6ac) [0x4a4e6c] 15: (main()+0x1c19) [0x4835c9] 16: (__libc_start_main()+0xed) [0x7f2621b8976d] 17: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.ceph3.log --- end dump of recent events --- 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid 19878 2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1 preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fb08d67a780 time 2013-07-24 09:42:57.964379 mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0) ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 4: (Monitor::init_paxos()+0xf5) [0x48e7d5] 5: (Monitor::preinit()+0x6ac) [0x4a4e6c] 6: (main()+0x1c19) [0x4835c9] 7: (__libc_start_main()+0xed) [0x7fb08b8d576d] 8: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -25> 2013-07-24 09:42:57.933545 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_dump hook 0x1a13010 -24> 2013-07-24 09:42:57.933581 7fb08d67a780 5 asok(0x1a1e000) register_command 1 hook 0x1a13010 -23> 2013-07-24 09:42:57.933584 7fb08d67a780 5 asok(0x1a1e000) register_command perf dump hook 0x1a13010 -22> 2013-07-24 09:42:57.933592 7fb08d67a780 5 asok(0x1a1e000) register_command perfcounters_schema hook 0x1a13010 -21> 2013-07-24 09:42:57.933595 7fb08d67a780 5 asok(0x1a1e000) register_command 2 hook 0x1a13010 -20> 2013-07-24 09:42:57.933597 7fb08d
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, On 2013-07-24 09:57, pe...@2force.nl wrote: Hi Sage, I just had a 0.61.6 monitor crash and one osd. The mon and all osds restarted just fine after the update but it decided to crash after 15 minutes orso. See a snippet of the logfile below. I have you sent a link to the logfiles and monitor store. It seems the bug hasn't been fully fixed or something else is going on. I have to note though that I had one monitor with a clock skew warning for a few minutes (this happened because of a reboot it was fixed by ntp). So beware when upgrading. Cheers, mon: --- begin dump of recent events --- 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal (Aborted) ** in thread 7f262392e780 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: /usr/bin/ceph-mon() [0x597cfa] 2: (()+0xfcb0) [0x7f2622fc8cb0] 3: (gsignal()+0x35) [0x7f2621b9e425] 4: (abort()+0x17b) [0x7f2621ba1b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d] 6: (()+0xb5846) [0x7f26224ee846] 7: (()+0xb5873) [0x7f26224ee873] 8: (()+0xb596e) [0x7f26224ee96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64ffaf] 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 13: (Monitor::init_paxos()+0xf5) [0x48e7d5] 14: (Monitor::preinit()+0x6ac) [0x4a4e6c] 15: (main()+0x1c19) [0x4835c9] 16: (__libc_start_main()+0xed) [0x7f2621b8976d] 17: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.ceph3.log --- end dump of recent events --- 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid 19878 2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1 preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fb08d67a780 time 2013-07-24 09:42:57.964379 mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0) ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 4: (Monitor::init_paxos()+0xf5) [0x48e7d5] 5: (Monitor::preinit()+0x6ac) [0x4a4e6c] 6: (main()+0x1c19) [0x4835c9] 7: (__l
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 2013-07-25 12:08, Wido den Hollander wrote: On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, We created this cluster on cuttlefish and not on bobtail so it doesn't apply. I'm not sure if it is clear what I am trying to say or that I'm missing something here but I still see this issue either way :-) I will check out the dev list also but perhaps someone from Inktank can at least look at the files I provided. Peter On 2013-07-24 09:57, pe...@2force.nl wrote: Hi Sage, I just had a 0.61.6 monitor crash and one osd. The mon and all osds restarted just fine after the update but it decided to crash after 15 minutes orso. See a snippet of the logfile below. I have you sent a link to the logfiles and monitor store. It seems the bug hasn't been fully fixed or something else is going on. I have to note though that I had one monitor with a clock skew warning for a few minutes (this happened because of a reboot it was fixed by ntp). So beware when upgrading. 
Cheers, mon: --- begin dump of recent events --- 0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal (Aborted) ** in thread 7f262392e780 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) 1: /usr/bin/ceph-mon() [0x597cfa] 2: (()+0xfcb0) [0x7f2622fc8cb0] 3: (gsignal()+0x35) [0x7f2621b9e425] 4: (abort()+0x17b) [0x7f2621ba1b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d] 6: (()+0xb5846) [0x7f26224ee846] 7: (()+0xb5873) [0x7f26224ee873] 8: (()+0xb596e) [0x7f26224ee96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x64ffaf] 10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77] 11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b] 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617] 13: (Monitor::init_paxos()+0xf5) [0x48e7d5] 14: (Monitor::preinit()+0x6ac) [0x4a4e6c] 15: (main()+0x1c19) [0x4835c9] 16: (__libc_start_main()+0xed) [0x7f2621b8976d] 17: /usr/bin/ceph-mon() [0x485eed] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.ceph3.log --- end dump of recent events --- 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid 19878 2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1 preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fb08d67a780 time 2013-07-24 09:42:57.964379 mon/OSDMonitor.cc: 167: FAI
[ceph-users] A lot of pools?
I am thinking of creating one pool per user (primarily for cephfs; for security, quotas, etc.), i.e. hundreds of pools or even more. But I remember two facts:
1) the manual mentions a slowdown with many pools;
2) something in a recent changelog about hashed pool IDs (?).

What is the current situation with large numbers of pools? And how can serious overhead be avoided?

-- WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 07/25/2013 11:20 AM, pe...@2force.nl wrote: On 2013-07-25 12:08, Wido den Hollander wrote: On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, We created this cluster on cuttlefish and not on bobtail so it doesn't apply. I'm not sure if it is clear what I am trying to say or that I'm missing something here but I still see this issue either way :-) I will check out the dev list also but perhaps someone from Inktank can at least look at the files I provided.

Peter,

We did take a look at your files (thanks a lot btw!), and as of last night's patches (which are now on the cuttlefish branch), your store worked just fine.

As Sage mentioned on ceph-devel, one of the issues would only happen on a bobtail -> cuttlefish cluster. That is not your issue though. I believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one reported on #5737. Your issue, however, was caused by a bug in a patch meant to fix #5704: it caused an on-disk key to be erroneously updated with a value for a version that did not yet exist at the time update_from_paxos() was called.

In a nutshell, one of the latest patches (see 115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it. A point-release should come out soon, but in the meantime the cuttlefish branch should be safe to use. If you run into any other issues, please let us know.

-Joao

-- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
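Since which branch a given build comes from becomes a point of confusion in the follow-up messages, one way to check whether a package build contains the two fixes Joao mentions is to ask git which branches contain those commits. This is only a sketch and assumes a local clone of ceph.git; the commit IDs are the ones quoted above.

git fetch origin
git branch -r --contains 115468c73f121653eec2efc030d5ba998d834e43
git branch -r --contains 27f31895664fa7f10c1617d486f2a6ece0f97091
# both should list origin/cuttlefish, since Joao says the fixes are now on that branch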
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 2013-07-25 15:21, Joao Eduardo Luis wrote: On 07/25/2013 11:20 AM, pe...@2force.nl wrote: On 2013-07-25 12:08, Wido den Hollander wrote: On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, We created this cluster on cuttlefish and not on bobtail so it doesn't apply. I'm not sure if it is clear what I am trying to say or that I'm missing something here but I still see this issue either way :-) I will check out the dev list also but perhaps someone from Inktank can at least look at the files I provided. Peter, We did take a look at your files (thanks a lot btw!), and as of last night's patches (which are now on the cuttlefish branch), your store worked just fine. As Sage mentioned on ceph-devel, one of the issues would only happen on a bobtail -> cuttlefish cluster. That is not your issue though. I believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one reported on #5737. Your issue however was caused by a bug on a patch meant to fix #5704. It made an on-disk key to be updated erroneously with a value for a version that did not yet existed at the time update_from_paxos() was called. In a nutshell, one of the latest patches (see 115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it. A point-release should come out soon, but in the mean time the cuttlefish branch should be safe to use. If you run into any other issues, please let us know. -Joao Hi Joao, I installed the packages from that branch but I still see the same crashes: root@ceph3:~/ceph# ceph-mon -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) root@ceph3:~/ceph# ceph-osd -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) Both monitor and one of three osds (on that host) still crash on startup. I must be doing something wrong if it works for you... 
OSD: --- begin dump of recent events --- 0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal (Aborted) ** in thread 7f8172241700 ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) 1: /usr/bin/ceph-osd() [0x79430a] 2: (()+0xfcb0) [0x7f81833e1cb0] 3: (gsignal()+0x35) [0x7f81814af425] 4: (abort()+0x17b) [0x7f81814b2b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d] 6: (()+0xb5846) [0x7f8181dff846] 7: (()+0xb5873) [0x7f8181dff873] 8: (()+0xb596e) [0x7f8181dff96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x84618f] 10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48] 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set, std::less >, std::allocator > >*)+0x11d) [0x63d77d] 12: (OSD::process_peering_events(std::list > const&, ThreadPool::TPHandle&)+0x244) [0x63ded4] 13: (OSD::PeeringWQ::_process(std::list > const&, ThreadPool::TPHandle&)+0x12) [0x678c52] 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83b5c6] 15: (ThreadPool::WorkThread::entry()+0x10) [0x83d3f0] 16: (()+0x7e9a) [0x7f81833d9e9a] 17: (clone()+0x6d) [0x7f818156cccd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locke
[ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster
Hi List,

I've been having issues getting mons deployed following the ceph-deploy instructions here[0]. My steps were:

$ ceph-deploy new host{1..3}
$ vi ceph.conf  # Add in public network/cluster network details, as well as change the mon IPs to those on the correct interface
$ ceph-deploy install host{1..3}
$ ceph-deploy mon create host{1..3}

The next step would be to run "ceph-deploy gatherkeys host1", but this fails, as the /var/lib/ceph/bootstrap-{osd,mds} directories are both empty. Checking the logs in /var/log/ceph uncovers an assertion failure as in the referenced bugs[1][2], which ought to have been fixed in version 0.61.5. I have version 0.61.6 running on Ubuntu 13.04 hosts, so I'm at a loss for why this is happening. I've tried with and without the "public network" variable being set, but it fails in the same way either way round.

Any help much appreciated,
Josh
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 07/25/2013 02:39 PM, pe...@2force.nl wrote: On 2013-07-25 15:21, Joao Eduardo Luis wrote: On 07/25/2013 11:20 AM, pe...@2force.nl wrote: On 2013-07-25 12:08, Wido den Hollander wrote: On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, We created this cluster on cuttlefish and not on bobtail so it doesn't apply. I'm not sure if it is clear what I am trying to say or that I'm missing something here but I still see this issue either way :-) I will check out the dev list also but perhaps someone from Inktank can at least look at the files I provided. Peter, We did take a look at your files (thanks a lot btw!), and as of last night's patches (which are now on the cuttlefish branch), your store worked just fine. As Sage mentioned on ceph-devel, one of the issues would only happen on a bobtail -> cuttlefish cluster. That is not your issue though. I believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one reported on #5737. Your issue however was caused by a bug on a patch meant to fix #5704. It made an on-disk key to be updated erroneously with a value for a version that did not yet existed at the time update_from_paxos() was called. In a nutshell, one of the latest patches (see 115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it. A point-release should come out soon, but in the mean time the cuttlefish branch should be safe to use. If you run into any other issues, please let us know. -Joao Hi Joao, I installed the packages from that branch but I still see the same crashes: root@ceph3:~/ceph# ceph-mon -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) root@ceph3:~/ceph# ceph-osd -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) Both monitor and one of three osds (on that host) still crash on startup. I must be doing something wrong if it works for you... Yep. 
Your monitors are on the wrong branch. 28720b0b4d55ef98f3b7d0855b18339e75f759e3 is wip-5737-cuttlefish's head. That branch lacks an essential patch. You should be running on the cuttlefish branch instead (24a56a9637afd8c64b71d264359c78a25d52be02). -Joao OSD: --- begin dump of recent events --- 0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal (Aborted) ** in thread 7f8172241700 ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) 1: /usr/bin/ceph-osd() [0x79430a] 2: (()+0xfcb0) [0x7f81833e1cb0] 3: (gsignal()+0x35) [0x7f81814af425] 4: (abort()+0x17b) [0x7f81814b2b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d] 6: (()+0xb5846) [0x7f8181dff846] 7: (()+0xb5873) [0x7f8181dff873] 8: (()+0xb596e) [0x7f8181dff96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x84618f] 10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48] 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set, std::less >, std::allocator > >*)+0x11d) [0x63d77d] 12: (OSD::process_peering_events(std::list > const&, ThreadPool::TPHandle&)+0x244) [0x63ded4] 13: (OSD::PeeringWQ::_process(std::list > const&, ThreadPool::TPHandle&)+0x12) [0x678c52] 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83b5c6] 15: (ThreadPool::WorkThread::entry()+0x10)
Re: [ceph-users] v0.61.6 Cuttlefish update released
On 2013-07-25 15:55, Joao Eduardo Luis wrote: On 07/25/2013 02:39 PM, pe...@2force.nl wrote: On 2013-07-25 15:21, Joao Eduardo Luis wrote: On 07/25/2013 11:20 AM, pe...@2force.nl wrote: On 2013-07-25 12:08, Wido den Hollander wrote: On 07/25/2013 12:01 PM, pe...@2force.nl wrote: On 2013-07-25 11:52, Wido den Hollander wrote: On 07/25/2013 11:46 AM, pe...@2force.nl wrote: Any news on this? I'm not sure if you guys received the link to the log and monitor files. One monitor and osd is still crashing with the error below. I think you are seeing this issue: http://tracker.ceph.com/issues/5737 You can try with new packages from here: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/ That should resolve it. Wido Hi Wido, This is the same issue I reported earlier with 0.61.5. I applied the above package and the problem was solved. Then 0.61.6 was released with a fix for this issue. I installed 0.61.6 and the issue is back on one of my monitors and I have one osd crashing. So, it seems the bug is still there in 0.61.6 or it is a new bug. It seems the guys from Inktank haven't picked this up yet. It has been picked up, Sage mentioned this yesterday on the dev list: "This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail-> cuttlefish but not clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes. I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future." Wido Regards, We created this cluster on cuttlefish and not on bobtail so it doesn't apply. I'm not sure if it is clear what I am trying to say or that I'm missing something here but I still see this issue either way :-) I will check out the dev list also but perhaps someone from Inktank can at least look at the files I provided. Peter, We did take a look at your files (thanks a lot btw!), and as of last night's patches (which are now on the cuttlefish branch), your store worked just fine. As Sage mentioned on ceph-devel, one of the issues would only happen on a bobtail -> cuttlefish cluster. That is not your issue though. I believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one reported on #5737. Your issue however was caused by a bug on a patch meant to fix #5704. It made an on-disk key to be updated erroneously with a value for a version that did not yet existed at the time update_from_paxos() was called. In a nutshell, one of the latest patches (see 115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it. A point-release should come out soon, but in the mean time the cuttlefish branch should be safe to use. If you run into any other issues, please let us know. -Joao Hi Joao, I installed the packages from that branch but I still see the same crashes: root@ceph3:~/ceph# ceph-mon -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) root@ceph3:~/ceph# ceph-osd -v ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) Both monitor and one of three osds (on that host) still crash on startup. 
I must be doing something wrong if it works for you... Yep. Your monitors are on the wrong branch. 28720b0b4d55ef98f3b7d0855b18339e75f759e3 is wip-5737-cuttlefish's head. That branch lacks an essential patch. You should be running on the cuttlefish branch instead (24a56a9637afd8c64b71d264359c78a25d52be02). -Joao Ah yes, I see now. Ok, this worked for the mon, it is running again. The osd is still crashing, though. Any ideas on that? root@ceph3:~/ceph# ceph-osd -v ceph version 0.61.6-15-g24a56a9 (24a56a9637afd8c64b71d264359c78a25d52be02) root@ceph3:~/ceph# ceph-mon -v ceph version 0.61.6-15-g24a56a9 (24a56a9637afd8c64b71d264359c78a25d52be02) OSD: --- begin dump of recent events --- 0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal (Aborted) ** in thread 7f8172241700 ceph version 0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3) 1: /usr/bin/ceph-osd() [0x79430a] 2: (()+0xfcb0) [0x7f81833e1cb0] 3: (gsignal()+0x35) [0x7f81814af425] 4: (abort()+0x17b) [0x7f81814b2b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d] 6: (()+0xb5846) [0x7f8181dff846] 7: (()+0xb5873) [0x7f8181dff873] 8: (()+0xb596e) [0x7f8181dff96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x84618f] 10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48] 11: (OSD::a
Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster
Links I forgot to include the first time:

[0] http://ceph.com/docs/master/rados/deployment/ceph-deploy-install/
[1] http://tracker.ceph.com/issues/5195
[2] http://tracker.ceph.com/issues/5205

Apologies for the noise,
Josh
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] testing ceph - very slow write performances
Hi ceph-users,

I'm currently evaluating ceph for a project and I'm getting quite low write performance, so if you have time please read this post and give me some advice :)

My test setup, using some free hardware we have lying around in our datacenter: three ceph server nodes (each one running a monitor and two OSDs) and one client node.

Hardware of a node: (supermicro stuff)
Intel(R) Xeon(R) CPU X3440 @ 2.53GHz (total of 8 logical cores)
2 x Western Digital Caviar Black 1 TB (WD1003FBYX-01Y7B0)
32 GB RAM DDR3
2 x Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

Hardware of the client: (a Dell Blade M610)
Dual Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (total of 16 logical cores)
64 GB RAM DDR3
4 x Ethernet controller: Broadcom Corporation NetXtreme II BCM5709S Gigabit Ethernet (rev 20)
2 x Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 10-Gigabit PCIe

OS of the server nodes: Ubuntu 12.04.2 LTS, Kernel 3.10.0-031000-generic #201306301935 SMP Sun Jun 30 23:36:16 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
OS of the client node: CentOS release 6.4, Kernel 3.10.1-1.el6xen.x86_64 #1 SMP Sun Jul 14 11:05:42 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

How I set up the OS (server nodes): I know this isn't good, but as there are only two disks in the machine I've partitioned the disks and used them both for the OS and the OSDs; for a test run it shouldn't be that bad...

Disk layout:
partition 1: mdadm raid 1 member for the OS (30gb)
partition 2: mdadm raid 1 member for some swapspace (shouldn't be used anyway...)
partition 3: reserved for an xfs partition for the OSDs

Ceph installation: tried both cuttlefish (0.56) and testing (0.66). Deployed using ceph-deploy from an admin node running on a xenserver 6.2 VM.

#ceph-deploy new ceph01 ceph02 ceph03
(edited some ceph.conf stuff)
#ceph-deploy install --stable cuttlefish ceph01 ceph02 ceph03
#ceph-deploy mon create ceph01 ceph02 ceph03
#ceph-deploy gatherkeys ceph01
#ceph-deploy osd create ceph01:/dev/sda3 ceph01:/dev/sdb3 ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3
#ceph-deploy osd activate ceph01:/dev/sda3 ceph01:/dev/sdb3 ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3

ceph-admin:~/cephstore$ ceph status
health HEALTH_OK
monmap e1: 3 mons at {ceph01=10.111.80.1:6789/0,ceph02=10.111.80.2:6789/0,ceph03=10.111.80.3:6789/0}, election epoch 6, quorum 0,1,2 ceph01,ceph02,ceph03
osdmap e26: 6 osds: 6 up, 6 in
pgmap v258: 192 pgs: 192 active+clean; 1000 MB data, 62212 MB used, 5346 GB / 5407 GB avail
mdsmap e1: 0/0/1 up

Now let's do some performance testing from the client, accessing an rbd on the cluster.
#rbd create test --size 2
#rbd map test

raw write test (ouch, something is wrong here)
#dd if=/dev/zero of=/dev/rbd1 bs=1024k count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 146.051 s, 7.2 MB/s

raw read test (this seems quite ok for a gbit network)
#dd if=/dev/rbd1 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6368 s, 76.9 MB/s

Trying to find the bottleneck: network testing between client and nodes (not 100% efficiency, but not that bad)
[ 3] local 10.111.80.1 port 37497 connected with 10.111.10.105 port 5001
[ 3] 0.0-10.0 sec 812 MBytes 681 Mbits/sec
[ 3] local 10.111.80.2 port 55912 connected with 10.111.10.105 port 5001
[ 3] 0.0-10.0 sec 802 MBytes 673 Mbits/sec
[ 3] local 10.111.80.3 port 45188 connected with 10.111.10.105 port 5001
[ 3] 0.0-10.1 sec 707 MBytes 589 Mbits/sec
[ 3] local 10.111.10.105 port 43103 connected with 10.111.80.1 port 5001
[ 3] 0.0-10.2 sec 730 MBytes 601 Mbits/sec
[ 3] local 10.111.10.105 port 44656 connected with 10.111.80.2 port 5001
[ 3] 0.0-10.0 sec 871 MBytes 730 Mbits/sec
[ 3] local 10.111.10.105 port 40455 connected with 10.111.80.3 port 5001
[ 3] 0.0-10.0 sec 1005 MBytes 843 Mbits/sec

Disk throughput on the ceph nodes:
/var/lib/ceph/osd/ceph-0$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 7.96581 s, 132 MB/s
/var/lib/ceph/osd/ceph-1$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 7.91835 s, 132 MB/s
/var/lib/ceph/osd/ceph-2$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 7.55287 s, 139 MB/s
/var/lib/ceph/osd/ceph-3$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 7.67281 s, 137 MB/s
/var/lib/ceph/osd/ceph-4$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 8.13862 s, 129 MB/s
/var/lib/ceph/osd/ceph-5$ sudo dd if=/dev/zero of=test bs=1024k count=1000 oflag=direct
1048576000 bytes (1.0 GB) copied, 7.72034 s, 136 MB/s

At this point I don't know what else to check. So let me ask if that
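One way to narrow down whether the slowness is in RADOS itself or in the rbd/kernel-client path is to benchmark the pool directly with rados bench, which bypasses the rbd device entirely. The commands below are only a sketch: they assume the image lives in the default "rbd" pool, and option support should be checked against the installed version. It is also worth noting that dd with oflag=direct issues one synchronous 1 MB write at a time, so it largely measures per-request latency (journal plus replication round trips on the same spindles) rather than achievable throughput.

# Write benchmark straight against the pool, 16 concurrent ops (the rados bench default)
rados -p rbd bench 30 write -t 16
# A single-threaded run is closer to what dd with oflag=direct is doing
rados -p rbd bench 30 write -t 1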
Re: [ceph-users] A lot of pools?
On Thursday, July 25, 2013, Dzianis Kahanovich wrote:
> I think to make pool-per-user (primary for cephfs; for security, quota, etc), hundreds (or even more) of them. But I remember 2 facts:
> 1) info in manual about slowdown on many pools;

Yep, this is still a problem; pool-per-user isn't going to work well unless your users are covering truly prodigious amounts of space. There is a new feature in raw RADOS that lets you specify a separate "namespace" and set access capabilities based on those; it is being worked up through the rest of the stack now.

> 2) something in later changelog about hashed pool IDs (?).

This doesn't impact things one way or the other.
-Greg

> How about now and numbers of pools?
> And how to avoid serious overheads?
>
> --
> WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- Software Engineer #42 @ http://inktank.com | http://ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster
On Thu, 25 Jul 2013, Josh Holland wrote: > Hi List, > > I've been having issues getting mons deployed following the > ceph-deploy instructions here[0]. My steps were: > > $ ceph-deploy new host{1..3} > $ vi ceph.conf # Add in public network/cluster network details, as > well as change the mon IPs to those on the correct interface > $ ceph-deploy install host{1..3} > $ ceph-deploy mon create host{1..3} > > The next step would be to run "ceph-deploy gatherkeys host1", but this > fails, as the /var/lib/ceph/bootstrap-{osd,mds} directories are both > empty. Checking the logs in /var/log/ceph uncovers an assertion > failure as in the referenced bugs[1][2], which ought to have been > fixed in version 0.61.5. I have version 0.61.6 running on Ubuntu 13.04 > hosts, so I'm at a loss for why this is happening. I've tried with and > without the "public network" variable being set, but it fails in the > same way either way round. I suspect the difference here is that the dns names you are specifying in ceph-deploy new do not match. Are you adjusting the 'mon host' line in ceph.conf? Note that you can specify a fqdn to ceph-deploy new and it will take the first name to be the hostname, or you can specify 'ceph-deploy new name:fqdn_or_ip'. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
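To make Sage's suggestion concrete, here is a small sketch of the two forms he describes. The hostnames and the address below are placeholders, not values from the original report:

$ ceph-deploy new host1 host2 host3
  # each bare name must resolve to the address you want the mon to bind to
$ ceph-deploy new host1:host1.example.com host2:host2.example.com host3:10.0.0.3
  # name:fqdn_or_ip form; the part before the colon becomes the mon name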
Re: [ceph-users] [Xen-API] The vdi is not available
mount.nfs 10.254.253.9:/xen/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 -o soft,timeo=133,retrans=2147483647,tcp,noac this gives mount -o doesnt exist Moya Solutions, Inc. am...@moyasolutions.com 0 | 646-918-5238 x 102 F | 646-390-1806 - Original Message - From: "Sébastien RICCIO" To: "Andres E. Moya" , xen-...@lists.xen.org Sent: Thursday, July 25, 2013 1:08:02 PM Subject: Re: [Xen-API] The vdi is not available I don't get why it's not mounting with the uuid subdir. It should. On our pool: Jul 25 10:13:39 xen-blade10 SM: [30890] ['mount.nfs', '10.50.50.11:/storage/nfs1/cc744878-9d79-37df-98cb-cd88eebdab61', '/var/run/sr-mount/cc744878-9d79-37df-98cb-cd88eebdab61', '-o', 'soft,timeo=133,retrans=2147483647,tcp,actimeo=0'] as a temporary dirty fix you could try: umount /var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 mount.nfs 10.254.253.9:/xen/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 -o soft,timeo=133,retrans=2147483647,tcp,noac to manually remount it correctly On 25.07.2013 18:48, Andres E. Moya wrote: I restarted and tried to unplug and got the same message, here is the grep [root@nj-xen-04 ~]# grep mount.nfs /var/log/SMlog [31636] 2013-07-24 16:43:54.140961 ['mount.nfs', '10.254.253.9:/secondary', '/var/run/sr-mount/f21def12-74a2-8fab-1e1c-f41968e889bb', '-o', 'soft,timeo=133,retrans=2147483647,tcp,noac'] [9277] 2013-07-25 12:36:42.416286 ['mount.nfs', '10.254.253.9:/iso', '/var/run/sr-mount/fbfbf5b3-a37a-288a-86aa-d8d168173f98', '-o', 'soft,timeo=133,retrans=2147483647,tcp,noac'] [9393] 2013-07-25 12:36:43.241531 ['mount.nfs', '10.254.253.9:/xen', '/var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206', '-o', 'soft,timeo=133,retrans=2147483647,tcp,noac'] - Original Message - From: "Sébastien RICCIO" To: "Andres E. Moya" , xen-...@lists.xen.org Sent: Thursday, July 25, 2013 12:29:24 PM Subject: Re: [Xen-API] The vdi is not available Okay, in this case try to reboot the server, and take a look if it fixed the mount. If not you should "grep mount.nfs /var/log/SMlog" and look what command line XS use to mount your storage. On 25.07.2013 18:22, Andres E. Moya wrote: there are no tasks/ returns empty Moya Solutions, Inc. am...@moyasolutions.com 0 | 646-918-5238 x 102 F | 646-390-1806 - Original Message - From: "Sébastien RICCIO" To: "Andres E. Moya" , xen-...@lists.xen.org Sent: Thursday, July 25, 2013 12:20:05 PM Subject: Re: [Xen-API] The vdi is not available xe task-list uuid=9c7b7690-a301-41ef-b7d5-d4abd8b70fbc If it returns something xe task-cancel uuid=9c7b7690-a301-41ef-b7d5-d4abd8b70fbc then try again to unplug the pbd OR if nothing is running on the server, consider trying a reboot Sorry this is hard to debug remotely. On 25.07.2013 18:10, Andres E. Moya wrote: xe pbd-unplug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7 This operation cannot be performed because this VDI is in use by some other operation vdi: 96c158d3-2b31-41d1-8287-aa9fb6d5eb6c (Windows Server 2003 0) operation: 9c7b7690-a301-41ef-b7d5-d4abd8b70fbc (Windows 7 (64-bit) (1) 0) : 405f6cce-d750-47e1-aec3-c8f8f3ae6290 (Plesk Management 0) : dad9b85a-ee2f-4b48-94f0-79db8dfd78dd (mx5 0) : 13b558f8-0c3f-4df9-8766-d8e1306b25d5 (Windows Server 2008 R2 (64-bit) (1) 0) this was done on the server that has nothing running on it Moya Solutions, Inc. am...@moyasolutions.com 0 | 646-918-5238 x 102 F | 646-390-1806 - Original Message - From: "Sébastien RICCIO" To: "Andres E. Moya" Cc: xen-...@lists.xen.org Sent: Thursday, July 25, 2013 12:02:12 PM Subject: Re: [Xen-API] The vdi is not available This looks correct. 
You should maybe try to unplug / replug the storage on server where it's wrong. for example if it's on nj-xen-03: pbd-unplug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7 then pbd-plug uuid=a0739a97-408b-afed-7ac2-fe76ffec3ee7 and check if it's then mounted the right way. On 25.07.2013 17:36, Andres E. Moya wrote: [root@nj-xen-01 ~]# xe pbd-list sr-uuid=9f9aa794-86c0-9c36-a99d-1e5fdc14a206 uuid ( RO) : c53d12f6-c3a6-0ae2-75fb-c67c761b2716 host-uuid ( RO): b8ca0c69-6023-48c5-9b61-bd5871093f4e sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206 device-config (MRO): serverpath: /xen; options: ; server: 10.254.253.9 currently-attached ( RO): true uuid ( RO) : a0739a97-408b-afed-7ac2-fe76ffec3ee7 host-uuid ( RO): a464b853-47d7-4756-b9ab-49cb00c5aebb sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206 device-config (MRO): serverpath: /xen; options: ; server: 10.254.253.9 currently-attached ( RO): true uuid ( RO) : 6f2c0e7d-fdda-e406-c2e1-d4ef81552b17 host-uuid ( RO): dab9cd1a-7ca8-4441-a78f-445580d851d2 sr-uuid ( RO): 9f9aa794-86c0-9c36-a99d-1e5fdc14a206 device-config (MRO): serverpath: /xen; options: ; server: 10.
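For reference, the "dirty fix" above also needs a local mount point; the command as pasted earlier in the thread appears to omit it, which would explain the "mount -o doesn't exist" style error. A sketch of the manual remount, using the SR uuid and NFS server from this thread (run only on the affected host, with nothing actively using the SR):

umount /var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206
mount.nfs 10.254.253.9:/xen/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 /var/run/sr-mount/9f9aa794-86c0-9c36-a99d-1e5fdc14a206 -o soft,timeo=133,retrans=2147483647,tcp,noac
mount | grep 9f9aa794    # check the SR is now mounted from the uuid subdirectory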
Re: [ceph-users] Flapping osd / continuously reported as failed
On Thu, Jul 25, 2013 at 12:47 AM, Mostowiec Dominik wrote: > Hi > We found something else. > After osd.72 flapp, one PG '3.54d' was recovering long time. > > -- > ceph health details > HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%) > pg 3.54d is active+recovering, acting [72,108,23] > recovery 1/39821745 degraded (0.000%) > -- > > Last flap down/up osd.72 was 00:45. > In logs we found: > 2013-07-24 00:45:02.736740 7f8ac1e04700 0 log [INF] : 3.54d deep-scrub ok > After this time is ok. > > It is possible that reason of flapping this osd was scrubbing? > > We have default scrubbing settings (ceph version 0.56.6). > If scrubbig is the trouble-maker, can we make it a little more light by > changing config? It's possible, as deep scrub in particular will add a bit of load (it goes through and compares the object contents). Are you not having any flapping issues any more, and did you try and find when it started the scrub to see if it matched up with your troubles? I'd be hesitant to turn it off as scrubbing can uncover corrupt objects etc, but you can configure it with the settings at http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing. (Always check the surprisingly-helpful docs when you need to do some config or operations work!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
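For the scrub tuning question above, a minimal sketch of the kind of settings documented on that page, assuming the option names used in the bobtail/cuttlefish era; the values are illustrative only, not recommendations:

[osd]
    # only one scrub per OSD at a time (this is the default)
    osd max scrubs = 1
    # skip non-forced scrubs while the host load is above this
    osd scrub load threshold = 0.5
    # widen the window between light scrubs of each PG
    osd scrub min interval = 86400
    osd scrub max interval = 604800
    # deep scrubs (full object-data comparison) are the expensive ones
    osd deep scrub interval = 604800

The same options can usually also be injected into running OSDs without a restart (the exact injectargs syntax varies a little between releases), so a change can be tried before committing it to ceph.conf.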
Re: [ceph-users] v0.67-rc2 dumpling release candidate
Am Mittwoch, 24. Juli 2013, 22:45:55 schrieb Sage Weil: > Go forth and test! I just upgraded a 0.61.7 cluster to 0.67-rc2. I restarted the mons first, and as expected, they did not join a quorum with the 0.61.7 mons, but after all of the mons were restarted, there was no problem any more. One of my three osds would not come back online after the upgrade, but that is probably just this btrfs bug again: https://bugzilla.kernel.org/show_bug.cgi?id=60603 I will restart the machine tomorrow and see if it comes back. There was one qemu client running at the time of the update doing active IO, and apart from a temporary dip in performance, it was not affected. The whole thing was on Fedora 18, using Kernel 3.9.10-200.fc18.x86_64 and the repository under http://eu.ceph.com/rpm-testing/fc18/x86_64/, and btrfs as the OSD filesystem. Guido ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
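The upgrade order described above (all monitors first, then the OSDs) is the usual rolling-upgrade sequence; a rough sketch for a sysvinit-managed cluster of that era, with service names assumed and quorum/health checked between steps:

# on each monitor host, one at a time
service ceph restart mon
ceph -s        # wait until all mons are back in quorum before the next host

# then on each OSD host, one at a time
service ceph restart osd
ceph -s        # wait for recovery to settle before moving on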
Re: [ceph-users] ceph monitors stuck in a loop after install with ceph-deploy
On Wed, 24 Jul 2013, pe...@2force.nl wrote: > On 2013-07-24 07:19, Sage Weil wrote: > > On Wed, 24 Jul 2013, S?bastien RICCIO wrote: > > > > > > Hi! While trying to install ceph using ceph-deploy the monitors nodes are > > > stuck waiting on this process: > > > /usr/bin/python /usr/sbin/ceph-create-keys -i a (or b or c) > > > > > > I tried to run mannually the command and it loops on this: > > > connect to /var/run/ceph/ceph-mon.a.asok failed with (2) No such file or > > > directory > > > INFO:ceph-create-keys:ceph-mon admin socket not ready yet. > > > But the existing sock on the nodes are /var/run/ceph/ceph-mon.ceph01.asok > > > > > > Is that a bug in ceph-deploy or maybe my config file is wrong ? > > > > It's the config file. You no longer need to (or should) enumerate the > > daemons in the config file; the sysvinit/upstart scripts find them in > > /var/lib/ceph/{osd,mon,mds}/*. See below: > > > > Hi Sage, > > Does this also apply if you didn't use ceph-deploy (and used the same > directories for mon, osd etc)? Just curious if there are still any > dependencies or if you still need to list those on clients for instance. If you are using ceph-deploy, we touch a file 'sysvinit' or 'upstart' in /var/lib/ceph/osd/*/ that indicates that init system is responsible for that daemon. If it is not present, the scan of those directories on startup will ignore it. In the mkcephfs case, those files aren't present, and you need to instead explicitly enumerate the daemons in ceph.conf with [osd.N] sections and host = foo lines. That will make the sysvinit script start/stop the daemons. So, sysvinit: */sysvinit file or listed in ceph.conf. upstart: */upstart file. Hope that helps! sage > > Cheers, > > Peter > > > > > Version: ceph -v > > > ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35) > > > > > > Note that using "ceph" command line utility on the nodes is working. So it > > > looks that it know the good paths... > > > > > > Config file: > > > > > > [global] > > > fsid = a1394dff-94da-4ef4-a123-55d85e839ffb > > > mon_initial_members = ceph01, ceph02, ceph03 > > > mon_host = 10.111.80.1,10.111.80.2,10.111.80.3 > > > auth_supported = cephx > > > osd_journal_size = 1 > > > filestore_xattr_use_omap = true > > > auth_cluster_required = none > > > auth_service_required = none > > > auth_client_required = none > > > > > > [client] > > > rbd_cache = true > > > rbd_cache_size = 536870912 > > > rbd_cache_max_dirty = 134217728 > > > rbd_cache_target_dirty = 33554432 > > > rbd_cache_max_dirty_age = 5 > > > > > > [osd] > > > osd_data = /var/lib/ceph/osd/ceph-$id > > > osd_journal = /var/lib/ceph/osd/ceph-$id/journal > > > osd_journal_size = 1 > > > osd_mkfs_type = xfs > > > osd_mkfs_options_xfs = "-f -i size=2048" > > > osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k" > > > keyring = /var/lib/ceph/osd/ceph-$id/keyring.osd.$id > > > osd_op_threads = 24 > > > osd_disk_threads = 24 > > > osd_recovery_max_active = 1 > > > journal_dio = true > > > journal_aio = true > > > filestore_max_sync_interval = 100 > > > filestore_min_sync_interval = 50 > > > filestore_queue_max_ops = 2000 > > > filestore_queue_max_bytes = 536870912 > > > filestore_queue_committing_max_ops = 2000 > > > filestore_queue_committing_max_bytes = 536870912 > > > osd_max_backfills = 1 > > > > Just drop everything from here... 
> > > > > > > > [osd.0] > > > host = ceph01 > > > > > > [osd.1] > > > host = ceph01 > > > > > > [osd.2] > > > host = ceph02 > > > > > > [osd.3] > > > host = ceph02 > > > > > > [osd.4] > > > host = ceph03 > > > > > > [osd.5] > > > host = ceph03 > > > > > > [mon.a] > > > host = ceph01 > > > > > > [mon.b] > > > host = ceph02 > > > > > > [mon.c] > > > host = ceph03 > > > > ...to here! > > > > sage > > > > > > > > > > > > Cheers, > > > S?bastien > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
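A small sketch of the difference Sage describes, with paths from the standard /var/lib/ceph layout (directory contents abbreviated):

# ceph-deploy / init-managed daemon: a marker file is enough,
# and no [osd.N] section is needed in ceph.conf
ls /var/lib/ceph/osd/ceph-3/
    current  fsid  keyring  ...  sysvinit      # or "upstart" on Ubuntu

# mkcephfs-style cluster: no marker file, so the daemon must be
# enumerated in ceph.conf instead, e.g.
[osd.3]
    host = ceph02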
Re: [ceph-users] ceph-deploy and bugs 5195/5205: mon.host1 does not exist in monmap, will attempt to join an existing cluster
Hi Sage, On 25 July 2013 17:21, Sage Weil wrote: > I suspect the difference here is that the dns names you are specifying in > ceph-deploy new do not match. Aha, this could well be the problem. The current DNS names resolve to the address bound to an interface that is intended to be used mostly for things like monitoring and SSH, not the actual storage. There is a separate subnet for the hypervisors to talk to the cluster on (i.e. what Ceph considers the "public network"), and one for the OSDs to talk about OSD stuff on ("cluster network"). > Are you adjusting the 'mon host' line in > ceph.conf? Note that you can specify a fqdn to ceph-deploy new and it > will take the first name to be the hostname, or you can specify > 'ceph-deploy new name:fqdn_or_ip'. I had been changing the "mon host" line to the current "public network" IP addresses; re-running "ceph-deploy new" with the "public network" IPs generates an identical config file (as far as I can tell) and still fails with the same assertion failure. Thanks, Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
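For reference, a sketch of the explicit form Sage mentions, with hostnames and addresses invented for illustration; the idea is to hand ceph-deploy the monitor name and the storage-facing address directly instead of relying on DNS that resolves to the management interface:

ceph-deploy new node1:10.20.0.1 node2:10.20.0.2 node3:10.20.0.3

# and in the generated ceph.conf
[global]
    mon host = 10.20.0.1,10.20.0.2,10.20.0.3
    public network  = 10.20.0.0/24    # hypervisors / clients
    cluster network = 10.30.0.0/24    # OSD replication traffic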
Re: [ceph-users] testing ceph - very slow write performances
yes, those drives are horrible, and you have them partitioned etc. - don't use MDADM for Ceph OSDs, in my experience it *does* impair performance, it just doesn't play nice with OSDs. -- Ceph does its own block replication - though be careful, a size of "2" is not necessarily as "safe" as raid10 (lose any 2 drives vs. lose 2 specific drives) - For each write, it's going to write to Ceph's journal, then that OSD is going to ensure that each write is synced to other journals (depending on how many copies you have etc) - BEFORE it returns (latency!) If it is just a test run : try dedicating a drive to the OSD, and a drive to the OS. To see the impact of not having SSD journals, or latency on second writes - try setting replication size to 1 (not great/ideal - but gives you an idea of how much that extra sync write for the replicated writes is having on performance etc). Ceph really really shines when it has solid state for its write journalling. The black caviar drives are not fantastic for latency either, that can have a significant impact (particularly for the journal!). \\chris - Original Message - From: "Sébastien RICCIO" To: ceph-users@lists.ceph.com Sent: Thursday, 25 July, 2013 11:27:48 PM Subject: [ceph-users] testing ceph - very slow write performances Hi ceph-users, I'm actually evaluating ceph for a project and I'm getting quite low write performances, so please if you have time reading this post and give me some advices :) My test setup using some free hardware we have laying in our datacenter: Three ceph server nodes, on each one is running a monitor and two OSDs and one client node Hardware of a node: (supermicro stuff) Intel(R) Xeon(R) CPU X3440 @ 2.53GHz (total of 8 logical cores) 2 x Western Digital Caviar Black 1TO (WD1003FBYX-01Y7B0) 32 GB RAM DDR3 2 x Ehernet controller: Intel Corporation 82574L Gigabit Network Connection Hardware of the client: (A dell Blade M610) Dual Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (total of 16 logical cores) 64 GB RAM DDR3 4 x Ethernet controller: Broadcom Corporation NetXtreme II BCM5709S Gigabit Ethernet (rev 20) 2 x Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 10-Gigabit PCIe OS of the server nodes: Ubuntu 12.04.2 LTS Kernel 3.10.0-031000-generic #201306301935 SMP Sun Jun 30 23:36:16 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux OS of the client node: CentOS release 6.4 Kernel 3.10.1-1.el6xen.x86_64 #1 SMP Sun Jul 14 11:05:42 EST 2013 x86_64 x86_64 x86_64 GNU/Linux How I did setup the OS (server nodes): I know this isn't good but as there is only two disk in the machine I've partitionned the disks and used them both for the OS and the OSDs, but well for a test run it shouldn't be that bad... Disks layout: partition 1: mdadm raid 1 member for the OS (30gb) partition 2: mdadm raid 1 member for some swapspace (shouldn't be used anyway...) partition 3: reserved for xfs partition for OSDs Ceph installation: Tried both cuttlefish (0.56) and testing (0.66). Deployed using ceph-deploy from an admin node running on a xenserver 6.2 VM. 
#ceph-deploy new ceph01 ceph02 ceph03 (edited some ceph.conf stuff) #ceph-deploy install --stable cuttlefish ceph01 ceph02 ceph03 #ceph-deploy mon create ceph01 ceph02 ceph03 #ceph-deploy gatherkeys ceph01 #ceph-deploy osd create ceph01:/dev/sda3 ceph01:/dev/sdb3 ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3 #ceph-deploy osd activate ceph01:/dev/sda3 ceph01:/dev/sdb3 ceph02:/dev/sda3 ceph02:/dev/sdb3 ceph03:/dev/sda3 ceph03:/dev/sdb3 ceph-admin:~/cephstore$ ceph status health HEALTH_OK monmap e1: 3 mons at {ceph01=10.111.80.1:6789/0,ceph02=10.111.80.2:6789/0,ceph03=10.111.80.3:6789/0}, election epoch 6, quorum 0,1,2 ceph01,ceph02,ceph03 osdmap e26: 6 osds: 6 up, 6 in pgmap v258: 192 pgs: 192 active+clean; 1000 MB data, 62212 MB used, 5346 GB / 5407 GB avail mdsmap e1: 0/0/1 up Now let's do some performance testing from the client, accessing a rbd on the cluster. #rbd create test --size 2 #rbd map test raw write test (ouch something is wrong here) #dd if=/dev/zero of=/dev/rbd1 bs=1024k count=1000 oflag=direct 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 146.051 s, 7.2 MB/s raw read test (this seems quite ok for a gbit network) #dd if=/dev/rbd1 of=/dev/null bs=1024k count=1000 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 13.6368 s, 76.9 MB/s Trying to find the bottleneck networking testing between client and nodes (not 100% efficiency but not that bad) [ 3] local 10.111.80.1 port 37497 connected with 10.111.10.105 port 5001 [ 3] 0.0-10.0 sec 812 MBytes 681 Mbits/sec [ 3] local 10.111.80.2 port 55912 connected with 10.111.10.105 port 5001 [ 3] 0.0-10.0 sec 802 MBytes 673 Mbits/sec [ 3] local 10.111.80.3 port 45188 connected with 10.111.10.105 port 5001 [ 3] 0.0-10.1 sec 707 MBytes 589 Mbits/sec [ 3] local 10.111.10.105 port 43103 conn
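For the "set replication size to 1" experiment suggested above, a sketch of the commands, assuming the test image lives in the default "rbd" pool (size 1 is for benchmarking only, never for data that matters):

ceph osd pool set rbd size 1
# rerun the same dd test, or compare with the built-in benchmark
rados bench -p rbd 30 write
# restore the replica count afterwards
ceph osd pool set rbd size 2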
Re: [ceph-users] Kernel's rbd in 3.10.1
On 07/24/2013 09:37 PM, Mikaël Cluseau wrote: Hi, I have a bug in the 3.10 kernel under debian, be it a self compiled linux-stable from the git (built with make-kpkg) or the sid's package. I'm using format-2 images (ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)) to make snapshots and clones of a database for development purposes. So I have a replay of the database's logs on a ceph volume and I take a snapshots at fixed points in time : mount -> recover database until a given time -> umount -> snapshot -> go back to 1. In both cases, it works for a while (mount/umount cycles) and after some time it gives me the following error on mount : Jul 25 15:20:46 **host** kernel: [14623.808604] [ cut here ] Jul 25 15:20:46 **host** kernel: [14623.808622] kernel BUG at /build/linux-dT6LW0/linux-3.10.1/net/ceph/osd_client.c:2103! Jul 25 15:20:46 **host** kernel: [14623.808641] invalid opcode: [#1] SMP Jul 25 15:20:46 **host** kernel: [14623.808657] Modules linked in: cbc rbd libceph nfsd auth_rpcgss oid_registry nfs_acl nfs lockd sunrpc sha256_generic hmac nls_utf8 cifs dns_resolver fscache bridge stp llc xfs loop coretemp kvm_intel kvm crc32c_intel psmouse serio_raw snd_pcm snd_page_alloc snd_timer snd soundcore iTCO_wdt iTCO_vendor_support i2c_i801 i7core_edac microcode pcspkr lpc_ich mfd_core joydev ioatdma evdev edac_core acpi_cpufreq mperf button processor thermal_sys ext4 crc16 jbd2 mbcache btrfs xor zlib_deflate raid6_pq crc32c libcrc32c raid1 ohci_hcd hid_generic usbhid hid sr_mod sg cdrom sd_mod crc_t10dif dm_mod md_mod ata_generic ata_piix libata uhci_hcd ehci_pci ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core Jul 25 15:20:46 **host** kernel: [14623.809005] CPU: 6 PID: 9583 Comm: mount Not tainted 3.10-1-amd64 #1 Debian 3.10.1-1 Jul 25 15:20:46 **host** kernel: [14623.809024] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1b 12/30/2011 Jul 25 15:20:46 **host** kernel: [14623.809041] task: 88082dfa2840 ti: 88080e2c2000 task.ti: 88080e2c2000 Jul 25 15:20:46 **host** kernel: [14623.809059] RIP: 0010:[] [] ceph_osdc_build_request+0x370/0x3e9 [libceph] Jul 25 15:20:46 **host** kernel: [14623.809092] RSP: 0018:88080e2c39b8 EFLAGS: 00010216 Jul 25 15:20:46 **host** kernel: [14623.809120] RAX: 88082e589a80 RBX: 88082e589b72 RCX: 0007 Jul 25 15:20:46 **host** kernel: [14623.809151] RDX: 88082e589b6f RSI: 88082afd9078 RDI: 88082b308258 Jul 25 15:20:46 **host** kernel: [14623.809182] RBP: 1000 R08: 88082e10a400 R09: 88082afd9000 Jul 25 15:20:46 **host** kernel: [14623.809213] R10: 8806bfb1cd60 R11: 88082d153c01 R12: 88080e88e988 Jul 25 15:20:46 **host** kernel: [14623.809243] R13: 0001 R14: 88080eb874d8 R15: 88080eb875b8 Jul 25 15:20:46 **host** kernel: [14623.809275] FS: 7f2c893b77e0() GS:88083fc4() knlGS: Jul 25 15:20:46 **host** kernel: [14623.809322] CS: 0010 DS: ES: CR0: 8005003b Jul 25 15:20:46 **host** kernel: [14623.809350] CR2: ff600400 CR3: 0006bfbc6000 CR4: 07e0 Jul 25 15:20:46 **host** kernel: [14623.809381] DR0: DR1: DR2: Jul 25 15:20:46 **host** kernel: [14623.809413] DR3: DR6: 0ff0 DR7: 0400 Jul 25 15:20:46 **host** kernel: [14623.809442] Stack: Jul 25 15:20:46 **host** kernel: [14623.814598] 2201 88080e2c3a30 1000 88042edf2240 Jul 25 15:20:46 **host** kernel: [14623.814656] 0024a05cbb01 88082e84f348 88080e2c3a58 Jul 25 15:20:46 **host** kernel: [14623.814710] 88080eb874d8 88080e9aa290 88027abc6000 1000 Jul 25 15:20:46 **host** kernel: [14623.814765] Call Trace: Jul 25 15:20:46 **host** kernel: [14623.814793] [] ? 
rbd_osd_req_format_write+0x81/0x8c [rbd] Jul 25 15:20:46 **host** kernel: [14623.814827] [] ? rbd_img_request_fill+0x679/0x74f [rbd] Jul 25 15:20:46 **host** kernel: [14623.814865] [] ? should_resched+0x5/0x23 Jul 25 15:20:46 **host** kernel: [14623.814896] [] ? rbd_request_fn+0x180/0x226 [rbd] Jul 25 15:20:46 **host** kernel: [14623.814929] [] ? __blk_run_queue_uncond+0x1e/0x26 Jul 25 15:20:46 **host** kernel: [14623.814960] [] ? blk_queue_bio+0x299/0x2e8 Jul 25 15:20:46 **host** kernel: [14623.814990] [] ? generic_make_request+0x96/0xd5 Jul 25 15:20:46 **host** kernel: [14623.815021] [] ? submit_bio+0x10a/0x13b Jul 25 15:20:46 **host** kernel: [14623.815053] [] ? bio_alloc_bioset+0xd0/0x172 Jul 25 15:20:46 **host** kernel: [14623.815083] [] ? _submit_bh+0x1b7/0x1d4 Jul 25 15:20:46 **host** kernel: [14623.815117] [] ? __sync_dirty_buffer+0x4e/0x7b Jul 25 15:20:46 **host** kernel: [14623.815164] [] ? ext4_commit_super+0x192/0x1db [ext4] Jul 25 15:20:46 **host** kernel: [14623.815206] [] ? ext4_setup_super+0xff/0x146 [ext4] Jul 25 15:20:46 *
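For reference, the mount / recover / umount / snapshot cycle described above maps onto rbd commands roughly as follows (pool, image and snapshot names invented; cloning requires format-2 images, which the poster is already using):

rbd map pool/dbimage                       # appears as /dev/rbdN
mount /dev/rbd1 /mnt/db
# ... replay database logs up to the chosen point in time ...
umount /mnt/db
rbd unmap /dev/rbd1
rbd snap create pool/dbimage@pit-20130725
rbd snap protect pool/dbimage@pit-20130725
rbd clone pool/dbimage@pit-20130725 pool/dev-copy-20130725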
Re: [ceph-users] Mounting RBD or CephFS on Ceph-Node?
On 07/23/2013 06:09 AM, Oliver Schulz wrote: Dear Ceph Experts, I remember reading that, at least in the past, it wasn't recommended to mount Ceph storage on a Ceph cluster node. Given a recent kernel (3.8/3.9) and sufficient CPU and memory resources on the nodes, would it now be safe to * Mount RBD or CephFS on a Ceph cluster node? This will probably always be unsafe for kernel clients [1] [2]. * Run a VM that is based on RBD storage (libvirt?) and/or mounts CephFS on a Ceph node? Using libvirt/qemu+librbd or ceph-fuse is fine, since they are userspace. Using a kernel client inside a VM would work too. Josh [1] http://wiki.ceph.com/03FAQs/01General_FAQ#How_Can_I_Give_Ceph_a_Try.3F [2] http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/1673 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
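Concretely, the userspace options Josh refers to look roughly like this (monitor address, pool and image names are placeholders; cephx auth and monitor host entries are omitted from the libvirt snippet for brevity):

# CephFS via ceph-fuse instead of the kernel mount
ceph-fuse -m 10.0.0.1:6789 /mnt/cephfs

# RBD via qemu/librbd instead of the kernel rbd driver, e.g. a libvirt disk:
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/vm-disk-1'/>
  <target dev='vda' bus='virtio'/>
</disk>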
[ceph-users] add crush rule in one command
Hi folks, Recently, I have been using puppet to deploy Ceph and integrate Ceph with OpenStack. We put compute and storage together in the same cluster, so nova-compute and OSDs will be in each server. We will create a local pool for each server, and the pool only uses the disks of that server. Local pools will be used by Nova for the root disk and ephemeral disk. In order to use the local pools, I need to add some rules for the local pools to ensure they use only local disks. There is currently only one way to add a rule in ceph: 1. ceph osd getcrushmap -o crush-map 2. crushtool -c crush-map.txt -o new-crush-map 3. ceph osd setcrushmap -i new-crush-map If multiple servers set the crush map simultaneously (the puppet agent will do that), there is the possibility of consistency problems. So a command for adding a rule would be very convenient, such as: *ceph osd crush add rule -i new-rule-file* Could I add the command into Ceph? Cheers, -- Rongze Zhu - 朱荣泽 Email: zrz...@gmail.com Blog: http://way4ever.com Weibo: http://weibo.com/metaxen Github: https://github.com/zhurongze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
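For completeness, the full round trip includes a decompile/edit step, and a rule restricting a "local" pool to one host looks something like the sketch below (ruleset number, pool name and the ceph01 host bucket are examples; this assumes a host bucket with that name already exists in the map, and, as Greg notes in his reply further down, single-host pools carry their own availability trade-offs):

ceph osd getcrushmap -o crush-map
crushtool -d crush-map -o crush-map.txt
# append a rule like this to crush-map.txt:
rule local-ceph01 {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take ceph01
    step chooseleaf firstn 0 type osd
    step emit
}
crushtool -c crush-map.txt -o new-crush-map
ceph osd setcrushmap -i new-crush-map
ceph osd pool set nova-local-ceph01 crush_ruleset 3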
Re: [ceph-users] What is this HEALTH_WARN indicating?
Any idea how we tweak this? If I want to keep my ceph node root volume at 85% used, that's my business, man. Thanks. --Greg On Mon, Jul 8, 2013 at 4:27 PM, Mike Bryant wrote: > Run "ceph health detail" and it should give you more information. > (I'd guess an osd or mon has a full hard disk) > > Cheers > Mike > > On 8 July 2013 21:16, Jordi Llonch wrote: >> Hello, >> >> I am testing ceph using ubuntu raring with ceph version 0.61.4 >> (1669132fcfc27d0c0b5e5bb93ade59d147e23404) on 3 virtualbox nodes. >> >> What is this HEALTH_WARN indicating? >> >> # ceph -s >>health HEALTH_WARN >>monmap e3: 3 mons at >> {node1=192.168.56.191:6789/0,node2=192.168.56.192:6789/0,node3=192.168.56.193:6789/0}, >> election epoch 52, quorum 0,1,2 node1,node2,node3 >>osdmap e84: 3 osds: 3 up, 3 in >> pgmap v3209: 192 pgs: 192 active+clean; 460 MB data, 1112 MB used, 135 >> GB / 136 GB avail >>mdsmap e37: 1/1/1 up {0=node3=up:active}, 1 up:standby >> >> >> Thanks, >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > -- > Mike Bryant | Systems Administrator | Ocado Technology > mike.bry...@ocado.com | 01707 382148 | www.ocadotechnology.com > > -- > Notice: This email is confidential and may contain copyright material of > Ocado Limited (the "Company"). Opinions and views expressed in this message > may not necessarily reflect the opinions and views of the Company. > > If you are not the intended recipient, please notify us immediately and > delete all copies of this message. Please note that it is your > responsibility to scan this message for viruses. > > Company reg. no. 3875000. > > Ocado Limited > Titan Court > 3 Bishops Square > Hatfield Business Park > Hatfield > Herts > AL10 9NE > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Upgrade from 0.61.4 to 0.61.6 mon failed. Upgrade to 0.61.7 mon still failed.
Hi all, 2 days ago, i upgraded one of my mon from 0.61.4 to 0.61.6. The mon failed to start. I checked the mailing list and found reports of mon failed after upgrading to 0.61.6. So I wait for the next release and upgraded the failed mon from 0.61.6 to 0.61.7. My mon still fail to start up. Here is the mon log: root@atlas3-c1:/var/log/ceph# tail -100 /var/log/ceph/ceph-mon.atlas3-c1.log 2013-07-26 10:45:56.782321 7fa7df837700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption 2013-07-26 10:45:56.782329 7fa7df837700 0 -- 172.18.185.73:6789/0 >> 172.18.185.79:6789/0 pipe(0x1c91c80 sd=33 :53442 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply 2013-07-26 10:45:58.781375 7fa7e123c700 4 mon.atlas3-c1@0(probing) e4 probe_timeout 0x1c574b0 2013-07-26 10:45:58.781386 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 bootstrap 2013-07-26 10:45:58.781389 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 unregister_cluster_logger - not registered 2013-07-26 10:45:58.781392 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:45:58.781395 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset_sync 2013-07-26 10:45:58.781398 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset 2013-07-26 10:45:58.781400 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:45:58.781402 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 timecheck_finish 2013-07-26 10:45:58.781404 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 scrub_reset 2013-07-26 10:45:58.781411 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:45:58.781414 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset_probe_timeout 0x1c57440 after 2 seconds 2013-07-26 10:45:58.781424 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 probing other monitors 2013-07-26 10:45:58.781833 7fa7df938700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:45:58.781853 7fa7e696c700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:45:58.782037 7fa7dfa39700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:45:58.782165 7fa7df837700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:45:58.782171 7fa7df938700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption 2013-07-26 10:45:58.782171 7fa7e696c700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption 2013-07-26 10:45:58.782177 7fa7df938700 0 -- 172.18.185.73:6789/0 >> 172.18.185.78:6789/0 pipe(0x1c91280 sd=33 :40770 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply 2013-07-26 10:45:58.782179 7fa7e696c700 0 -- 172.18.185.73:6789/0 >> 172.18.185.74:6789/0 pipe(0x1c91a00 sd=30 :48828 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply 2013-07-26 10:45:58.782399 7fa7dfa39700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption 2013-07-26 10:45:58.782418 7fa7dfa39700 0 -- 172.18.185.73:6789/0 >> 172.18.185.77:6789/0 pipe(0x1c91780 sd=32 :44505 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply 2013-07-26 10:45:58.782447 7fa7df837700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption 2013-07-26 10:45:58.782455 7fa7df837700 0 -- 172.18.185.73:6789/0 >> 172.18.185.79:6789/0 pipe(0x1c91c80 sd=31 :53445 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply 2013-07-26 10:46:00.733745 7fa7e123c700 11 mon.atlas3-c1@0(probing) e4 tick 2013-07-26 10:46:00.781471 7fa7e123c700 4 
mon.atlas3-c1@0(probing) e4 probe_timeout 0x1c57440 2013-07-26 10:46:00.781479 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 bootstrap 2013-07-26 10:46:00.781481 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 unregister_cluster_logger - not registered 2013-07-26 10:46:00.781483 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:46:00.781486 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset_sync 2013-07-26 10:46:00.781488 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset 2013-07-26 10:46:00.781490 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:46:00.781492 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 timecheck_finish 2013-07-26 10:46:00.781495 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 scrub_reset 2013-07-26 10:46:00.781500 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 cancel_probe_timeout (none scheduled) 2013-07-26 10:46:00.781502 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 reset_probe_timeout 0x1c57590 after 2 seconds 2013-07-26 10:46:00.781511 7fa7e123c700 10 mon.atlas3-c1@0(probing) e4 probing other monitors 2013-07-26 10:46:00.781984 7fa7dfa39700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:46:00.782005 7fa7e696c700 10 mon.atlas3-c1@0(probing) e4 ms_get_authorizer for mon 2013-07-26 10:46:00.782204 7fa7df938700 10 mon.atlas3-c1@0(probing) e4 ms_get_author
Re: [ceph-users] add crush rule in one command
On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu wrote: > Hi folks, > > Recently, I use puppet to deploy Ceph and integrate Ceph with OpenStack. We > put computeand storage together in the same cluster. So nova-compute and > OSDs will be in each server. We will create a local pool for each server, > and the pool only use the disks of each server. Local pools will be used by > Nova for root disk and ephemeral disk. Hmm, this is constraining Ceph quite a lot; I hope you've thought about what this means in terms of data availability and even utilization of your storage. :) > In order to use the local pools, I need add some rules for the local pools > to ensure the local pools using only local disks. There is only way to add > rule in ceph: > > ceph osd getcrushmap -o crush-map > crushtool -c crush-map.txt -o new-crush-map > ceph osd setcrushmap -i new-crush-map > > If multiple servers simultaneously set crush map(puppet agent will do that), > there is the possibility of consistency problems. So if there is an command > for adding rule, which will be very convenient. Such as: > > ceph osd crush add rule -i new-rule-file > > Could I add the command into Ceph? We love contributions to Ceph, and this is an obvious hole in our atomic CLI-based CRUSH manipulation which a fix would be welcome for. Please be aware that there was a significant overhaul to the way these commands are processed internally between Cuttlefish and Dumpling-to-be that you'll need to deal with if you want to cross that boundary. I also recommend looking carefully at how we do the individual pool changes and how we handle whole-map injection to make sure the interface you use and the places you do data extraction makes sense. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] What is this HEALTH_WARN indicating?
On Thu, Jul 25, 2013 at 7:42 PM, Greg Chavez wrote: > Any idea how we tweak this? If I want to keep my ceph node root > volume at 85% used, that's my business, man. There are config options you can set. On the monitors they are "mon osd full ratio" and "mon osd nearfull ratio"; on the OSDs you may (not) want to change "osd failsafe full ratio" and "osd failsafe nearfull ratio". However, you should be *extremely careful* modifying these values. Linux local filesystems don't much like to get this full to begin with, and if you fill up an OSD enough that the local FS starts failing to perform writes your cluster will become extremely unhappy. The OSD works hard to prevent doing permanent damage, but its prevention mechanisms tend to involve stopping all work. You should also consider what happens if the cluster is that full and you lose a node. Recovering from situations where clusters get past these points tends to involve manually moving data and babysitting things for a while; the values are as low as they are in order to provide a safety net in case you actually do hit them. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
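For the question of raising those thresholds, a hedged sketch of how they are usually changed (values illustrative; per the caveats above, pushing them close to 100% is risky, and the exact commands available vary a little between releases):

# cluster-wide at runtime, if your version has these commands
ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.90

# and/or persistently in ceph.conf on the monitors
[mon]
    mon osd full ratio = .95
    mon osd nearfull ratio = .90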
Re: [ceph-users] add crush rule in one command
On Fri, Jul 26, 2013 at 1:22 PM, Gregory Farnum wrote: > On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu > wrote: > > Hi folks, > > > > Recently, I use puppet to deploy Ceph and integrate Ceph with OpenStack. > We > > put computeand storage together in the same cluster. So nova-compute and > > OSDs will be in each server. We will create a local pool for each server, > > and the pool only use the disks of each server. Local pools will be used > by > > Nova for root disk and ephemeral disk. > > Hmm, this is constraining Ceph quite a lot; I hope you've thought > about what this means in terms of data availability and even > utilization of your storage. :) > We also will create global pool for Cinder, the IOPS of global pool will be betther than local pool. The benefit of local pool is reducing the network traffic between servers and Improving the management of storage. We use one same Ceph Gluster for Nova,Cinder,Glance, and create different pools(and diffenrent rules) for them. Maybe it need more testing :) > > > In order to use the local pools, I need add some rules for the local > pools > > to ensure the local pools using only local disks. There is only way to > add > > rule in ceph: > > > > ceph osd getcrushmap -o crush-map > > crushtool -c crush-map.txt -o new-crush-map > > ceph osd setcrushmap -i new-crush-map > > > > If multiple servers simultaneously set crush map(puppet agent will do > that), > > there is the possibility of consistency problems. So if there is an > command > > for adding rule, which will be very convenient. Such as: > > > > ceph osd crush add rule -i new-rule-file > > > > Could I add the command into Ceph? > > We love contributions to Ceph, and this is an obvious hole in our > atomic CLI-based CRUSH manipulation which a fix would be welcome for. > Please be aware that there was a significant overhaul to the way these > commands are processed internally between Cuttlefish and > Dumpling-to-be that you'll need to deal with if you want to cross that > boundary. I also recommend looking carefully at how we do the > individual pool changes and how we handle whole-map injection to make > sure the interface you use and the places you do data extraction makes > sense. :) > Thank you for your quick reply, it is very useful for me :) > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > -- Rongze Zhu - 朱荣泽 Email: zrz...@gmail.com Blog:http://way4ever.com Weibo: http://weibo.com/metaxen Github: https://github.com/zhurongze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] add crush rule in one command
On Fri, Jul 26, 2013 at 2:27 PM, Rongze Zhu wrote: > > > > On Fri, Jul 26, 2013 at 1:22 PM, Gregory Farnum wrote: > >> On Thu, Jul 25, 2013 at 7:41 PM, Rongze Zhu >> wrote: >> > Hi folks, >> > >> > Recently, I use puppet to deploy Ceph and integrate Ceph with >> OpenStack. We >> > put computeand storage together in the same cluster. So nova-compute and >> > OSDs will be in each server. We will create a local pool for each >> server, >> > and the pool only use the disks of each server. Local pools will be >> used by >> > Nova for root disk and ephemeral disk. >> >> Hmm, this is constraining Ceph quite a lot; I hope you've thought >> about what this means in terms of data availability and even >> utilization of your storage. :) >> > > We also will create global pool for Cinder, the IOPS of global pool will > be betther than local pool. > The benefit of local pool is reducing the network traffic between servers > and Improving the management of storage. We use one same Ceph Gluster for > Nova,Cinder,Glance, and create different pools(and diffenrent rules) for > them. Maybe it need more testing :) > s/Gluster/Cluster/g > > >> >> > In order to use the local pools, I need add some rules for the local >> pools >> > to ensure the local pools using only local disks. There is only way to >> add >> > rule in ceph: >> > >> > ceph osd getcrushmap -o crush-map >> > crushtool -c crush-map.txt -o new-crush-map >> > ceph osd setcrushmap -i new-crush-map >> > >> > If multiple servers simultaneously set crush map(puppet agent will do >> that), >> > there is the possibility of consistency problems. So if there is an >> command >> > for adding rule, which will be very convenient. Such as: >> > >> > ceph osd crush add rule -i new-rule-file >> > >> > Could I add the command into Ceph? >> >> We love contributions to Ceph, and this is an obvious hole in our >> atomic CLI-based CRUSH manipulation which a fix would be welcome for. >> Please be aware that there was a significant overhaul to the way these >> commands are processed internally between Cuttlefish and >> Dumpling-to-be that you'll need to deal with if you want to cross that >> boundary. I also recommend looking carefully at how we do the >> individual pool changes and how we handle whole-map injection to make >> sure the interface you use and the places you do data extraction makes >> sense. :) >> > > Thank you for your quick reply, it is very useful for me :) > > >> -Greg >> Software Engineer #42 @ http://inktank.com | http://ceph.com >> > > > > -- > > Rongze Zhu - 朱荣泽 > Email: zrz...@gmail.com > Blog:http://way4ever.com > Weibo: http://weibo.com/metaxen > Github: https://github.com/zhurongze > -- Rongze Zhu - 朱荣泽 Email: zrz...@gmail.com Blog:http://way4ever.com Weibo: http://weibo.com/metaxen Github: https://github.com/zhurongze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com