Re: [ceph-users] Issues going from 1 to 3 mons
On 06/24/2013 07:50 PM, Gregory Farnum wrote:
> On Mon, Jun 24, 2013 at 10:36 AM, Jeppesen, Nelson wrote:
>> What do you mean ‘bring up the second monitor with enough information’?
>>
>> Here are the basic steps I took. It fails on step 4. If I skip step 4, I get
>> a number out of range error.
>>
>> 1. ceph auth get mon. -o /tmp/auth
>> 2. ceph mon getmap -o /tmp/map
>> 3. sudo ceph-mon -i 1 --mkfs --monmap /tmp/map --keyring /tmp/auth
>> 4. ceph mon add 1 [<ip>:<port>]
>
> What's the failure here? Does it not return, or does it stop working
> after that? I'd expect that following it with

it does not return. I just ran into the same issue:

# ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring --public-addr x.y.z.b:6789
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b
# ceph mon add b 46.20.16.22:6789
2013-06-25 10:00:25.659006 7f28ec5fa700  0 monclient: hunting for new mon

just sits there forever. On mon a I see:

# ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
  "rank": 0,
  "state": "probing",
  "election_epoch": 1,
  "quorum": [],
  "outside_quorum": [
        "a"],
  "extra_probe_peers": [],
  "monmap": { "epoch": 14,
      "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
      "modified": "2013-06-25 10:00:14.004097",
      "created": "2013-06-24 15:06:08.472355",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "46.20.16.21:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "46.20.16.22:6789\/0"}]}}

It seems the docs here: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ are misleading. I also just can't extend my mons (cuttlefish 0.61.4) at the moment, which is bad. ceph-deploy complains about a missing fsid...

>> 5. ceph-mon -i 1 --public-addr {ip:port}
>
> should work...
>
> Oh, I think I see — mon 1 is starting up and not seeing itself in the
> monmap so it then shuts down. You'll need to convince it to turn on
> and contact mon.0; I don't remember exactly how to do that (Joao?) but
> I think you should be able to find what you need at
> http://ceph.com/docs/master/dev/mon-bootstrap
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
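For reference, the sequence above is essentially the documented one; the part that trips people up is that once "ceph mon add" has put the second monitor into the monmap, a two-monitor cluster needs both monitors for quorum, so the command may appear to hang until the new daemon is actually up and has joined. A rough sketch of the whole flow, assuming a new monitor named mon.b at 192.168.0.2 (name and address are placeholders, not taken from the thread):

# gather the keyring and the current monmap from the running cluster
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap

# build the new monitor's data directory
ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring

# add it to the monmap and start the daemon right away; "ceph mon add"
# may block until the new monitor has joined, so background it or run
# it from a second shell
ceph mon add b 192.168.0.2:6789 &
ceph-mon -i b --public-addr 192.168.0.2:6789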
Re: [ceph-users] Issues going from 1 to 3 mons
On 06/25/2013 09:06 AM, Wolfgang Hennerbichler wrote: On 06/24/2013 07:50 PM, Gregory Farnum wrote: On Mon, Jun 24, 2013 at 10:36 AM, Jeppesen, Nelson wrote: What do you mean ‘bring up the second monitor with enough information’? Here are the basic steps I took. It fails on step 4. If I skip step 4, I get a number out of range error. 1. ceph auth get mon. -o /tmp/auth 2. ceph mon getmap -o /tmp/map 3. sudo ceph-mon -i 1 --mkfs --monmap /tmp/map --keyring /tmp/auth 4. ceph mon add 1 [:] What's the failure here? Does it not return, or does it stop working after that? I'd expect that following it with it does not return. I just ran into the same issue: # ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring --public-addr x.y.z.b:6789 ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b # ceph mon add b 46.20.16.22:6789 2013-06-25 10:00:25.659006 7f28ec5fa700 0 monclient: hunting for new mon just sits there forever. On mon a I see: # ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status { "name": "a", "rank": 0, "state": "probing", "election_epoch": 1, "quorum": [], "outside_quorum": [ "a"], "extra_probe_peers": [], "monmap": { "epoch": 14, "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8", "modified": "2013-06-25 10:00:14.004097", "created": "2013-06-24 15:06:08.472355", "mons": [ { "rank": 0, "name": "a", "addr": "46.20.16.21:6789\/0"}, { "rank": 1, "name": "b", "addr": "46.20.16.22:6789\/0"}]}} What happens when you run the same command for mon.b ? -Joao it seems the docs here: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ are misleading. I also just can't extend my mon's (cuttlefish 0.61.4) currently, which is bad. ceph-deploy complains about missing a fsid... 5. ceph-mon -i 1 --public-addr {ip:port} should work... Oh, I think I see — mon 1 is starting up and not seeing itself in the monmap so it then shuts down. You'll need to convince it to turn on and contact mon.0; I don't remember exactly how to do that (Joao?) but I think you should be able to find what you need at http://ceph.com/docs/master/dev/mon-bootstrap -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issues going from 1 to 3 mons
On 06/25/2013 11:45 AM, Joao Eduardo Luis wrote:
>> On mon a I see:
>>
>> # ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
>> { "name": "a",
>>    "rank": 0,
>>    "state": "probing",
>>    "election_epoch": 1,
>>    "quorum": [],
>>    "outside_quorum": [
>>          "a"],
>>    "extra_probe_peers": [],
>>    "monmap": { "epoch": 14,
>>        "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
>>        "modified": "2013-06-25 10:00:14.004097",
>>        "created": "2013-06-24 15:06:08.472355",
>>        "mons": [
>>              { "rank": 0,
>>                "name": "a",
>>                "addr": "46.20.16.21:6789\/0"},
>>              { "rank": 1,
>>                "name": "b",
>>                "addr": "46.20.16.22:6789\/0"}]}}
>
> What happens when you run the same command for mon.b ?

# ceph mon add b x.y.z.b:6789
^Z
[1]+  Stopped                 ceph mon add b x.y.z.b:6789
root@rd-clusternode22:/etc/ceph# bg
[1]+ ceph mon add b x.y.z.b:6789 &
# 2013-06-25 11:48:56.136659 7f5b419a3700  0 monclient: hunting for new mon
# ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
connect to /run/ceph/ceph-mon.a.asok failed with (2) No such file or directory

it can't be started and isn't running, so I guess that's why we wouldn't get anything back from the socket here...

> -Joao

wogri_risc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
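At this point it may be worth confirming whether mon.b ever started and, if it exited, what it printed on the way out; its admin socket only exists while the daemon is running. A quick check (log path is the default location, adjust if yours differs):

# is the new monitor process actually running?
ps aux | grep ceph-mon

# if not, the monitor log usually shows why it exited
tail -n 50 /var/log/ceph/ceph-mon.b.log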
[ceph-users] Radosgw and Cors
We have a working Ceph cluster running version 0.61.4, and we are trying to use some example applications [1,2] to test direct upload to radosgw using CORS. With a patched boto [3] we are able to get and set the XML CORS configuration on a bucket. However, when using either of the apps, Chrome gives us an Access-Control-Allow-Origin error. Honestly, from what I can understand of the radosgw logs everything seems to work fine; can you give me some hints?

These are the radosgw logs when using Frantic-S3-Browser [4] and s3staticuploader [5], two really simple apps we use for testing.

[1] https://github.com/frc/Frantic-S3-Browser
[2] https://github.com/thrashr888/s3staticuploader
[3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/002104.html
[4] http://pastebin.com/Zmq9gfQ1
[5] http://pastebin.com/6KGzMb5K
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
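One thing that is easy to miss with browser-based uploads: before the PUT, the browser sends an OPTIONS preflight request, and the upload is only allowed to proceed if the gateway answers it with matching Access-Control-Allow-Origin/-Methods headers. A quick way to see what radosgw actually returns, independent of the browser (a sketch; host, bucket and origin are placeholders):

# simulate the browser's CORS preflight against the gateway
curl -i -X OPTIONS \
  -H "Origin: http://example.com" \
  -H "Access-Control-Request-Method: PUT" \
  http://rgw.example.com/mybucket/testobject

If the Access-Control-Allow-* headers are missing from that response, the browser will block the upload even though the gateway log shows the request completing.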
Re: [ceph-users] Issues going from 1 to 3 mons
On 06/25/2013 10:52 AM, Wolfgang Hennerbichler wrote: On 06/25/2013 11:45 AM, Joao Eduardo Luis wrote: On mon a I see: # ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status { "name": "a", "rank": 0, "state": "probing", "election_epoch": 1, "quorum": [], "outside_quorum": [ "a"], "extra_probe_peers": [], "monmap": { "epoch": 14, "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8", "modified": "2013-06-25 10:00:14.004097", "created": "2013-06-24 15:06:08.472355", "mons": [ { "rank": 0, "name": "a", "addr": "46.20.16.21:6789\/0"}, { "rank": 1, "name": "b", "addr": "46.20.16.22:6789\/0"}]}} What happens when you run the same command for mon.b ? # ceph mon add b x.y.z.b:6789 ^Z [1]+ Stopped ceph mon add b x.y.z.b:6789 root@rd-clusternode22:/etc/ceph# bg [1]+ ceph mon add b x.y.x.b:6789 & # 2013-06-25 11:48:56.136659 7f5b419a3700 0 monclient: hunting for new mon # ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status connect to /run/ceph/ceph-mon.a.asok failed with (2) No such file or directory it can't be started and isn't running, so I guess that's why we wouldn't get anything from socket back here... Wolfgang, can you set 'debug mon = 20', rerun the monitor and then send the log my way so I can take a look at why is the monitor not starting? -Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
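For anyone following along, bumping the monitor debug level can be done either on the command line while running the daemon in the foreground, or persistently in ceph.conf; roughly:

# run the failing monitor in the foreground with verbose mon logging
ceph-mon -i b -d --debug-mon 20 --public-addr 46.20.16.22:6789

# or add the following to the [mon] section of ceph.conf and restart the
# daemon; the log then ends up in /var/log/ceph/ceph-mon.b.log by default
#   debug mon = 20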
[ceph-users] Replication between 2 datacenter
Hi folks,

I have a question concerning data replication using the crushmap.

Is it possible to write a crushmap to achieve a 2-times-2 replication, in the sense that I have pool replication in one data center and an overall replication of this in the backup datacenter?

Best regards
Joachim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] increasing stability
On 05/30/2013 11:06 PM, Sage Weil wrote: > Hi everyone, Hi again, > I wanted to mention just a few things on this thread. Thank you for taking the time. > The first is obvious: we are extremely concerned about stability. > However, Ceph is a big project with a wide range of use cases, and it is > difficult to cover them all. For that reason, Inktank is (at least for > the moment) focusing in specific areas (rados, librbd, rgw) and certain > platforms. We have a number of large production customers and > non-customers now who have stable environments, and we are committed to a > solid experience for them. And I really appreciate that. > We are investing heavily in testing infrastructure and automation tools to > maximize our ability to test with limited resources. Our lab is currently > around 14 racks, with most of the focus now on utilizing those resources > as effectively as possible. The teuthology testing framework continues to > evolve and our test suites continue to grow. Unfortunatley, this has been > an area where it has been difficult for others to contribute. We are > eager to talk to anyone who is interested in helping. what we as a community can do is provide feedback with our test-cases. and I think you're doing a great job of supporting the community. > Overall, the cuttlefish release has gone much more smoothly than bobtail > did. That said, there are a few lingering problems, particularly with the > monitor's use of leveldb. We're waiting on some QA on the pending fixes > now before we push out a 0.61.3 that I believe will resolve the remaining > problems for most users. I upgraded to 0.61.4 on a production system today, and it went all smooth. I was really nervous things could blow up. I can't add monitors though. I have another thread going on, so don't bother. What I want to say is: This needs to work. In my mind the mon issues must all be fixed. If I were Inktank I would freeze all further features, and fix all bugs (I know this is boring, but business-critical) until ceph gets so stable that there are no more complaints by users. You are so close. Right now when I promote ceph and people ask me: but is it stable? I still have to say: It's almost there. > However, as overall adoption of ceph increases, we move past the critical > bugs and start seeing a larger number of "long-tail" issues that affect > smaller sets of users. Overall this is a good thing, even if it means a > harder job for the engineers to triage and track down obscure problems. I realize this is very hard, and maybe very boring. > The mailing list is going to attract a high number of bug reports because > that's what it is for. Although we believe the quality is getting better > based on our internal testing and our commercial interactions, we'd like > to turn this into a more metrics driven analysis. We welcome any ideas on > how to do this, as the obvious ideas (like counting bugs) tend to scale > with the number of users, and we have no way of telling how many users > there really are. I really want to see you succeed big time. Ceph is one of the best things that have come to my mind since a long time. I don't want to tell you what to do, because you will know it better than me. All I am saying is: If you make it very robust, people will not stop buying support contracts. > Thanks- > sage Thank you, sage. We all owe you more than a 'thank you'. wogri ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issues going from 1 to 3 mons
(Re-adding the list for future reference)

Wolfgang, from your log file:

2013-06-25 14:58:39.739392 7fa329698780 -1 common/config.cc: In function 'void md_config_t::set_val_or_die(const char*, const char*)' thread 7fa329698780 time 2013-06-25 14:58:39.738501
common/config.cc: 621: FAILED assert(ret == 0)

ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
1: /usr/bin/ceph-mon() [0x660736]
2: /usr/bin/ceph-mon() [0x699d66]
3: (pick_addresses(CephContext*)+0x93) [0x69a1a3]
4: (main()+0x1e3f) [0x48256f]
5: (__libc_start_main()+0xed) [0x7fa3278f576d]
6: /usr/bin/ceph-mon() [0x4848bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This was initially reported on ticket #5205. Sage fixed it last night, for ticket #5195. Gary reports it fixed using Sage's patch, and said fix was backported to the cuttlefish branch.

It's worth mentioning that the cuttlefish branch also contains a couple of commits that should boost monitor performance and avoid leveldb hangups. Looking into #5195 (http://tracker.ceph.com/issues/5195) for more info is advised.

Let us know if you decide to try the cuttlefish branch (on the monitors) and whether it fixes the issue for you.

Thanks!

  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Drive replacement procedure
Sorry, I forgot to mention ceph osd set noout. Sébastien Han wrote a blog post about it. http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/ Dave Spano Optogenics - Original Message - From: "Michael Lowe" To: "Nigel Williams" Cc: ceph-users@lists.ceph.com Sent: Monday, June 24, 2013 7:41:02 PM Subject: Re: [ceph-users] Drive replacement procedure That's where 'ceph osd set noout' comes in handy. On Jun 24, 2013, at 7:28 PM, Nigel Williams wrote: > On 25/06/2013 5:59 AM, Brian Candler wrote: >> On 24/06/2013 20:27, Dave Spano wrote: >>> Here's my procedure for manually adding OSDs. > > The other thing I discovered is not to wait between steps; some changes > result in a new crushmap, that then triggers replication. You want to speed > through the steps so the cluster does not waste time moving objects around to > meet the replica requirements until you have finished crushmap changes. > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
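For the archive, the noout-based flow referenced above looks roughly like this (osd.12 and the drive are placeholders; the init commands assume the sysvinit script used at the time):

# keep CRUSH from marking the OSD out while the disk is being swapped
ceph osd set noout

# stop the daemon, swap the drive, bring the OSD back
service ceph stop osd.12
# ... physically replace the disk, recreate and mount the filesystem ...
service ceph start osd.12

# let the cluster handle failures normally again
ceph osd unset noout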
[ceph-users] [WRN] 1 slow requests on VM with large disks
Hi,

at the moment I have a small problem whenever the Ceph cluster does a remapping. In VMs with large disks (4 TB each) the guest operating system freezes. The freeze is always accompanied by the message "[WRN] 1 slow requests". At the moment bobtail is installed.

Does anyone have an idea how I can avoid the OS freezes?

Thanks and Regards
Harald Roessler
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
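A first step that may help narrow this down: find out which OSDs are reporting the slow requests and what the stuck operations are waiting on. A sketch, assuming bobtail's admin socket commands are available and using the default socket path (osd.3 is a placeholder):

# which OSDs are currently complaining about slow requests?
ceph health detail

# on the node hosting one of them, dump the operations that are stuck
ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_ops_in_flight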
Re: [ceph-users] Replication between 2 datacenter
On Tue, 25 Jun 2013, joachim.t...@gad.de wrote:
> Hi folks,
>
> I have a question concerning data replication using the crushmap.
>
> Is it possible to write a crushmap to achieve a 2-times-2 replication, in the
> sense that I have pool replication in one data center and an overall replication
> of this in the backup datacenter?

Do you mean 2 replicas in datacenter A, and 2 more replicas in datacenter B?

Short answer: yes, but replication is synchronous, so it will generally only work well if the latency is low between the two sites.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
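For reference, the usual way to express "two copies in each datacenter" is a CRUSH rule that takes two replicas from each datacenter bucket, combined with a pool size of 4. The editing workflow would look roughly like this; the bucket names dc-a/dc-b, the ruleset number and the pool name are made up for illustration:

# pull down and decompile the current CRUSH map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# edit crush.txt: add datacenter buckets and a rule along the lines of
#   rule twosite {
#       ruleset 3
#       type replicated
#       min_size 4
#       max_size 4
#       step take dc-a
#       step chooseleaf firstn 2 type host
#       step emit
#       step take dc-b
#       step chooseleaf firstn 2 type host
#       step emit
#   }

# recompile, inject, and point the pool at the new rule with 4 replicas
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph osd pool set mypool crush_ruleset 3
ceph osd pool set mypool size 4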
[ceph-users] v0.65 released
Our next development release v0.65 is out, with a few big changes. First and foremost, this release includes a complete revamp of the architecture for the command line interface in order to lay the groundwork for our ongoing REST management API work. The 'ceph' command line tool is now a thin python wrapper around librados. Note that this set of changes includes several small incompatible changes in the interface that tools or scripts utilizing the CLI should be aware of; these are detailed in the complete release notes.

Other notable changes:

* mon, ceph: huge revamp of CLI and internal admin API. (Dan Mick)
* mon: new capability syntax
* osd: do not use fadvise(DONTNEED) on XFS (data corruption on power cycle)
* osd: recovery and peering performance improvements
* osd: new writeback throttling (for less bursty write performance) (Sam Just)
* osd: ping/heartbeat on public and private interfaces
* osd: avoid osd flapping from asymmetric network failure
* osd: re-use partially deleted PG contents when present (Sam Just)
* osd: break blacklisted client watches (David Zafman)
* mon: many stability fixes (Joao Luis)
* mon, osd: many memory leaks fixed
* mds: misc stability fixes (Yan, Zheng, Greg Farnum)
* mds: many backpointer improvements (Yan, Zheng)
* mds: new robust open-by-ino support (Yan, Zheng)
* ceph-fuse, libcephfs: fix a few caps revocation bugs
* librados: new calls to administer the cluster
* librbd: locking tests (Josh Durgin)
* ceph-disk: improved handling of odd device names
* ceph-disk: many fixes for RHEL/CentOS, Fedora, wheezy
* many many fixes from static code analysis (Danny Al-Gaaf)
* daemons: create /var/run/ceph as needed

The complete release notes, including upgrade notes, can be found at: http://ceph.com/docs/master/release-notes/#v0-65

We have one more sprint to go before the Dumpling feature freeze. Big items include monitor performance and stability improvements and multi-site and disaster recovery features for radosgw. Lots of radosgw work has already appeared in rgw-next, but these changes will not land until v0.67.

You can get v0.65 from the usual locations:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.65.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
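If you would rather try it from the tarball than from the packages, the build at this point is still the usual autotools dance; a sketch (build dependencies not shown, and a git checkout would need ./autogen.sh first):

wget http://ceph.com/download/ceph-0.65.tar.gz
tar xzf ceph-0.65.tar.gz
cd ceph-0.65
./configure && make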
[ceph-users] Empty osd and crushmap after mon restart?
Hi, I'm not sure what happened, but on a Ceph cluster I noticed that the monitors (running 0.61) started filling up the disks, so they were restarted with: mon compact on start = true After a restart the osdmap was empty, it showed: osdmap e2: 0 osds: 0 up, 0 in pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data, 243 GB used, 66789 GB / 67032 GB avail mdsmap e1: 0/0/1 up This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone. I also checked the crushmap, all 36 OSDs were removed, no trace of them. "ceph auth list" still showed their keys though. Restarting the OSDs didn't help, since create-or-move complained that the OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get the 36 OSDs created again, but when the OSDs boot they never start working. The only thing they log is: 2013-06-26 01:00:08.852410 7f17f3f16700 0 -- 0.0.0.0:6801/4767 >> 10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0 l=0).fault with nothing to send, going to standby The internet connection I'm behind is a 3G connection, so I can't go skimming through the logs with debugging at very high levels, but I'm just wondering what this could be? It's obvious that the monitors filling up probably triggered the problem, but I'm now looking at a way to get the OSDs back up again. In the meantime I upgraded all the nodes to 0.61.4, but that didn't change anything. Any ideas on what this might be and how to resolve it? -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
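One thing worth capturing while this is being debugged: how big the monitor stores actually got and whether compaction brought them back down, since the disks filling up is the likely trigger. A quick check on each monitor host (paths are the default layout):

# size of this monitor's leveldb store
du -sh /var/lib/ceph/mon/ceph-a/store.db

# free space on the filesystem holding it
df -h /var/lib/ceph/mon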
Re: [ceph-users] Empty osd and crushmap after mon restart?
Some guesses are inline. On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander wrote: > Hi, > > I'm not sure what happened, but on a Ceph cluster I noticed that the > monitors (running 0.61) started filling up the disks, so they were restarted > with: > > mon compact on start = true > > After a restart the osdmap was empty, it showed: > >osdmap e2: 0 osds: 0 up, 0 in > pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data, 243 > GB used, 66789 GB / 67032 GB avail >mdsmap e1: 0/0/1 up > > This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone. > > I also checked the crushmap, all 36 OSDs were removed, no trace of them. As you guess, this is probably because the disks filled up. It shouldn't be able to happen but we found an edge case where leveldb falls apart; there's a fix for it in the repository now (asserting that we get back what we just wrote) that Sage can talk more about. Probably both disappeared because the monitor got nothing back when reading in the newest OSD Map, and so it's all empty. > "ceph auth list" still showed their keys though. > > Restarting the OSDs didn't help, since create-or-move complained that the > OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get the > 36 OSDs created again, but when the OSDs boot they never start working. > > The only thing they log is: > > 2013-06-26 01:00:08.852410 7f17f3f16700 0 -- 0.0.0.0:6801/4767 >> > 10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0 > l=0).fault with nothing to send, going to standby Are they going up and just sitting idle? This is probably because none of their peers are telling them to be responsible for any placement groups on startup. > The internet connection I'm behind is a 3G connection, so I can't go > skimming through the logs with debugging at very high levels, but I'm just > wondering what this could be? > > It's obvious that the monitors filling up probably triggered the problem, > but I'm now looking at a way to get the OSDs back up again. > > In the meantime I upgraded all the nodes to 0.61.4, but that didn't change > anything. > > Any ideas on what this might be and how to resolve it? At a guess, you can go in and grab the last good version of the OSD Map and inject that back into the cluster, then restart the OSDs? If that doesn't work then we'll need to figure out the right way to kick them into being responsible for their stuff. (First, make sure that when you turn them on they are actually connecting to the monitors.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
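Before attempting any injection it may be worth checking whether older, intact OSD map epochs can still be fetched from the monitors and what they contain; a sketch (the epoch number is a placeholder):

# fetch a specific older osdmap epoch from the monitors
ceph osd getmap 1000 -o osdmap.1000

# inspect it offline to confirm it still lists the 36 OSDs
osdmaptool --print osdmap.1000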
Re: [ceph-users] rbd rm results in osd marked down wrongly with 0.61.3
On Mon, 17 Jun 2013, Sage Weil wrote: > Hi Florian, > > If you can trigger this with logs, we're very eager to see what they say > about this! The http://tracker.ceph.com/issues/5336 bug is open to track > this issue. Downgrading this bug until we hear back. sage > > Thanks! > sage > > > On Thu, 13 Jun 2013, Smart Weblications GmbH - Florian Wiessner wrote: > > > Hi, > > > > Is really no one on the list interrested in fixing this? Or am i the only > > one > > having this kind of bug/problem? > > > > Am 11.06.2013 16:19, schrieb Smart Weblications GmbH - Florian Wiessner: > > > Hi List, > > > > > > i observed that an rbd rm results in some osds mark one osd as > > > down > > > wrongly in cuttlefish. > > > > > > The situation gets even worse if there are more than one rbd rm > > > running > > > in parallel. > > > > > > Please see attached logfiles. The rbd rm command was issued on 20:24:00 > > > via > > > cronjob, 40 seconds later the osd 6 got marked down... > > > > > > > > > ceph osd tree > > > > > > # idweight type name up/down reweight > > > -1 7 pool default > > > -3 7 rack unknownrack > > > -2 1 host node01 > > > 0 1 osd.0 up 1 > > > -4 1 host node02 > > > 1 1 osd.1 up 1 > > > -5 1 host node03 > > > 2 1 osd.2 up 1 > > > -6 1 host node04 > > > 3 1 osd.3 up 1 > > > -7 1 host node06 > > > 5 1 osd.5 up 1 > > > -8 1 host node05 > > > 4 1 osd.4 up 1 > > > -9 1 host node07 > > > 6 1 osd.6 up 1 > > > > > > > > > I have seen some patches to parallelize rbd rm, but i think there must be > > > some > > > other issue, as my clients seem to not be able to do IO when ceph is > > > recovering... I think this has worked better in 0.56.x - there was IO > > > while > > > recovering. > > > > > > I also observed in the log of osd.6 that after heartbeat_map > > > reset_timeout, the > > > osd tries to connect to the other osds, but it retries so fast that you > > > could > > > think this is a DoS attack... > > > > > > > > > Please advise.. > > > > > > > > > -- > > > > Mit freundlichen Gr??en, > > > > Florian Wiessner > > > > Smart Weblications GmbH > > Martinsberger Str. 1 > > D-95119 Naila > > > > fon.: +49 9282 9638 200 > > fax.: +49 9282 9638 205 > > 24/7: +49 900 144 000 00 - 0,99 EUR/Min* > > http://www.smart-weblications.de > > > > -- > > Sitz der Gesellschaft: Naila > > Gesch?ftsf?hrer: Florian Wiessner > > HRB-Nr.: HRB 3840 Amtsgericht Hof > > *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
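One knob that sometimes papers over the flapping while the real cause is investigated: give the heartbeats more slack so a briefly overloaded OSD is not declared down. A sketch; the value is arbitrary and the pre-dumpling injectargs syntax may need adjusting for your version:

# temporarily raise the heartbeat grace period on all OSDs
ceph osd tell \* injectargs '--osd-heartbeat-grace 35'

# or make it persistent in the [osd] section of ceph.conf and restart:
#   osd heartbeat grace = 35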
[ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 20/20 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.c.log --- end dump of recent events --- The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. 
If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly attributed, the views expressed in this email do not necessarily represent the views of the company. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Look like the same error I reported yesterday. Sage is looking at it ? -- Original -- From: "Darryl Bond"; Date: Wed, Jun 26, 2013 10:34 AM To: "ceph-users@lists.ceph.com"; Subject: [ceph-users] One monitor won't start after upgrade from 6.1.3 to6.1.4 Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 20/20 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.c.log --- end dump of recent events --- The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. 
Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly attributed, the views expressed in this email do not necessarily represent the views of the company. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com .___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 20/20 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.c.log --- end dump of recent events --- The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly attributed, the vie
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
FYI. I get the same error with an osd too. -11> 2013-06-25 16:00:37.604042 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5300 osd_ping(ping e2200 stamp 2013-06-25 16:00:37.588367) v2 47+0+0 (3462129666 0 0) 0x4a0ce00 con 0x4a094a0 -10> 2013-06-25 16:00:37.604075 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:37.588367) v2 -- ?+0 0x47196c0 con 0x4a094a0 -9> 2013-06-25 16:00:37.970605 7f0750e18700 10 monclient: tick -8> 2013-06-25 16:00:37.970615 7f0750e18700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-06-25 16:00:07.970614) -7> 2013-06-25 16:00:37.970630 7f0750e18700 10 monclient: renew subs? (now: 2013-06-25 16:00:37.970630; renew after: 2013-06-25 16:02:47.970419) -- no -6> 2013-06-25 16:00:38.626079 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.9 172.18.11.34:0/1788 4862 osd_ping(ping e2200 stamp 2013-06-25 16:00:38.613584) v2 47+0+0 (4007998759 0 0) 0x4efa540 con 0x4f0c580 -5> 2013-06-25 16:00:38.626117 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.34:0/1788 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.613584) v2 -- ?+0 0x4a0ce00 con 0x4f0c580 -4> 2013-06-25 16:00:38.640572 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.0 172.18.11.30:0/10931 5280 osd_ping(ping e2200 stamp 2013-06-25 16:00:38.624922) v2 47+0+0 (350205583 0 0) 0x4acfdc0 con 0x4a09340 -3> 2013-06-25 16:00:38.640606 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10931 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.624922) v2 -- ?+0 0x4efa540 con 0x4a09340 -2> 2013-06-25 16:00:39.304307 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5301 osd_ping(ping e2200 stamp 2013-06-25 16:00:39.288581) v2 47+0+0 (4084422642 0 0) 0x93b8c40 con 0x4a094a0 -1> 2013-06-25 16:00:39.304354 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:39.288581) v2 -- ?+0 0x4acfdc0 con 0x4a094a0 0> 2013-06-25 16:00:39.829601 7f074e512700 -1 os/FileStore.cc: In function 'int FileStore::lfn_find(coll_t, const hobject_t&, IndexedPath*)' thread 7f074e512700 time 2013-06-25 16:00:39.792543 os/FileStore.cc: 166: FAILED assert(!m_filestore_fail_eio || r != -5) ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: (FileStore::lfn_find(coll_t, hobject_t const&, std::tr1::shared_ptr*)+0x109) [0x7df319] 2: (FileStore::lfn_stat(coll_t, hobject_t const&, stat*)+0x55) [0x7e1005] 3: (FileStore::stat(coll_t, hobject_t const&, stat*, bool)+0x51) [0x7ef001] 4: (PG::_scan_list(ScrubMap&, std::vector >&, bool, ThreadPool::TPHandle&)+0x3d1) [0x76e391] 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x174) [0x771344] 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x8a6) [0x772076] 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbd) [0x70f00d] 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x8e384c] 9: (ThreadPool::WorkThread::entry()+0x10) [0x8e4af0] 10: (()+0x7f8e) [0x7f0761dc5f8e] 11: (clone()+0x6d) [0x7f0760077e1d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. -- Original -- From: "Mike Dawson"; Date: Wed, Jun 26, 2013 10:50 AM To: "Darryl Bond"; Cc: "ceph-users@lists.ceph.com"; Subject: Re: [ceph-users] One monitor won't start after upgrade from 6.1.3to 6.1.4 Darryl, I've seen this issue a few times recently. 
I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: > Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had > been successfully upgraded from bobtail to cuttlefish and then from > 6.1.2 to 6.1.3. There have been no changes to ceph.conf. > > Node mon.a upgrade, a,b,c monitors OK after upgrade > Node mon.b upgrade a,b monitors OK after upgrade (note that c
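A note on the OSD crash reported above: that assert (!m_filestore_fail_eio || r != -5) fires when the filestore gets EIO back from the filesystem, so it usually points at the underlying disk rather than at ceph itself. Worth checking the drive before anything else; a sketch (the device name is a placeholder, and smartctl needs smartmontools installed):

# look for I/O errors reported by the kernel
dmesg | grep -i "i/o error"

# check the drive's SMART health
smartctl -a /dev/sdc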
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Thanks for your prompt response. Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated, should I delete it's contents after removing the monitor and before re-adding it? Darryl On 06/26/13 12:50, Mike Dawson wrote: Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly attributed, the views expressed in this email do not necessarily represent the views of the company. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
I've typically moved it off to a non-conflicting path in lieu of deleting it outright, but either way should work. IIRC, I used something like: sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo mkdir /var/lib/ceph/mon/ceph-c - Mike On 6/25/2013 11:08 PM, Darryl Bond wrote: Thanks for your prompt response. Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated, should I delete it's contents after removing the monitor and before re-adding it? Darryl On 06/26/13 12:50, Mike Dawson wrote: Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... 
health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly attributed, the views expressed in this email do not necessarily represent the views of the company. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Nope, same outcome.

[root@ceph3 mon]# ceph mon remove c
removed mon.c at 192.168.6.103:6789/0, there are now 2 monitors
[root@ceph3 mon]# mkdir tmp
[root@ceph3 mon]# ceph auth get mon. -o tmp/keyring
exported keyring for mon.
[root@ceph3 mon]# ceph mon getmap -o tmp/monmap
2013-06-26 13:51:26.640097 7ffb48a12700  0 -- :/24748 >> 192.168.6.103:6789/0 pipe(0x1105350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
got latest monmap
[root@ceph3 mon]# ls -l tmp
total 8
-rw-r--r--. 1 root root  55 Jun 26 13:51 keyring
-rw-r--r--. 1 root root 328 Jun 26 13:51 monmap
[root@ceph3 mon]# ceph-mon -i c --mkfs --monmap tmp/monmap --keyring tmp/keyring
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-c for mon.c
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph mon add c 192.168.6.103:6789
mon c 192.168.6.103:6789/0 already exists
[root@ceph3 mon]# ceph status
2013-06-26 13:53:58.401436 7f0dd653d700  0 -- :/25695 >> 192.168.6.103:6789/0 pipe(0x108e350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
   health HEALTH_WARN 1 mons down, quorum 0,1 a,b
   monmap e3: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14228, quorum 0,1 a,b
   osdmap e1342: 18 osds: 18 up, 18 in
   pgmap v4060824: 5448 pgs: 5448 active+clean; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 2983KB/s rd, 1217KB/s wr, 552op/s
   mdsmap e1: 0/0/1 up
[root@ceph3 mon]# service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[25887]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph-mon -i c --public-addr 192.168.6.103:6789
[26768]: (33) Numerical argument out of domain

On 06/26/13 13:19, Mike Dawson wrote:

I've typically moved it off to a non-conflicting path in lieu of deleting it outright, but either way should work. IIRC, I used something like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote:

Thanks for your prompt response. Given that my mon.c directory /var/lib/ceph/mon/ceph-c is currently populated, should I delete its contents after removing the monitor and before re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:

Darryl,

I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (any news, Joao?). Others have run into it too.

Look closely at:
http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have a reliable quorum between a and b; that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors
then
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors

- Mike

On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 0.61.3 to 0.61.4 with 3 monitors. The cluster had been successfully upgraded from bobtail to cuttlefish and then from 0.61.2 to 0.61.3. There have been no changes to ceph.conf.

Node mon.a: upgraded; a, b, c monitors OK after the upgrade.
Node mon.b: upgraded; a, b monitors OK after the upgrade (note that c was not available, even though I hadn't touched it).
Node mon.c: very slow to install the upgrade; RAM was tight for some reason and the mon process was using half the RAM.
Node mon.c: shut down mon.c.
Node mon.c: performed the upgrade.
Node mon.c: restarted ceph; mon.c will not start.

service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

   health HEALTH_WARN 1 mons down, quorum 0,1 a,b
   monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b
   osdmap e1342: 18 osds: 18 up, 18 in
   pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
   mdsmap e1: 0/0/1 up

Set debug mon = 20. Nothing is going into the logs other than the assertion:

--- begin dump of recent events ---
     0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) **
 in thread 7fd5e81b57c0

 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
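One way to sanity-check a rebuilt monitor before starting the daemon is to look at what actually went into the new store. This is only a minimal sketch, assuming the same tmp/ directory and /var/lib/ceph/mon/ceph-c paths used in the transcript above, and that monmaptool from the same ceph release is installed; the exact output format may differ between versions.

# Print the monmap that ceph-mon --mkfs was seeded with; mon.c should
# appear in it at 192.168.6.103:6789 before the daemon is started.
monmaptool --print tmp/monmap

# Double-check that the freshly created monitor store is in place
# (a new monfs should contain at least a keyring and a store.db).
ls -l /var/lib/ceph/mon/ceph-c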
Re: [ceph-users] Issues going from 1 to 3 mons
On Tue, Jun 25, 2013 at 02:24:35PM +0100, Joao Eduardo Luis wrote:
> (Re-adding the list for future reference)
>
> Wolfgang, from your log file:
>
> 2013-06-25 14:58:39.739392 7fa329698780 -1 common/config.cc: In
> function 'void md_config_t::set_val_or_die(const char*, const
> char*)' thread 7fa329698780 time 2013-06-25 14:58:39.738501
> common/config.cc: 621: FAILED assert(ret == 0)
>
> ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
> 1: /usr/bin/ceph-mon() [0x660736]
> 2: /usr/bin/ceph-mon() [0x699d66]
> 3: (pick_addresses(CephContext*)+0x93) [0x69a1a3]
> 4: (main()+0x1e3f) [0x48256f]
> 5: (__libc_start_main()+0xed) [0x7fa3278f576d]
> 6: /usr/bin/ceph-mon() [0x4848bd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> This was initially reported on ticket #5205. Sage fixed it last
> night, for ticket #5195. Gary reports it fixed using Sage's patch,
> and said fix was backported to the cuttlefish branch.
>
> It's worth mentioning that the cuttlefish branch also contains a
> couple of commits that should boost monitor performance and avoid
> leveldb hangups.
>
> Looking into #5195 (http://tracker.ceph.com/issues/5195) for more
> info is advised. Let us know if you decide to try the cuttlefish
> branch (on the monitors) and whether it fixes the issue for you.

Hi Joao,

thank you for looking into this. I hope to be able to try the latest cuttlefish branch, but my time is quite constrained at the moment, so I can't guarantee it. Either way, I assume the fix will be in the next stable release of cuttlefish, which is great. Thank you.

> Thanks!
>
> -Joao

Thank you,
wogri

--
http://www.wogri.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
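For anyone chasing the same assert, a rough sketch of two checks that tend to shorten the loop, assuming a cuttlefish-era ceph-mon; the monitor id 'b' is only an example and not tied to any particular cluster.

# Run the monitor in the foreground, logging to stderr, with verbose
# monitor debugging; a startup assert like the one quoted above is then
# printed straight to the terminal instead of only to the log file.
ceph-mon -i b -d --debug-mon 20

# After installing packages from the cuttlefish branch, confirm the
# binaries were really replaced before restarting the other monitors.
ceph-mon --version
ceph --version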
Re: [ceph-users] One monitor won't start after upgrade from 0.61.3 to 0.61.4
Got it going. This helped: http://tracker.ceph.com/issues/5205

My ceph.conf had the cluster and public networks defined in [global]. I commented them out and mon.c started successfully.

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
# public network = 192.168.6.0/24
# cluster network = 10.6.0.0/16

# ceph status
   health HEALTH_OK
   monmap e3: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14230, quorum 0,1,2 a,b,c
   osdmap e1538: 18 osds: 17 up, 17 in
   pgmap v4064405: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5829 GB data, 11691 GB used, 34989 GB / 46681 GB avail; 328B/s rd, 816KB/s wr, 135op/s
   mdsmap e1: 0/0/1 up

Looks like there is a fix on the way.

Darryl

On 06/26/13 13:58, Darryl Bond wrote:

Nope, same outcome.

[root@ceph3 mon]# ceph mon remove c
removed mon.c at 192.168.6.103:6789/0, there are now 2 monitors
[root@ceph3 mon]# mkdir tmp
[root@ceph3 mon]# ceph auth get mon. -o tmp/keyring
exported keyring for mon.
[root@ceph3 mon]# ceph mon getmap -o tmp/monmap
2013-06-26 13:51:26.640097 7ffb48a12700  0 -- :/24748 >> 192.168.6.103:6789/0 pipe(0x1105350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
got latest monmap
[root@ceph3 mon]# ls -l tmp
total 8
-rw-r--r--. 1 root root  55 Jun 26 13:51 keyring
-rw-r--r--. 1 root root 328 Jun 26 13:51 monmap
[root@ceph3 mon]# ceph-mon -i c --mkfs --monmap tmp/monmap --keyring tmp/keyring
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-c for mon.c
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph mon add c 192.168.6.103:6789
mon c 192.168.6.103:6789/0 already exists
[root@ceph3 mon]# ceph status
2013-06-26 13:53:58.401436 7f0dd653d700  0 -- :/25695 >> 192.168.6.103:6789/0 pipe(0x108e350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
   health HEALTH_WARN 1 mons down, quorum 0,1 a,b
   monmap e3: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14228, quorum 0,1 a,b
   osdmap e1342: 18 osds: 18 up, 18 in
   pgmap v4060824: 5448 pgs: 5448 active+clean; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 2983KB/s rd, 1217KB/s wr, 552op/s
   mdsmap e1: 0/0/1 up
[root@ceph3 mon]# service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[25887]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph-mon -i c --public-addr 192.168.6.103:6789
[26768]: (33) Numerical argument out of domain

On 06/26/13 13:19, Mike Dawson wrote:

I've typically moved it off to a non-conflicting path in lieu of deleting it outright, but either way should work. IIRC, I used something like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote:

Thanks for your prompt response. Given that my mon.c directory /var/lib/ceph/mon/ceph-c is currently populated, should I delete its contents after removing the monitor and before re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:

Darryl,

I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (any news, Joao?). Others have run into it too.

Look closely at:
http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have a reliable quorum between a and b; that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors
then
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors

- Mike

On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 0.61.3 to 0.61.4 with 3 monitors. The cluster had been successfully upgraded from bobtail to cuttlefish and then from 0.61.2 to 0.61.3. There have been no changes to ceph.conf.

Node mon.a: upgraded; a, b, c monitors OK after the upgrade.
Node mon.b: upgraded; a, b monitors OK after the upgrade (note that c was not available, even though I hadn't touched it).
Node mon.c: very slow to install the upgrade; RAM was tight for some reason and the mon process was using half the RAM.
Node mon.c: shut down mon.c.
Node mon.c: performed the upgrade.
Node mon.c: restarted ceph; mon.c will not start.

service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numeric