Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Wolfgang Hennerbichler


On 06/24/2013 07:50 PM, Gregory Farnum wrote:
> On Mon, Jun 24, 2013 at 10:36 AM, Jeppesen, Nelson
>  wrote:
>> What do you mean ‘bring up the second monitor with enough information’?
>>
>>
>>
>> Here are the basic steps I took. It fails on step 4. If I skip step 4, I get
>> a number out of range error.
>>
>>
>>
>> 1.  ceph auth get mon. -o /tmp/auth
>>
>> 2.  ceph mon getmap -o /tmp/map
>>
>> 3.  sudo ceph-mon -i 1 --mkfs --monmap /tmp/map --keyring /tmp/auth
>>
>> 4.  ceph mon add 1 [<ip>:<port>]
> 
> What's the failure here? Does it not return, or does it stop working
> after that? I'd expect that following it with

it does not return. I just ran into the same issue:

# ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
--public-addr x.y.z.b:6789
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b
# ceph mon add b 46.20.16.22:6789

2013-06-25 10:00:25.659006 7f28ec5fa700  0 monclient: hunting for new mon

just sits there forever.
On mon a I see:

#  ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
  "rank": 0,
  "state": "probing",
  "election_epoch": 1,
  "quorum": [],
  "outside_quorum": [
"a"],
  "extra_probe_peers": [],
  "monmap": { "epoch": 14,
  "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
  "modified": "2013-06-25 10:00:14.004097",
  "created": "2013-06-24 15:06:08.472355",
  "mons": [
{ "rank": 0,
  "name": "a",
  "addr": "46.20.16.21:6789\/0"},
{ "rank": 1,
  "name": "b",
  "addr": "46.20.16.22:6789\/0"}]}}

it seems the docs here:
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
are misleading. I also just can't extend my mons (cuttlefish 0.61.4)
currently, which is bad. ceph-deploy complains about missing a fsid...
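
For reference, the sequence those docs describe boils down to roughly the
following (monitor name and address are illustrative); step 4 is the one
that hangs here:

  ceph auth get mon. -o /tmp/keyring
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
  ceph mon add b 192.0.2.22:6789        # this is the step that never returns
  ceph-mon -i b --public-addr 192.0.2.22:6789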

>> 5.  ceph-mon -i 1 --public-addr {ip:port}
> 
> should work...
> 
> Oh, I think I see — mon 1 is starting up and not seeing itself in the
> monmap so it then shuts down. You'll need to convince it to turn on
> and contact mon.0; I don't remember exactly how to do that (Joao?) but
> I think you should be able to find what you need at
> http://ceph.com/docs/master/dev/mon-bootstrap
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Joao Eduardo Luis

On 06/25/2013 09:06 AM, Wolfgang Hennerbichler wrote:



On 06/24/2013 07:50 PM, Gregory Farnum wrote:

On Mon, Jun 24, 2013 at 10:36 AM, Jeppesen, Nelson
 wrote:

What do you mean ‘bring up the second monitor with enough information’?



Here are the basic steps I took. It fails on step 4. If I skip step 4, I get
a number out of range error.



1.  ceph auth get mon. -o /tmp/auth

2.  ceph mon getmap -o /tmp/map

3.  sudo ceph-mon -i 1 --mkfs --monmap /tmp/map --keyring /tmp/auth

4.  ceph mon add 1 [<ip>:<port>]


What's the failure here? Does it not return, or does it stop working
after that? I'd expect that following it with


it does not return. I just ran into the same issue:

# ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
--public-addr x.y.z.b:6789
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b
# ceph mon add b 46.20.16.22:6789

2013-06-25 10:00:25.659006 7f28ec5fa700  0 monclient: hunting for new mon

just sits there forever.
On mon a I see:

#  ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
   "rank": 0,
   "state": "probing",
   "election_epoch": 1,
   "quorum": [],
   "outside_quorum": [
 "a"],
   "extra_probe_peers": [],
   "monmap": { "epoch": 14,
   "fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
   "modified": "2013-06-25 10:00:14.004097",
   "created": "2013-06-24 15:06:08.472355",
   "mons": [
 { "rank": 0,
   "name": "a",
   "addr": "46.20.16.21:6789\/0"},
 { "rank": 1,
   "name": "b",
   "addr": "46.20.16.22:6789\/0"}]}}


What happens when you run the same command for mon.b ?

  -Joao




it seems the docs here:
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
are misleading. I also just can't extend my mon's (cuttlefish 0.61.4)
currently, which is bad. ceph-deploy complains about missing a fsid...


5.  ceph-mon -i 1 --public-addr {ip:port}


should work...

Oh, I think I see — mon 1 is starting up and not seeing itself in the
monmap so it then shuts down. You'll need to convince it to turn on
and contact mon.0; I don't remember exactly how to do that (Joao?) but
I think you should be able to find what you need at
http://ceph.com/docs/master/dev/mon-bootstrap
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Wolfgang Hennerbichler


On 06/25/2013 11:45 AM, Joao Eduardo Luis wrote:

>> On mon a I see:
>>
>> #  ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
>> { "name": "a",
>>"rank": 0,
>>"state": "probing",
>>"election_epoch": 1,
>>"quorum": [],
>>"outside_quorum": [
>>  "a"],
>>"extra_probe_peers": [],
>>"monmap": { "epoch": 14,
>>"fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
>>"modified": "2013-06-25 10:00:14.004097",
>>"created": "2013-06-24 15:06:08.472355",
>>"mons": [
>>  { "rank": 0,
>>"name": "a",
>>"addr": "46.20.16.21:6789\/0"},
>>  { "rank": 1,
>>"name": "b",
>>"addr": "46.20.16.22:6789\/0"}]}}
> 
> What happens when you run the same command for mon.b ?

# ceph mon add b x.y.z.b:6789
^Z
[1]+  Stopped ceph mon add b x.y.z.b:6789
root@rd-clusternode22:/etc/ceph# bg
[1]+ ceph mon add b x.y.x.b:6789 &
# 2013-06-25 11:48:56.136659 7f5b419a3700  0 monclient: hunting for new mon

# ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
connect to /run/ceph/ceph-mon.a.asok failed with (2) No such file or
directory

it can't be started and isn't running, so I guess that's why we don't
get anything back from the socket here...

>   -Joao

wogri_risc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw and Cors

2013-06-25 Thread Fabio - NS3 srl
We have a working Ceph cluster running 0.61.4, and we are trying to use some
example applications [1,2] to test direct upload to radosgw using CORS.


With a patched boto [3] we are able to get and set the XML CORS configuration
on a bucket. However, when using one of the apps, Chrome gives us an
Access-Control-Allow-Origin error, even though, as far as I can understand
from the radosgw logs, everything seems to work fine. Can you give me some hints?


These are the radosgw logs when using Frantic-S3-Browser [4] and s3staticuploader
[5], two really simple apps we use to test.



[1] https://github.com/frc/Frantic-S3-Browser
[2] https://github.com/thrashr888/s3staticuploader
[3] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-June/002104.html

[4] http://pastebin.com/Zmq9gfQ1
[5] http://pastebin.com/6KGzMb5K
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Joao Eduardo Luis

On 06/25/2013 10:52 AM, Wolfgang Hennerbichler wrote:



On 06/25/2013 11:45 AM, Joao Eduardo Luis wrote:


On mon a I see:

#  ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
"rank": 0,
"state": "probing",
"election_epoch": 1,
"quorum": [],
"outside_quorum": [
  "a"],
"extra_probe_peers": [],
"monmap": { "epoch": 14,
"fsid": "61ebf2c4-5290-4fbb-8a84-bc8797351bf8",
"modified": "2013-06-25 10:00:14.004097",
"created": "2013-06-24 15:06:08.472355",
"mons": [
  { "rank": 0,
"name": "a",
"addr": "46.20.16.21:6789\/0"},
  { "rank": 1,
"name": "b",
"addr": "46.20.16.22:6789\/0"}]}}


What happens when you run the same command for mon.b ?


# ceph mon add b x.y.z.b:6789
^Z
[1]+  Stopped ceph mon add b x.y.z.b:6789
root@rd-clusternode22:/etc/ceph# bg
[1]+ ceph mon add b x.y.x.b:6789 &
# 2013-06-25 11:48:56.136659 7f5b419a3700  0 monclient: hunting for new mon

# ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
connect to /run/ceph/ceph-mon.a.asok failed with (2) No such file or
directory

it can't be started and isn't running, so I guess that's why we wouldn't
get anything from socket back here...



Wolfgang, can you set 'debug mon = 20', rerun the monitor and then send
the log my way so I can take a look at why the monitor is not starting?
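
A minimal way to do that (monitor name and address illustrative) is either to
add the option to ceph.conf and restart the daemon:

  [mon.b]
      debug mon = 20

or to pass it for a one-off run:

  ceph-mon -i b --public-addr x.y.z.b:6789 --debug-mon 20

The log should then end up in /var/log/ceph/ceph-mon.b.log by default.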


  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replication between 2 datacenter

2013-06-25 Thread Joachim . Tork
hi folks,

I have a question concerning data replication using the crushmap.

Is it possible to write a crushmap to achieve a 2 times 2 replication, in the
way that I have a pool replication in one data center and an overall
replication of this in the backup datacenter?

Best regards

Joachim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] increasing stability

2013-06-25 Thread Wolfgang Hennerbichler


On 05/30/2013 11:06 PM, Sage Weil wrote:
> Hi everyone,

Hi again,

> I wanted to mention just a few things on this thread.

Thank you for taking the time.

> The first is obvious: we are extremely concerned about stability.  
> However, Ceph is a big project with a wide range of use cases, and it is 
> difficult to cover them all.  For that reason, Inktank is (at least for 
> the moment) focusing in specific areas (rados, librbd, rgw) and certain 
> platforms.  We have a number of large production customers and 
> non-customers now who have stable environments, and we are committed to a 
> solid experience for them.

And I really appreciate that.

> We are investing heavily in testing infrastructure and automation tools to 
> maximize our ability to test with limited resources.  Our lab is currently 
> around 14 racks, with most of the focus now on utilizing those resources 
> as effectively as possible.  The teuthology testing framework continues to 
> evolve and our test suites continue to grow.  Unfortunately, this has been 
> an area where it has been difficult for others to contribute.  We are 
> eager to talk to anyone who is interested in helping.

what we as a community can do is provide feedback with our test-cases.
and I think you're doing a great job of supporting the community.

> Overall, the cuttlefish release has gone much more smoothly than bobtail 
> did.  That said, there are a few lingering problems, particularly with the 
> monitor's use of leveldb.  We're waiting on some QA on the pending fixes 
> now before we push out a 0.61.3 that I believe will resolve the remaining 
> problems for most users.

I upgraded to 0.61.4 on a production system today, and it all went
smoothly. I was really nervous things could blow up.
I can't add monitors though. I have another thread going on, so don't
bother. What I want to say is: This needs to work. In my mind the mon
issues must all be fixed. If I were Inktank I would freeze all further
features, and fix all bugs (I know this is boring, but
business-critical) until ceph gets so stable that there are no more
complaints by users. You are so close.
Right now when I promote ceph and people ask me: but is it stable? I
still have to say: It's almost there.

> However, as overall adoption of ceph increases, we move past the critical 
> bugs and start seeing a larger number of "long-tail" issues that affect 
> smaller sets of users.  Overall this is a good thing, even if it means a 
> harder job for the engineers to triage and track down obscure problems. 

I realize this is very hard, and maybe very boring.

> The mailing list is going to attract a high number of bug reports because 
> that's what it is for.  Although we believe the quality is getting better 
> based on our internal testing and our commercial interactions, we'd like 
> to turn this into a more metrics driven analysis.  We welcome any ideas on 
> how to do this, as the obvious ideas (like counting bugs) tend to scale 
> with the number of users, and we have no way of telling how many users 
> there really are.

I really want to see you succeed big time. Ceph is one of the best
things I have come across in a long time. I don't want to tell
you what to do, because you know it better than I do. All I am saying
is: If you make it very robust, people will not stop buying support
contracts.

> Thanks-
> sage

Thank you, sage. We all owe you more than a 'thank you'.
wogri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Joao Eduardo Luis

(Re-adding the list for future reference)

Wolfgang, from your log file:

2013-06-25 14:58:39.739392 7fa329698780 -1 common/config.cc: In function 
'void md_config_t::set_val_or_die(const char*, const char*)' thread 
7fa329698780 time 2013-06-25 14:58:39.738501

common/config.cc: 621: FAILED assert(ret == 0)

 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
 1: /usr/bin/ceph-mon() [0x660736]
 2: /usr/bin/ceph-mon() [0x699d66]
 3: (pick_addresses(CephContext*)+0x93) [0x69a1a3]
 4: (main()+0x1e3f) [0x48256f]
 5: (__libc_start_main()+0xed) [0x7fa3278f576d]
 6: /usr/bin/ceph-mon() [0x4848bd]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


This was initially reported on ticket #5205.  Sage fixed it last night, 
for ticket #5195.  Gary reports it fixed using Sage's patch, and said 
fix was backported to the cuttlefish branch.


It's worth mentioning that the cuttlefish branch also contains a couple 
of commits that should boost monitor performance and avoid leveldb hangups.


Looking into #5195 (http://tracker.ceph.com/issues/5195) for more info 
is advised.  Let us know if you decide to try the cuttlefish branch (on 
the monitors) and whether it fixes the issue for you.


Thanks!

  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive replacement procedure

2013-06-25 Thread Dave Spano
Sorry, I forgot to mention ceph osd set noout. Sébastien Han wrote a blog post 
about it. 

http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/ 
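
The rough shape of that kind of maintenance (OSD id illustrative):

  ceph osd set noout            # keep the OSD "in" while it is down, so no rebalancing starts
  service ceph stop osd.12      # stop the OSD whose drive is being replaced
  # ... swap the drive, recreate the filesystem and OSD data dir ...
  service ceph start osd.12
  ceph osd unset noout          # re-enable normal failure handling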


Dave Spano 
Optogenics 


- Original Message -

From: "Michael Lowe"  
To: "Nigel Williams"  
Cc: ceph-users@lists.ceph.com 
Sent: Monday, June 24, 2013 7:41:02 PM 
Subject: Re: [ceph-users] Drive replacement procedure 

That's where 'ceph osd set noout' comes in handy. 



On Jun 24, 2013, at 7:28 PM, Nigel Williams  wrote: 

> On 25/06/2013 5:59 AM, Brian Candler wrote: 
>> On 24/06/2013 20:27, Dave Spano wrote: 
>>> Here's my procedure for manually adding OSDs. 
> 
> The other thing I discovered is not to wait between steps; some changes 
> result in a new crushmap, that then triggers replication. You want to speed 
> through the steps so the cluster does not waste time moving objects around to 
> meet the replica requirements until you have finished crushmap changes. 
> 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [WRN] 1 slow requests on VM with large disks

2013-06-25 Thread Harald Rößler
Hi,

at the moment I have a little problem when it comes to a remapping
in the Ceph cluster. In VMs with large disks (4 TB each) the guest
operating system freezes. The freeze is always accompanied by the
message "[WRN] 1 slow requests". At the moment, bobtail is installed.
Does anyone have an idea how I can avoid the freeze of the OS?
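
(Not an answer from this thread, but one knob that is commonly tried here is
throttling recovery/backfill so client I/O is not starved while the cluster
remaps; a rough sketch, values illustrative:

  ceph osd tell \* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

or, persistently, "osd max backfills" / "osd recovery max active" in the [osd]
section of ceph.conf.)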

Thanks and Regards
Harald Roessler
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication between 2 datacenter

2013-06-25 Thread Sage Weil
On Tue, 25 Jun 2013, joachim.t...@gad.de wrote:
> hi folks,
> 
> i have a question concerning data replication using the crushmap.
> 
> Is it possible to write a crushmap to achieve a 2 times 2 replication in the
> way that I have a pool replication in one data center and an overall replication
> of this in the backup datacenter?

Do you mean 2 replicas in datacenter A, and 2 more replicas in datacenter 
B?

Short answer: yes, but replication is synchronous, so it will generally 
only work well if the latency is low between the two sites.
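
A minimal sketch of such a rule, assuming the CRUSH map already contains two
datacenter buckets named dc-a and dc-b (names illustrative) with the hosts
beneath them:

  rule replicated_2x2 {
          ruleset 1
          type replicated
          min_size 4
          max_size 4
          step take dc-a
          step chooseleaf firstn 2 type host
          step emit
          step take dc-b
          step chooseleaf firstn 2 type host
          step emit
  }

combined with a pool size of 4 (ceph osd pool set <pool> size 4) and the pool
pointed at the rule (ceph osd pool set <pool> crush_ruleset 1).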

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v0.65 released

2013-06-25 Thread Sage Weil
Our next development release v0.65 is out, with a few big changes. First 
and foremost, this release includes a complete revamp of the architecture 
for the command line interface in order to lay the groundwork for our 
ongoing REST management API work. The 'ceph' command line tool is now a 
thin python wrapper around librados. Note that this set of changes 
includes several small incompatible changes in the interface that tools or 
scripts utilizing the CLI should be aware of; these are detailed in the 
complete release notes.

Other notable changes:
 * mon, ceph: huge revamp of CLI and internal admin API. (Dan Mick)
 * mon: new capability syntax
 * osd: do not use fadvise(DONTNEED) on XFS (data corruption on power 
   cycle)
 * osd: recovery and peering performance improvements
 * osd: new writeback throttling (for less bursty write performance) (Sam 
   Just)
 * osd: ping/heartbeat on public and private interfaces
 * osd: avoid osd flapping from asymmetric network failure
 * osd: re-use partially deleted PG contents when present (Sam Just)
 * osd: break blacklisted client watches (David Zafman)
 * mon: many stability fixes (Joao Luis)
 * mon, osd: many memory leaks fixed
 * mds: misc stability fixes (Yan, Zheng, Greg Farnum)
 * mds: many backpointer improvements (Yan, Zheng)
 * mds: new robust open-by-ino support (Yan, Zheng)
 * ceph-fuse, libcephfs: fix a few caps revocation bugs
 * librados: new calls to administer the cluster
 * librbd: locking tests (Josh Durgin)
 * ceph-disk: improved handling of odd device names
 * ceph-disk: many fixes for RHEL/CentOS, Fedora, wheezy
 * many many fixes from static code analysis (Danny Al-Gaaf)
 * daemons: create /var/run/ceph as needed

The complete release notes, including upgrade notes, can be found at:

   http://ceph.com/docs/master/release-notes/#v0-65

We have one more sprint to go before the Dumpling feature freeze. Big 
items include monitor performance and stability improvements and 
multi-site and disaster recovery features for radosgw. Lots of radosgw work has 
already appeared in rgw-next, but these changes will not land until v0.67.

You can get v0.65 from the usual locations:
 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.65.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Empty osd and crushmap after mon restart?

2013-06-25 Thread Wido den Hollander

Hi,

I'm not sure what happened, but on a Ceph cluster I noticed that the 
monitors (running 0.61) started filling up the disks, so they were 
restarted with:


mon compact on start = true

After a restart the osdmap was empty, it showed:

   osdmap e2: 0 osds: 0 up, 0 in
pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data, 
243 GB used, 66789 GB / 67032 GB avail

   mdsmap e1: 0/0/1 up

This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.

I also checked the crushmap, all 36 OSDs were removed, no trace of them.

"ceph auth list" still showed their keys though.

Restarting the OSDs didn't help, since create-or-move complained that 
the OSDs didn't exist and didn't do anything. I ran "ceph osd create" to 
get the 36 OSDs created again, but when the OSDs boot they never start 
working.


The only thing they log is:

2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >> 
10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0 
l=0).fault with nothing to send, going to standby


The internet connection I'm behind is a 3G connection, so I can't go 
skimming through the logs with debugging at very high levels, but I'm 
just wondering what this could be?


It's obvious that the monitors filling up probably triggered the 
problem, but I'm now looking at a way to get the OSDs back up again.


In the meantime I upgraded all the nodes to 0.61.4, but that didn't 
change anything.


Any ideas on what this might be and how to resolve it?

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Empty osd and crushmap after mon restart?

2013-06-25 Thread Gregory Farnum
Some guesses are inline.

On Tue, Jun 25, 2013 at 4:06 PM, Wido den Hollander  wrote:
> Hi,
>
> I'm not sure what happened, but on a Ceph cluster I noticed that the
> monitors (running 0.61) started filling up the disks, so they were restarted
> with:
>
> mon compact on start = true
>
> After a restart the osdmap was empty, it showed:
>
>osdmap e2: 0 osds: 0 up, 0 in
> pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data, 243
> GB used, 66789 GB / 67032 GB avail
>mdsmap e1: 0/0/1 up
>
> This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.
>
> I also checked the crushmap, all 36 OSDs were removed, no trace of them.

As you guess, this is probably because the disks filled up. It
shouldn't be able to happen but we found an edge case where leveldb
falls apart; there's a fix for it in the repository now (asserting
that we get back what we just wrote) that Sage can talk more about.
Probably both disappeared because the monitor got nothing back when
reading in the newest OSD Map, and so it's all empty.

> "ceph auth list" still showed their keys though.
>
> Restarting the OSDs didn't help, since create-or-move complained that the
> OSDs didn't exist and didn't do anything. I ran "ceph osd create" to get the
> 36 OSDs created again, but when the OSDs boot they never start working.
>
> The only thing they log is:
>
> 2013-06-26 01:00:08.852410 7f17f3f16700  0 -- 0.0.0.0:6801/4767 >>
> 10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
> l=0).fault with nothing to send, going to standby

Are they going up and just sitting idle? This is probably because none
of their peers are telling them to be responsible for any placement
groups on startup.

> The internet connection I'm behind is a 3G connection, so I can't go
> skimming through the logs with debugging at very high levels, but I'm just
> wondering what this could be?
>
> It's obvious that the monitors filling up probably triggered the problem,
> but I'm now looking at a way to get the OSDs back up again.
>
> In the meantime I upgraded all the nodes to 0.61.4, but that didn't change
> anything.
>
> Any ideas on what this might be and how to resolve it?

At a guess, you can go in and grab the last good version of the OSD
Map and inject that back into the cluster, then restart the OSDs? If
that doesn't work then we'll need to figure out the right way to kick
them into being responsible for their stuff.
(First, make sure that when you turn them on they are actually
connecting to the monitors.)
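
(A quick way to sanity-check that, daemon id and paths illustrative:

  netstat -tnp | grep 6789          # is the ceph-osd process connected to a mon?
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version    # is the daemon alive and responding?
  ceph osd dump | grep osd.0        # does the cluster see the OSD as up/in?
)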
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd rm results in osd marked down wrongly with 0.61.3

2013-06-25 Thread Sage Weil
On Mon, 17 Jun 2013, Sage Weil wrote:
> Hi Florian,
> 
> If you can trigger this with logs, we're very eager to see what they say 
> about this!  The http://tracker.ceph.com/issues/5336 bug is open to track 
> this issue.

Downgrading this bug until we hear back.

sage

> 
> Thanks!
> sage
> 
> 
> On Thu, 13 Jun 2013, Smart Weblications GmbH - Florian Wiessner wrote:
> 
> > Hi,
> > 
> > Is really no one on the list interrested in fixing this? Or am i the only 
> > one
> > having this kind of bug/problem?
> > 
> > Am 11.06.2013 16:19, schrieb Smart Weblications GmbH - Florian Wiessner:
> > > Hi List,
> > > 
> > > i observed that an rbd rm  results in some osds mark one osd as 
> > > down
> > > wrongly in cuttlefish.
> > > 
> > > The situation gets even worse if there are more than one rbd rm  
> > > running
> > > in parallel.
> > > 
> > > Please see attached logfiles. The rbd rm command was issued on 20:24:00 
> > > via
> > > cronjob, 40 seconds later the osd 6 got marked down...
> > > 
> > > 
> > > ceph osd tree
> > > 
> > > # id  weight  type name   up/down reweight
> > > -1  7   pool default
> > > -3  7   rack unknownrack
> > > -2  1   host node01
> > > 0   1   osd.0   up  1
> > > -4  1   host node02
> > > 1   1   osd.1   up  1
> > > -5  1   host node03
> > > 2   1   osd.2   up  1
> > > -6  1   host node04
> > > 3   1   osd.3   up  1
> > > -7  1   host node06
> > > 5   1   osd.5   up  1
> > > -8  1   host node05
> > > 4   1   osd.4   up  1
> > > -9  1   host node07
> > > 6   1   osd.6   up  1
> > > 
> > > 
> > > I have seen some patches to parallelize rbd rm, but i think there must be 
> > > some
> > > other issue, as my clients seem to not be able to do IO when ceph is
> > > recovering... I think this has worked better in 0.56.x - there was IO 
> > > while
> > > recovering.
> > > 
> > > I also observed in the log of osd.6 that after heartbeat_map 
> > > reset_timeout, the
> > > osd tries to connect to the other osds, but it retries so fast that you 
> > > could
> > > think this is a DoS attack...
> > > 
> > > 
> > > Please advise..
> > > 
> > 
> > 
> > -- 
> > 
> > Kind regards,
> > 
> > Florian Wiessner
> > 
> > Smart Weblications GmbH
> > Martinsberger Str. 1
> > D-95119 Naila
> > 
> > fon.: +49 9282 9638 200
> > fax.: +49 9282 9638 205
> > 24/7: +49 900 144 000 00 - 0,99 EUR/Min*
> > http://www.smart-weblications.de
> > 
> > --
> > Registered office: Naila
> > Managing director: Florian Wiessner
> > HRB no.: HRB 3840, Amtsgericht Hof
> > *from a German landline; prices from mobile networks may differ
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Darryl Bond

Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

   health HEALTH_WARN 1 mons down, quorum 0,1 a,b
   monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
   osdmap e1342: 18 osds: 18 up, 18 in
pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
   mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than assertion:
--- begin dump of recent events ---
 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
 in thread 7fd5e81b57c0

 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
 1: /usr/bin/ceph-mon() [0x596fe2]
 2: (()+0xf000) [0x7fd5e782]
 3: (gsignal()+0x35) [0x7fd5e619fba5]
 4: (abort()+0x148) [0x7fd5e61a1358]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
 6: (()+0x5eeb6) [0x7fd5e6a97eb6]
 7: (()+0x5eee3) [0x7fd5e6a97ee3]
 8: (()+0x5f10e) [0x7fd5e6a9810e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x40a) [0x64a6aa]
 10: /usr/bin/ceph-mon() [0x65f916]
 11: /usr/bin/ceph-mon() [0x6960e9]
 12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
 13: (main()+0x1a8a) [0x49786a]
 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
 15: /usr/bin/ceph-mon() [0x499a69]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
  20/20 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mon.c.log
--- end dump of recent events ---


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Da Chun
Looks like the same error I reported yesterday. Sage is looking at it?


-- Original --
From:  "Darryl Bond";
Date:  Wed, Jun 26, 2013 10:34 AM
To:  "ceph-users@lists.ceph.com"; 

Subject:  [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4



Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

health HEALTH_WARN 1 mons down, quorum 0,1 a,b
monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
osdmap e1342: 18 osds: 18 up, 18 in
 pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than assertion--- begin dump of recent
events ---
  0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
  in thread 7fd5e81b57c0

  ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
  1: /usr/bin/ceph-mon() [0x596fe2]
  2: (()+0xf000) [0x7fd5e782]
  3: (gsignal()+0x35) [0x7fd5e619fba5]
  4: (abort()+0x148) [0x7fd5e61a1358]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
  6: (()+0x5eeb6) [0x7fd5e6a97eb6]
  7: (()+0x5eee3) [0x7fd5e6a97ee3]
  8: (()+0x5f10e) [0x7fd5e6a9810e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x40a) [0x64a6aa]
  10: /usr/bin/ceph-mon() [0x65f916]
  11: /usr/bin/ceph-mon() [0x6960e9]
  12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
  13: (main()+0x1a8a) [0x49786a]
  14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
  15: /usr/bin/ceph-mon() [0x499a69]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
   20/20 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent 1
   max_new 1000
   log_file /var/log/ceph/ceph-mon.c.log
--- end dump of recent events ---


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
.___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Mike Dawson

Darryl,

I've seen this issue a few times recently. I believe Joao was looking 
into it at one point, but I don't know if it has been resolved (Any news 
Joao?). Others have run into it too. Look closely at:


http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b, that's good. 
The workaround that has worked for me is to remove mon.c, then re-add 
it. Assuming your monitor leveldb stores aren't too large, the process 
is rather quick. Follow the instructions at:


http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors

then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors
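
Roughly, that remove-and-re-add sequence looks like this (monitor name and
address illustrative, following the docs above):

  ceph mon remove c
  # on the node that hosted mon.c, after moving the old /var/lib/ceph/mon/ceph-c out of the way:
  ceph auth get mon. -o /tmp/keyring
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i c --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
  ceph mon add c 192.168.6.103:6789
  service ceph start mon.c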

- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

health HEALTH_WARN 1 mons down, quorum 0,1 a,b
monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
osdmap e1342: 18 osds: 18 up, 18 in
 pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than assertion--- begin dump of recent
events ---
  0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
  in thread 7fd5e81b57c0

  ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
  1: /usr/bin/ceph-mon() [0x596fe2]
  2: (()+0xf000) [0x7fd5e782]
  3: (gsignal()+0x35) [0x7fd5e619fba5]
  4: (abort()+0x148) [0x7fd5e61a1358]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
  6: (()+0x5eeb6) [0x7fd5e6a97eb6]
  7: (()+0x5eee3) [0x7fd5e6a97ee3]
  8: (()+0x5f10e) [0x7fd5e6a9810e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x40a) [0x64a6aa]
  10: /usr/bin/ceph-mon() [0x65f916]
  11: /usr/bin/ceph-mon() [0x6960e9]
  12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
  13: (main()+0x1a8a) [0x49786a]
  14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
  15: /usr/bin/ceph-mon() [0x499a69]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
   20/20 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent 1
   max_new 1000
   log_file /var/log/ceph/ceph-mon.c.log
--- end dump of recent events ---



Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Da Chun
FYI. I get the same error with an osd too.


   -11> 2013-06-25 16:00:37.604042 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
<== osd.1 172.18.11.30:0/10964 5300  osd_ping(ping e2200 stamp 2013-06-25 
16:00:37.588367) v2  47+0+0 (3462129666 0 0) 0x4a0ce00 con 0x4a094a0
   -10> 2013-06-25 16:00:37.604075 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
--> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 
16:00:37.588367) v2 -- ?+0 0x47196c0 con 0x4a094a0
-9> 2013-06-25 16:00:37.970605 7f0750e18700 10 monclient: tick
-8> 2013-06-25 16:00:37.970615 7f0750e18700 10 monclient: 
_check_auth_rotating renewing rotating keys (they expired before 2013-06-25 
16:00:07.970614)
-7> 2013-06-25 16:00:37.970630 7f0750e18700 10 monclient: renew subs? (now: 
2013-06-25 16:00:37.970630; renew after: 2013-06-25 16:02:47.970419) -- no
-6> 2013-06-25 16:00:38.626079 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
<== osd.9 172.18.11.34:0/1788 4862  osd_ping(ping e2200 stamp 2013-06-25 
16:00:38.613584) v2  47+0+0 (4007998759 0 0) 0x4efa540 con 0x4f0c580
-5> 2013-06-25 16:00:38.626117 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
--> 172.18.11.34:0/1788 -- osd_ping(ping_reply e2200 stamp 2013-06-25 
16:00:38.613584) v2 -- ?+0 0x4a0ce00 con 0x4f0c580
-4> 2013-06-25 16:00:38.640572 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
<== osd.0 172.18.11.30:0/10931 5280  osd_ping(ping e2200 stamp 2013-06-25 
16:00:38.624922) v2  47+0+0 (350205583 0 0) 0x4acfdc0 con 0x4a09340
-3> 2013-06-25 16:00:38.640606 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
--> 172.18.11.30:0/10931 -- osd_ping(ping_reply e2200 stamp 2013-06-25 
16:00:38.624922) v2 -- ?+0 0x4efa540 con 0x4a09340
-2> 2013-06-25 16:00:39.304307 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
<== osd.1 172.18.11.30:0/10964 5301  osd_ping(ping e2200 stamp 2013-06-25 
16:00:39.288581) v2  47+0+0 (4084422642 0 0) 0x93b8c40 con 0x4a094a0
-1> 2013-06-25 16:00:39.304354 7f0751f1b700  1 -- 172.18.11.32:6802/1594 
--> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 
16:00:39.288581) v2 -- ?+0 0x4acfdc0 con 0x4a094a0
 0> 2013-06-25 16:00:39.829601 7f074e512700 -1 os/FileStore.cc: In function 
'int FileStore::lfn_find(coll_t, const hobject_t&, IndexedPath*)' thread 
7f074e512700 time 2013-06-25 16:00:39.792543
os/FileStore.cc: 166: FAILED assert(!m_filestore_fail_eio || r != -5)


 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
 1: (FileStore::lfn_find(coll_t, hobject_t const&, 
std::tr1::shared_ptr*)+0x109) [0x7df319]
 2: (FileStore::lfn_stat(coll_t, hobject_t const&, stat*)+0x55) [0x7e1005]
 3: (FileStore::stat(coll_t, hobject_t const&, stat*, bool)+0x51) [0x7ef001]
 4: (PG::_scan_list(ScrubMap&, std::vector 
>&, bool, ThreadPool::TPHandle&)+0x3d1) [0x76e391]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, 
ThreadPool::TPHandle&)+0x174) [0x771344]
 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x8a6) [0x772076]
 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbd) 
[0x70f00d]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x8e384c]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x8e4af0]
 10: (()+0x7f8e) [0x7f0761dc5f8e]
 11: (clone()+0x6d) [0x7f0760077e1d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.
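
(That assert, !m_filestore_fail_eio || r != -5, fires when the local filesystem
under the OSD returns EIO, so the disk itself is worth checking; device name
illustrative:

  dmesg | tail -n 50        # look for block-layer / filesystem I/O errors
  smartctl -a /dev/sdb      # SMART health of the disk backing this OSD
)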





-- Original --
From:  "Mike Dawson";
Date:  Wed, Jun 26, 2013 10:50 AM
To:  "Darryl Bond"; 
Cc:  "ceph-users@lists.ceph.com"; 
Subject:  Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4



Darryl,

I've seen this issue a few times recently. I believe Joao was looking 
into it at one point, but I don't know if it has been resolved (Any news 
Joao?). Others have run into it too. Look closely at:

http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b, that's good. 
The workaround that has worked for me is to remove mon.c, then re-add 
it. Assuming your monitor leveldb stores aren't too large, the process 
is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors

then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors

- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:
> Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
> been successfully upgraded from bobtail to cuttlefish and then from
> 6.1.2 to 6.1.3. There have been no changes to ceph.conf.
>
> Node mon.a upgrade, a,b,c monitors OK after upgrade
> Node mon.b upgrade a,b monitors OK after upgrade (note that c

Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Darryl Bond

Thanks for your prompt response.
Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated,
should I delete its contents after removing the monitor and before
re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:

Darryl,

I've seen this issue a few times recently. I believe Joao was looking
into it at one point, but I don't know if it has been resolved (Any news
Joao?). Others have run into it too. Look closely at:

http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b, that's good.
The workaround that has worked for me is to remove mon.c, then re-add
it. Assuming your monitor leveldb stores aren't too large, the process
is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors

then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors

- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

 health HEALTH_WARN 1 mons down, quorum 0,1 a,b
 monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
 osdmap e1342: 18 osds: 18 up, 18 in
  pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
 mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than assertion--- begin dump of recent
events ---
   0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
   in thread 7fd5e81b57c0

   ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
   1: /usr/bin/ceph-mon() [0x596fe2]
   2: (()+0xf000) [0x7fd5e782]
   3: (gsignal()+0x35) [0x7fd5e619fba5]
   4: (abort()+0x148) [0x7fd5e61a1358]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
   6: (()+0x5eeb6) [0x7fd5e6a97eb6]
   7: (()+0x5eee3) [0x7fd5e6a97ee3]
   8: (()+0x5f10e) [0x7fd5e6a9810e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x40a) [0x64a6aa]
   10: /usr/bin/ceph-mon() [0x65f916]
   11: /usr/bin/ceph-mon() [0x6960e9]
   12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
   13: (main()+0x1a8a) [0x49786a]
   14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
   15: /usr/bin/ceph-mon() [0x499a69]
   NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Mike Dawson
I've typically moved it off to a non-conflicting path in lieu of 
deleting it outright, but either way should work. IIRC, I used something 
like:


sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo 
mkdir /var/lib/ceph/mon/ceph-c


- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote:

Thanks for your prompt response.
Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated,
should I delete it's contents after removing the monitor and before
re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:

Darryl,

I've seen this issue a few times recently. I believe Joao was looking
into it at one point, but I don't know if it has been resolved (Any news
Joao?). Others have run into it too. Look closely at:

http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b, that's good.
The workaround that has worked for me is to remove mon.c, then re-add
it. Assuming your monitor leveldb stores aren't too large, the process
is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors


then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors


- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

 health HEALTH_WARN 1 mons down, quorum 0,1 a,b
 monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
 osdmap e1342: 18 osds: 18 up, 18 in
  pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
 mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than assertion--- begin dump of recent
events ---
   0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
   in thread 7fd5e81b57c0

   ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
   1: /usr/bin/ceph-mon() [0x596fe2]
   2: (()+0xf000) [0x7fd5e782]
   3: (gsignal()+0x35) [0x7fd5e619fba5]
   4: (abort()+0x148) [0x7fd5e61a1358]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
   6: (()+0x5eeb6) [0x7fd5e6a97eb6]
   7: (()+0x5eee3) [0x7fd5e6a97ee3]
   8: (()+0x5f10e) [0x7fd5e6a9810e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x40a) [0x64a6aa]
   10: /usr/bin/ceph-mon() [0x65f916]
   11: /usr/bin/ceph-mon() [0x6960e9]
   12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
   13: (main()+0x1a8a) [0x49786a]
   14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
   15: /usr/bin/ceph-mon() [0x499a69]
   NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Darryl Bond

Nope, same outcome.

[root@ceph3 mon]# ceph mon remove c
removed mon.c at 192.168.6.103:6789/0, there are now 2 monitors
[root@ceph3 mon]# mkdir tmp
[root@ceph3 mon]# ceph auth get mon. -o tmp/keyring
exported keyring for mon.
[root@ceph3 mon]# ceph mon getmap -o tmp/monmap
2013-06-26 13:51:26.640097 7ffb48a12700  0 -- :/24748 >>
192.168.6.103:6789/0 pipe(0x1105350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
got latest monmap
[root@ceph3 mon]# ls -l tmp
total 8
-rw-r--r--. 1 root root  55 Jun 26 13:51 keyring
-rw-r--r--. 1 root root 328 Jun 26 13:51 monmap
[root@ceph3 mon]# ceph-mon -i c --mkfs --monmap tmp/monmap --keyring
tmp/keyring
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-c for mon.c
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph mon add c 192.168.6.103:6789
mon c 192.168.6.103:6789/0 already exists
[root@ceph3 mon]# ceph status
2013-06-26 13:53:58.401436 7f0dd653d700  0 -- :/25695 >>
192.168.6.103:6789/0 pipe(0x108e350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
   health HEALTH_WARN 1 mons down, quorum 0,1 a,b
   monmap e3: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14228, quorum 0,1 a,b
   osdmap e1342: 18 osds: 18 up, 18 in
pgmap v4060824: 5448 pgs: 5448 active+clean; 5820 GB data, 11673 GB
used, 35464 GB / 47137 GB avail; 2983KB/s rd, 1217KB/s wr, 552op/s
   mdsmap e1: 0/0/1 up

[root@ceph3 mon]# service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[25887]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph-mon -i c --public-addr 192.168.6.103:6789
[26768]: (33) Numerical argument out of domain
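
A minimal debugging sketch, assuming the generic ceph daemon flags of
this era: running the monitor in the foreground with debugging raised
should print the crash dump to stderr instead of only to the log file.

ceph-mon -d -i c --debug-mon 20 --debug-ms 1

Here -d keeps ceph-mon in the foreground and sends its log to stderr.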

On 06/26/13 13:19, Mike Dawson wrote:

I've typically moved it off to a non-conflicting path in lieu of
deleting it outright, but either way should work. IIRC, I used something
like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo
mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote:

Thanks for your prompt response.
Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated,
should I delete its contents after removing the monitor and before
re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:

Darryl,

I've seen this issue a few times recently. I believe Joao was looking
into it at one point, but I don't know if it has been resolved (Any news
Joao?). Others have run into it too. Look closely at:

http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b; that's good.
The workaround that has worked for me is to remove mon.c, then re-add
it. Assuming your monitor leveldb stores aren't too large, the process
is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors


then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors
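
Condensed, that documented remove/re-add cycle amounts to roughly the
following sketch (the monitor id c, the address and the /tmp paths just
mirror this thread and are illustrative, not prescriptive):

ceph mon remove c
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i c --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
ceph mon add c 192.168.6.103:6789
service ceph start mon.c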


- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:

Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...

  health HEALTH_WARN 1 mons down, quorum 0,1 a,b
  monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14224, quorum 0,1 a,b
  osdmap e1342: 18 osds: 18 up, 18 in
   pgmap v4058788: 5448 pgs: 5447 active+clean, 1
active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
  mdsmap e1: 0/0/1 up

Set debug mon = 20
Nothing going into logs other than the assertion:
--- begin dump of recent events ---
0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
(Aborted) **
in thread 7fd5e81b57c0

ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
 

Re: [ceph-users] Issues going from 1 to 3 mons

2013-06-25 Thread Wolfgang Hennerbichler
On Tue, Jun 25, 2013 at 02:24:35PM +0100, Joao Eduardo Luis wrote:
> (Re-adding the list for future reference)
> 
> Wolfgang, from your log file:
> 
> 2013-06-25 14:58:39.739392 7fa329698780 -1 common/config.cc: In
> function 'void md_config_t::set_val_or_die(const char*, const
> char*)' thread 7fa329698780 time 2013-06-25 14:58:39.738501
> common/config.cc: 621: FAILED assert(ret == 0)
> 
>  ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
>  1: /usr/bin/ceph-mon() [0x660736]
>  2: /usr/bin/ceph-mon() [0x699d66]
>  3: (pick_addresses(CephContext*)+0x93) [0x69a1a3]
>  4: (main()+0x1e3f) [0x48256f]
>  5: (__libc_start_main()+0xed) [0x7fa3278f576d]
>  6: /usr/bin/ceph-mon() [0x4848bd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> This was initially reported on ticket #5205.  Sage fixed it last
> night, for ticket #5195.  Gary reports it fixed using Sage's patch,
> and said fix was backported to the cuttlefish branch.
> 
> It's worth mentioning that the cuttlefish branch also contains a
> couple of commits that should boost monitor performance and avoid
> leveldb hangups.
> 
> Looking into #5195 (http://tracker.ceph.com/issues/5195) for more
> info is advised.  Let us know if you decide to try the cuttlefish
> branch (on the monitors) and whether it fixes the issue for you.
 
Hi Joao, 

thank you for looking into this. I hope to be able to try the latest cuttlefish
branch, but my time is currently quite constrained, so I can't guarantee it. I
assume the fix will be in the next stable release of cuttlefish, which is
great. Thank you.

> Thanks!
> 
>   -Joao

Thank you,
wogri

-- 
http://www.wogri.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4

2013-06-25 Thread Darryl Bond



Got it going.
This helped http://tracker.ceph.com/issues/5205

My ceph.conf has the cluster and public networks defined in [global]. I commented them out and mon.c started successfully.

[global]
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
#   public network = 192.168.6.0/24
#   cluster network = 10.6.0.0/16
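
After that change, restarting the monitor with the same start command
that previously failed should be all that is needed; a minimal sketch,
assuming the sysvinit script used throughout this thread:

service ceph start mon.c

The ceph status output below then shows all three monitors back in quorum.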

# ceph status
   health HEALTH_OK
   monmap e3: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14230, quorum 0,1,2 a,b,c
   osdmap e1538: 18 osds: 17 up, 17 in
    pgmap v4064405: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5829 GB data, 11691 GB used, 34989 GB / 46681 GB avail; 328B/s rd, 816KB/s wr, 135op/s
   mdsmap e1: 0/0/1 up

Looks like there is a fix on the way.
Darryl

On 06/26/13 13:58, Darryl Bond wrote:


Nope, same outcome.

[root@ceph3 mon]# ceph mon remove c
removed mon.c at 192.168.6.103:6789/0, there are now 2 monitors
[root@ceph3 mon]# mkdir tmp
[root@ceph3 mon]# ceph auth get mon. -o tmp/keyring
exported keyring for mon.
[root@ceph3 mon]# ceph mon getmap -o tmp/monmap
2013-06-26 13:51:26.640097 7ffb48a12700  0 -- :/24748 >>
192.168.6.103:6789/0 pipe(0x1105350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
got latest monmap
[root@ceph3 mon]# ls -l tmp
total 8
-rw-r--r--. 1 root root  55 Jun 26 13:51 keyring
-rw-r--r--. 1 root root 328 Jun 26 13:51 monmap
[root@ceph3 mon]# ceph-mon -i c --mkfs --monmap tmp/monmap --keyring
tmp/keyring
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-c for mon.c
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph mon add c 192.168.6.103:6789
mon c 192.168.6.103:6789/0 already exists
[root@ceph3 mon]# ceph status
2013-06-26 13:53:58.401436 7f0dd653d700  0 -- :/25695 >>
192.168.6.103:6789/0 pipe(0x108e350 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
health HEALTH_WARN 1 mons down, quorum 0,1 a,b
monmap e3: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 14228, quorum 0,1 a,b
osdmap e1342: 18 osds: 18 up, 18 in
 pgmap v4060824: 5448 pgs: 5448 active+clean; 5820 GB data, 11673 GB
used, 35464 GB / 47137 GB avail; 2983KB/s rd, 1217KB/s wr, 552op/s
mdsmap e1: 0/0/1 up

[root@ceph3 mon]# service ceph start mon.c
=== mon.c ===
Starting Ceph mon.c on ceph3...
[25887]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i c --pid-file
/var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on ceph3...
[root@ceph3 mon]# ls ceph-c
keyring  store.db
[root@ceph3 mon]# ceph-mon -i c --public-addr 192.168.6.103:6789
[26768]: (33) Numerical argument out of domain

On 06/26/13 13:19, Mike Dawson wrote:


I've typically moved it off to a non-conflicting path in lieu of
deleting it outright, but either way should work. IIRC, I used something
like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak && sudo
mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote:


Thanks for your prompt response.
Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated,
should I delete its contents after removing the monitor and before
re-adding it?

Darryl

On 06/26/13 12:50, Mike Dawson wrote:


Darryl,

I've seen this issue a few times recently. I believe Joao was looking
into it at one point, but I don't know if it has been resolved (Any news
Joao?). Others have run into it too. Look closely at:

http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15

I'd recommend you submit this as a bug on the tracker.

It sounds like you have reliable quorum between a and b; that's good.
The workaround that has worked for me is to remove mon.c, then re-add
it. Assuming your monitor leveldb stores aren't too large, the process
is rather quick. Follow the instructions at:

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors


then

http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors


- Mike


On 6/25/2013 10:34 PM, Darryl Bond wrote:


Upgrading a cluster from 6.1.3 to 6.1.4  with 3 monitors. Cluster had
been successfully upgraded from bobtail to cuttlefish and then from
6.1.2 to 6.1.3. There have been no changes to ceph.conf.

Node mon.a upgrade, a,b,c monitors OK after upgrade
Node mon.b upgrade a,b monitors OK after upgrade (note that c was not
available, even though I hadn't touched it)
Node mon.c very slow to install the upgrade, RAM was tight for some
reason and mon process was using half the RAM
Node mon.c shutdown mon.c
Node mon.c performed the upgrade
Node mon.c restart ceph - mon.c will not start


service ceph start mon.c

=== mon.c ===
Starting Ceph mon.c on ceph3...
[23992]: (33) Numeric