Re: [ceph-users] Trying to rescue a lost quorum

2014-03-04 Thread Marc
UPDATE. I have determined the mon sync heartbeat timeout to be what is triggering, since increasing it also increases the duration of the sync attempts. Could those heartbeats be quorum-related? That'd explain why they aren't being sent. Also, is it safe to temporarily increase this timeout to, say, an hour or two?
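For anyone following along, a minimal sketch of how such a timeout could be raised temporarily, assuming the cuttlefish-era option name mentioned above and a value in seconds (the 3600 is just an illustrative figure, not from the thread):

    # ceph.conf on the monitor hosts, then restart the affected mon
    [mon]
        mon sync heartbeat timeout = 3600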

Re: [ceph-users] Trying to rescue a lost quorum

2014-03-02 Thread Marc
Hi, I had already figured that out later, thanks though. So back to 0.61.2 it was. I was then trying to see whether debug logging would tell me why the mons won't rejoin the cluster. Their logs look like this: (interesting part at the bottom... I think) 2014-03-02 14:25:34.960372 7f7c13a6e700 10
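In case it is useful to others, monitor debug output of this kind can be obtained by raising the mon, paxos and messenger debug levels before restarting the daemon; a minimal sketch, with levels that are common troubleshooting choices rather than anything taken from the original message:

    # ceph.conf on the affected monitor host
    [mon]
        debug mon = 10
        debug paxos = 10
        debug ms = 1

    # restart that monitor (syntax depends on the init system) and watch its log
    service ceph restart mon.b
    tail -f /var/log/ceph/ceph-mon.b.log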

Re: [ceph-users] Trying to rescue a lost quorum

2014-03-01 Thread Martin B Nielsen
Hi, You can't form quorum with your monitors on cuttlefish if you're mixing < 0.61.5 with any 0.61.5+ ( https://ceph.com/docs/master/release-notes/ ) => section about 0.61.5. I'd advise installing pre-0.61.5, forming quorum, and then upgrading to 0.61.9 (if need be) - and then the latest dumpling on top.
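A rough sketch of what that order could look like on a Debian/Ubuntu host; the package version strings below are placeholders only, so check what the Ceph repository actually carries for your distribution:

    # bring every mon back to the same pre-0.61.5 build first, e.g. 0.61.2
    apt-get install ceph=0.61.2-1precise ceph-common=0.61.2-1precise

    # once quorum is re-established, upgrade the mons one at a time
    apt-get install ceph=0.61.9-1precise ceph-common=0.61.9-1precise

    # only then move on to the latest dumpling packages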

Re: [ceph-users] Trying to rescue a lost quorum

2014-02-27 Thread Marc
Hi, thanks for the reply. I updated one of the new mons. And after a reasonably long init phase (inconsistent state), I am now seeing these: 2014-02-28 01:05:12.344648 7fe9d05cb700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-02-28 01:05:12.345599 7f
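A cephx decrypt error of that kind is usually worth checking keyrings and clocks for; a minimal sketch of those checks, assuming the default monitor data paths (adjust the mon ids and hostnames to your own setup):

    # the mon. secret must be identical across monitor hosts
    grep -A1 'mon\.' /var/lib/ceph/mon/ceph-a/keyring
    grep -A1 'mon\.' /var/lib/ceph/mon/ceph-b/keyring

    # significant clock skew can also break cephx; compare the hosts
    date; ssh other-mon-host date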

Re: [ceph-users] Trying to rescue a lost quorum

2014-02-27 Thread Gregory Farnum
On Thu, Feb 27, 2014 at 4:25 PM, Marc wrote: > Hi, > > I was handed a Ceph cluster that had just lost quorum due to 2/3 mons > (b,c) running out of disk space (using up 15GB each). We were trying to > rescue this cluster without service downtime. As such we freed up some > space to keep mon b runn

[ceph-users] Trying to rescue a lost quorum

2014-02-27 Thread Marc
Hi, I was handed a Ceph cluster that had just lost quorum due to 2/3 mons (b,c) running out of disk space (using up 15GB each). We were trying to rescue this cluster without service downtime. As such we freed up some space to keep mon b running a while longer, which succeeded, quorum restored (a,b
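For anyone hitting the same disk-space squeeze: the space is typically consumed by the monitor's leveldb store, and it can be inspected and, on releases that support the command, compacted roughly like this; a sketch under those assumptions, not something taken from the original thread:

    # see how large each monitor's store has grown
    du -sh /var/lib/ceph/mon/ceph-b/store.db

    # ask a monitor to compact its store (requires being able to reach the mons)
    ceph tell mon.b compact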