I have started over from scratch a few times myself ;-)

Michael Kuriger
mk7...@yp.com
818-649-7235
MikeKuriger (IM)

From: JIten Shah <jshah2...@me.com>
Date: Friday, November 21, 2014 at 9:44 AM
To: Michael Kuriger <mk7...@yp.com>
Cc: Craig Lewis <cle...@centraldesktop.com>, ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] pg's degraded

Thanks Michael. That was a good idea.

I did:

1. sudo service ceph stop mds

2. ceph mds newfs 1 0 --yes-i-really-mean-it (where 1 and 0 are the pool IDs for
metadata and data)

3. ceph health (It was healthy now!!!)

4. sudo service ceph start mds.$(hostname -s)

And I am back in business.
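For anyone following along, the steps above collected into one reviewable script (a sketch, not from the thread): the pool IDs 1 (metadata) and 0 (data) are the ones used here, and the commands are echoed rather than executed so you can check them against your own cluster first.

```shell
# Sketch: MDS reset steps from this thread, dry-run style.
# metadata_pool/data_pool are this cluster's pool IDs; verify yours
# with `ceph osd lspools` before running anything for real.
metadata_pool=1
data_pool=0
cmds="sudo service ceph stop mds
ceph mds newfs $metadata_pool $data_pool --yes-i-really-mean-it
ceph health
sudo service ceph start mds.$(hostname -s)"
echo "$cmds"
```

Note that `newfs` wipes the filesystem metadata, so this only makes sense on a filesystem with no data worth keeping, as was the case here.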

Thanks again.

—Jiten



On Nov 20, 2014, at 5:47 PM, Michael Kuriger <mk7...@yp.com> wrote:

Maybe delete the pool and start over?


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JIten 
Shah
Sent: Thursday, November 20, 2014 5:46 PM
To: Craig Lewis
Cc: ceph-users
Subject: Re: [ceph-users] pg's degraded

Hi Craig,

Recreating the missing PGs fixed it.  Thanks for your help.

But when I tried to mount the filesystem, it gave me “mount error 5”. I
tried to restart the MDS server, but that didn’t help; it tells me that it’s
laggy/unresponsive.

BTW, all these machines are VM’s.

[jshah@Lab-cephmon001 ~]$ ceph health detail
HEALTH_WARN mds cluster is degraded; mds Lab-cephmon001 is laggy
mds cluster is degraded
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 rank 0 is replaying journal
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 is laggy/unresponsive


—Jiten

On Nov 20, 2014, at 4:20 PM, JIten Shah <jshah2...@me.com> wrote:


Ok. Thanks.

—Jiten

On Nov 20, 2014, at 2:14 PM, Craig Lewis <cle...@centraldesktop.com> wrote:


If there's no data to lose, tell Ceph to re-create all the missing PGs.

ceph pg force_create_pg 2.33

Repeat for each of the missing PGs.  If that doesn't do anything, you might 
need to tell Ceph that you lost the OSDs.  For each OSD you moved, run ceph osd 
lost <OSDID>, then try the force_create_pg command again.

If that doesn't work, you can keep fighting with it, but it'll be faster to 
rebuild the cluster.
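With many stuck PGs, the per-PG command above can be scripted. The loop below is a sketch (not from the thread): it pulls the PG ids out of `ceph health detail`-style output and prints a `force_create_pg` command for each. A heredoc-style variable stands in for live output; on a real cluster you would pipe `ceph health detail` into the same awk, and only fall back to `ceph osd lost <id>` for OSDs that are truly gone.

```shell
# Sketch: generate one force_create_pg command per stuck-unclean PG.
# $health_output stands in for `ceph health detail`; the generated
# commands are printed, not run, so they can be reviewed first.
health_output='pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]'
cmds=$(printf '%s\n' "$health_output" \
  | awk '/is stuck unclean/ {print "ceph pg force_create_pg " $2}')
echo "$cmds"
```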



On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah <jshah2...@me.com> wrote:
Thanks for your help.

I was using Puppet to install the OSDs, which takes a path rather than a device
name. The path specified was incorrect, so the OSDs were created under the root
volume.

All 3 of the OSDs were rebuilt at the same time, because the cluster was unused
and we had not put any data in it.

Is there any way to recover from this, or should I rebuild the cluster altogether?

—Jiten

On Nov 20, 2014, at 1:40 PM, Craig Lewis <cle...@centraldesktop.com> wrote:


So you have your crushmap set to choose osd instead of choose host?

Did you wait for the cluster to recover between each OSD rebuild?  If you 
rebuilt all 3 OSDs at the same time (or without waiting for a complete recovery 
between them), that would cause this problem.
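One way to answer the crushmap question is to decompile the map and look at the `chooseleaf` step. The snippet below is a sketch: on a live cluster you would run `ceph osd getcrushmap -o map && crushtool -d map -o map.txt`; here a sample rule is inlined so the check can be shown. `type osd` instead of `type host` would mean replicas are allowed to land on the same host's disks.

```shell
# Sketch: extract the failure-domain type from a decompiled crush map.
# $map_txt stands in for the output of `crushtool -d`.
map_txt='rule replicated_ruleset {
        ruleset 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
}'
domain=$(printf '%s\n' "$map_txt" | awk '/chooseleaf/ {print $NF}')
echo "$domain"
```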



On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah <jshah2...@me.com> wrote:
Yes, it was a healthy cluster, and I had to rebuild because the OSDs got
accidentally created on the root disk. Out of 4 OSDs I had to rebuild 3 of
them.


[jshah@Lab-cephmon001 ~]$ ceph osd tree
# id  weight   type name               up/down  reweight
-1    0.5      root default
-2    0.09999    host Lab-cephosd005
4     0.09999      osd.4               up       1
-3    0.09999    host Lab-cephosd001
0     0.09999      osd.0               up       1
-4    0.09999    host Lab-cephosd002
1     0.09999      osd.1               up       1
-5    0.09999    host Lab-cephosd003
2     0.09999      osd.2               up       1
-6    0.09999    host Lab-cephosd004
3     0.09999      osd.3               up       1


[jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
Error ENOENT: i don't have pgid 2.33

—Jiten


On Nov 20, 2014, at 11:18 AM, Craig Lewis <cle...@centraldesktop.com> wrote:


Just to be clear, this is from a cluster that was healthy, had a disk replaced, 
and hasn't returned to healthy?  It's not a new cluster that has never been 
healthy, right?

Assuming it's an existing cluster, how many OSDs did you replace?  It almost 
looks like you replaced multiple OSDs at the same time, and lost data because 
of it.

Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?


On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah <jshah2...@me.com> wrote:
After rebuilding a few OSDs, I see that the PGs are stuck in degraded mode.
Some are in the unclean state and others are in the stale state. Somehow the MDS is
also degraded. How do I recover the OSDs and the MDS back to healthy? I read
through the documentation and on the web, but no luck so far.

pg 2.33 is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [3]
pg 0.30 is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [3]
pg 1.31 is stuck unclean since forever, current state stale+active+degraded, 
last acting [2]
pg 2.32 is stuck unclean for 597129.903922, current state 
stale+active+degraded, last acting [2]
pg 0.2f is stuck unclean for 597129.903951, current state 
stale+active+degraded, last acting [2]
pg 1.2e is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [3]
pg 2.2d is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [2]
pg 0.2e is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [3]
pg 1.2f is stuck unclean for 597129.904015, current state 
stale+active+degraded, last acting [2]
pg 2.2c is stuck unclean since forever, current state 
stale+active+degraded+remapped, last acting [3]
pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, 
last acting [2]
pg 1.2c is stuck stale for 422598.539483, current state 
stale+active+degraded+remapped, last acting [3]
pg 2.2f is stuck stale for 422598.539488, current state 
stale+active+degraded+remapped, last acting [3]
pg 0.2c is stuck stale for 422598.539487, current state 
stale+active+degraded+remapped, last acting [3]
pg 1.2d is stuck stale for 422598.539492, current state 
stale+active+degraded+remapped, last acting [3]
pg 2.2e is stuck stale for 422598.539496, current state 
stale+active+degraded+remapped, last acting [3]
pg 0.2b is stuck stale for 422598.539491, current state 
stale+active+degraded+remapped, last acting [3]
pg 1.2a is stuck stale for 422598.539496, current state 
stale+active+degraded+remapped, last acting [3]
pg 2.29 is stuck stale for 422598.539504, current state 
stale+active+degraded+remapped, last acting [3]
.
.
.
6 ops are blocked > 2097.15 sec
3 ops are blocked > 2097.15 sec on osd.0
2 ops are blocked > 2097.15 sec on osd.2
1 ops are blocked > 2097.15 sec on osd.4
3 osds have slow requests
recovery 40/60 objects degraded (66.667%)
mds cluster is degraded
mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal

—Jiten


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



