So you have your crushmap set to choose osd instead of choose host?
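
You can double-check with something like this (just a sketch; the file
paths are arbitrary):

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    # look at the "step choose..." lines in the rules
    grep "step choose" /tmp/crushmap.txt

A rule with "step chooseleaf firstn 0 type host" spreads replicas across
hosts; "type osd" only guarantees separate OSDs, so all copies can land on
one host.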

Did you wait for the cluster to recover between each OSD rebuild?  If you
rebuilt all 3 OSDs at the same time (or without waiting for a complete
recovery between them), that would cause this problem.
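
When you do them one at a time, something like this (rough sketch) between
rebuilds is enough to be safe:

    # wait until the cluster reports healthy before touching the next OSD
    while ! ceph health | grep -q HEALTH_OK; do
        sleep 60
    done

`ceph -s` will also show the recovery progress while you wait.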



On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah <jshah2...@me.com> wrote:

> Yes, it was a healthy cluster, and I had to rebuild because the OSDs got
> accidentally created on the root disk. Out of 4 OSDs, I had to rebuild 3 of
> them.
>
>
> [jshah@Lab-cephmon001 ~]$ ceph osd tree
> # id  weight    type name                up/down  reweight
> -1    0.5       root default
> -2    0.09999       host Lab-cephosd005
> 4     0.09999           osd.4            up       1
> -3    0.09999       host Lab-cephosd001
> 0     0.09999           osd.0            up       1
> -4    0.09999       host Lab-cephosd002
> 1     0.09999           osd.1            up       1
> -5    0.09999       host Lab-cephosd003
> 2     0.09999           osd.2            up       1
> -6    0.09999       host Lab-cephosd004
> 3     0.09999           osd.3            up       1
>
>
> [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
> Error ENOENT: i don't have pgid 2.33
>
> —Jiten
>
>
> On Nov 20, 2014, at 11:18 AM, Craig Lewis <cle...@centraldesktop.com>
> wrote:
>
> Just to be clear, this is from a cluster that was healthy, had a disk
> replaced, and hasn't returned to healthy?  It's not a new cluster that has
> never been healthy, right?
>
> Assuming it's an existing cluster, how many OSDs did you replace?  It
> almost looks like you replaced multiple OSDs at the same time, and lost
> data because of it.
>
> Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?
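>
> A quick way to pull just the problem PGs, if that's easier (a sketch):
>
>     ceph health detail
>     ceph pg dump_stuck unclean
>     ceph pg dump_stuck stale
>
> That should show which OSDs each stuck PG is still expecting.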
>
>
> On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah <jshah2...@me.com> wrote:
>
>> After rebuilding a few OSDs, I see that the PGs are stuck in degraded
>> mode. Some are unclean and others are stale. Somehow the MDS is also
>> degraded. How do I recover the OSDs and the MDS back to a healthy state?
>> I've read through the documentation and searched the web, but no luck so far.
>>
>> pg 2.33 is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 0.30 is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 1.31 is stuck unclean since forever, current state
>> stale+active+degraded, last acting [2]
>> pg 2.32 is stuck unclean for 597129.903922, current state
>> stale+active+degraded, last acting [2]
>> pg 0.2f is stuck unclean for 597129.903951, current state
>> stale+active+degraded, last acting [2]
>> pg 1.2e is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 2.2d is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [2]
>> pg 0.2e is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 1.2f is stuck unclean for 597129.904015, current state
>> stale+active+degraded, last acting [2]
>> pg 2.2c is stuck unclean since forever, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 0.2d is stuck stale for 422844.566858, current state
>> stale+active+degraded, last acting [2]
>> pg 1.2c is stuck stale for 422598.539483, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 2.2f is stuck stale for 422598.539488, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 0.2c is stuck stale for 422598.539487, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 1.2d is stuck stale for 422598.539492, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 2.2e is stuck stale for 422598.539496, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 0.2b is stuck stale for 422598.539491, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 1.2a is stuck stale for 422598.539496, current state
>> stale+active+degraded+remapped, last acting [3]
>> pg 2.29 is stuck stale for 422598.539504, current state
>> stale+active+degraded+remapped, last acting [3]
>> .
>> .
>> .
>> 6 ops are blocked > 2097.15 sec
>> 3 ops are blocked > 2097.15 sec on osd.0
>> 2 ops are blocked > 2097.15 sec on osd.2
>> 1 ops are blocked > 2097.15 sec on osd.4
>> 3 osds have slow requests
>> recovery 40/60 objects degraded (66.667%)
>> mds cluster is degraded
>> mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal
>>
>> —Jiten
>>
>>
>>
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
