Yes, removing an OSD before re-creating it will give you the same OSD ID.
That's my preferred method, because it keeps the CRUSH map the same: only
the PGs that existed on the replaced disk need to be backfilled.
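
For reference, here is roughly the sequence I have in mind (the hostname and
device below are just placeholders pulled from your df output, so adjust them
to your environment, and make sure the dead ceph-osd daemon is stopped first):

    # take the failed OSD out and remove it so its ID is freed
    ceph osd out 70
    ceph osd crush remove osd.70
    ceph auth del osd.70
    ceph osd rm 70

    # re-create it; Ceph hands out the lowest free ID, so it should come
    # back as osd.70 as long as nothing else was created in between
    ceph-deploy osd create hqosd6:/dev/sdl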

I don't know whether adding the replacement to the same host and then
removing the old OSD gives you the same CRUSH map as doing it in the
reverse order.  I suspect not, because the OSDs get re-ordered on that host.
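
One way to check, if you're curious (I haven't diffed the two orderings
myself), is to dump and decompile the CRUSH map before and after the swap:

    ceph osd getcrushmap -o crush.before
    crushtool -d crush.before -o crush.before.txt
    # ...replace the disk...
    ceph osd getcrushmap -o crush.after
    crushtool -d crush.after -o crush.after.txt
    diff crush.before.txt crush.after.txt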


On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley <smi...@npr.org> wrote:

>   Craig,
>
>  Thanks for the info.
>
>  I ended up doing a zap and then a create via ceph-deploy.
>
>  One question that I still have is surrounding adding the failed osd back
> into the pool.
>
>  In this example...osd.70 was bad....when I added it back in via
> ceph-deploy...the disk was brought up as osd.108.
>
>  Only after osd.108 was up and running did I think to remove osd.70 from
> the crush map etc.
>
>  My question is this...had I removed it from the crush map prior to my
> ceph-deploy create...should/would Ceph have reused the osd number 70?
>
>  I would prefer to replace a failed disk with a new one and keep the old
> osd assignment if possible...that is why I am asking.
>
>  Anyway...thanks again for all the help.
>
>  Shain
>
> Sent from my iPhone
>
> On Nov 7, 2014, at 2:09 PM, Craig Lewis <cle...@centraldesktop.com> wrote:
>
>   I'd stop that osd daemon, and run xfs_check / xfs_repair on that
> partition.
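>
>  For example, with the daemon stopped (the paths here are just taken from
> your mount output below; xfs_repair needs the filesystem unmounted):
>
>      umount /var/lib/ceph/osd/ceph-70
>      xfs_repair /dev/sdl1
>      mount /dev/sdl1 /var/lib/ceph/osd/ceph-70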
>
>  If you repair anything, you should probably force a deep-scrub on all
> the PGs on that disk.  I think ceph osd deep-scrub <osdid> will do that,
> but you might have to manually grep the output of ceph pg dump.
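>
>  Something along these lines, assuming osd.70 (untested from here, so
> sanity-check the PG list before kicking off the scrubs):
>
>      # deep-scrub everything hosted on that OSD
>      ceph osd deep-scrub 70
>
>      # or list the PGs whose acting set includes osd.70 and scrub them
>      # one at a time (the grep is crude; eyeball the acting column)
>      ceph pg dump | grep -w 70
>      ceph pg deep-scrub <pgid>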
>
>
>  Or you could just treat it like a failed disk, but re-use the disk.
> ceph-disk-prepare --zap-disk should take care of you.
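>
>  For example (the device path is just taken from the df output further
> down this thread, and --zap-disk will wipe it, so triple-check it first):
>
>      ceph-disk-prepare --zap-disk /dev/sdl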
>
>
> On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <smi...@npr.org> wrote:
>
>> I tried restarting all the osd's on that node, osd.70 was the only ceph
>> process that did not come back online.
>>
>> There is nothing in the ceph-osd log for osd.70.
>>
>> However, I do see over 13,000 of these messages in the kern.log:
>>
>> Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
>> xfs_log_force: error 5 returned.
>>
>> Does anyone have any suggestions on how I might be able to get this HD
>> back in the cluster (or whether or not it is worth even trying)?
>>
>> Thanks,
>>
>> Shain
>>
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
>> smi...@npr.org | 202.513.3649
>>
>> ________________________________________
>> From: Shain Miley [smi...@npr.org]
>> Sent: Tuesday, November 04, 2014 3:55 PM
>> To: ceph-users@lists.ceph.com
>> Subject: osd down
>>
>> Hello,
>>
>> We are running ceph version 0.80.5 with 108 osd's.
>>
>> Today I noticed that one of the osd's is down:
>>
>> root@hqceph1:/var/log/ceph# ceph -s
>>      cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>>       health HEALTH_WARN crush map has legacy tunables
>>       monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
>>              election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>>       osdmap e7119: 108 osds: 107 up, 107 in
>>        pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
>>              216 TB used, 171 TB / 388 TB avail
>>                  3204 active+clean
>>                     4 active+clean+scrubbing
>>    client io 4079 kB/s wr, 8 op/s
>>
>>
>> Using osd dump I determined that it is osd number 70:
>>
>> osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
>> last_clean_interval [488,2665) 10.35.1.217:6814/22440
>> 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
>> autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568
>>
>>
>> Looking at that node, the drive is still mounted, I did not see any
>> errors in any of the system logs, and the RAID status shows the
>> drive as up and healthy, etc.
>>
>>
>> root@hqosd6:~# df -h |grep 70
>> /dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70
>>
>>
>> I was hoping that someone might be able to advise me on the next course
>> of action (can I add the osd back in, should I replace the drive
>> altogether, etc.).
>>
>> I have attached the osd log to this email.
>>
>> Any suggestions would be great.
>>
>> Thanks,
>>
>> Shain
>>
>> --
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
>> smi...@npr.org | 202.513.3649
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
