Craig,

Thanks for the info.

I ended up doing a zap and then a create via ceph-deploy. One question that I still have is about adding the failed OSD back into the pool. In this example osd.70 was the bad one; when I added the disk back in via ceph-deploy, it was brought up as osd.108. Only after osd.108 was up and running did I think to remove osd.70 from the CRUSH map, etc.

My question is this: had I removed osd.70 from the CRUSH map prior to my ceph-deploy create, should/would Ceph have reused OSD number 70? I would prefer to replace a failed disk with a new one and keep the old OSD assignment, if that is possible, which is why I am asking.

Anyway, thanks again for all the help.
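For what it's worth, the sequence below is roughly what I think I should have run before the create, so that the new disk came back as osd.70; my understanding is that "ceph osd create" (which ceph-deploy calls under the hood) hands out the lowest free ID, so fully removing the old entry first should let 70 be reused. Treat it as an untested sketch (the host and device names are just the ones from this thread):

  # with the old osd.70 daemon already stopped on hqosd6
  ceph osd out 70                  # already marked out in my case
  ceph osd crush remove osd.70     # drop it from the CRUSH map
  ceph auth del osd.70             # delete its cephx key
  ceph osd rm 70                   # remove it from the osdmap, freeing ID 70

  # wipe and re-create the disk; the new OSD should pick up the lowest free ID
  ceph-deploy disk zap hqosd6:sdl
  ceph-deploy osd create hqosd6:sdl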
Shain

Sent from my iPhone

On Nov 7, 2014, at 2:09 PM, Craig Lewis <cle...@centraldesktop.com> wrote:

I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition. If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think "ceph osd deep-scrub <osdid>" will do that, but you might have to manually grep "ceph pg dump".

Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you.

On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <smi...@npr.org> wrote:

I tried restarting all the OSDs on that node; osd.70 was the only ceph process that did not come back online. There is nothing in the ceph-osd log for osd.70. However, I do see over 13,000 of these messages in kern.log:

Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.

Does anyone have any suggestions on how I might be able to get this drive back into the cluster (or whether or not it is even worth trying)?

Thanks,
Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

________________________________________
From: Shain Miley [smi...@npr.org]
Sent: Tuesday, November 04, 2014 3:55 PM
To: ceph-users@lists.ceph.com
Subject: osd down

Hello,

We are running ceph version 0.80.5 with 108 OSDs. Today I noticed that one of the OSDs is down:

root@hqceph1:/var/log/ceph# ceph -s
    cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
     health HEALTH_WARN crush map has legacy tunables
     monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}, election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
     osdmap e7119: 108 osds: 107 up, 107 in
      pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
            216 TB used, 171 TB / 388 TB avail
                3204 active+clean
                   4 active+clean+scrubbing
  client io 4079 kB/s wr, 8 op/s

Using "osd dump" I determined that it is osd number 70:

osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913 last_clean_interval [488,2665) 10.35.1.217:6814/22440 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568

Looking at that node, the drive is still mounted and I did not see any errors in any of the system logs, and the RAID status shows the drive as up and healthy, etc.
root@hqosd6:~# df -h |grep 70
/dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70

I was hoping that someone might be able to advise me on the next course of action (can I add the OSD back in, should I replace the drive altogether, etc.). I have attached the osd log to this email.

Any suggestions would be great.

Thanks,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
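P.S. For anyone else who finds this thread in the archives: the repair route Craig describes above would, as far as I understand it, look roughly like the following on the OSD host. This is only a sketch (I went the zap/re-create route instead, so I have not actually run it here), and the stop/start commands assume an Upstart-based install:

  stop ceph-osd id=70             # or: service ceph stop osd.70
  umount /var/lib/ceph/osd/ceph-70
  xfs_repair -n /dev/sdl1         # dry run: report problems without changing anything
  xfs_repair /dev/sdl1            # actual repair
  mount /dev/sdl1 /var/lib/ceph/osd/ceph-70
  start ceph-osd id=70

  # then tell osd.70 to deep-scrub everything it holds
  ceph osd deep-scrub 70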
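And for the "manually grep ceph pg dump" part, something like this should list the placement groups that have a copy on osd.70 and deep-scrub them one by one. Again, just a sketch I have not verified, since the pg dump column layout can differ between releases; the up/acting sets are printed in brackets, e.g. [70,12,33]:

  # PG lines start with a pgid like "3.1a"; the bracket match avoids
  # also catching osd.170 or osd.700
  ceph pg dump | grep -E '^[0-9]+\.[0-9a-f]+' | grep -E '[[,]70[],]' | awk '{print $1}'

  # deep-scrub each of those PGs individually
  for pg in $(ceph pg dump | grep -E '^[0-9]+\.[0-9a-f]+' | grep -E '[[,]70[],]' | awk '{print $1}'); do
    ceph pg deep-scrub "$pg"
  done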
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com