Greg, I didn't do anything during that logging period, and no clients were connected to the cluster. That's the log generated from the "starting deep scrub" moment up to the "wrongly marked out" one.
Yesterday I tried upgrading the osd to 0.57, but nothing changed. So I decided to delete osd.0 entirely to force a rebuild from scratch: I waited for the cluster to reach a fully "active+clean" state (I mean just before the "starting deep scrub" moment), stopped all the osds, ran mkfs.xfs and went through all the steps to re-add the osd to the cluster (roughly the sequence sketched after the quoted message below), then restarted all the osds.

The cluster started repopulating the device and I thought I was on the right track, but... this morning I found the ceph-osd process crashed and the device *full* (note that the other devices are at about 70%):

/dev/sda1  1.9T  1.8T  11G  100%  /var/lib/ceph/osd/ceph-0

The log of the crash is here:
https://docs.google.com/file/d/0B1lZcgrNMBAJaEpWT2hLemRwNEE/edit?usp=sharing

Then I increased the log verbosity and restarted the osd again. The log is here:
https://docs.google.com/file/d/0B1lZcgrNMBAJSU1Nc0NSMjdnYU0/edit?usp=sharing

Immediately I noticed that the used space had dropped to 64%:

/dev/sda1  1.9T  1.2T  673G  64%  /var/lib/ceph/osd/ceph-0

So the osd is still getting full [of what?], and after 40 minutes it starts logging only this line every 5 seconds:

2013-03-06 15:15:47.311447 7f0d55185700 0 -- 192.168.21.134:6808/16648 >> 192.168.21.134:6828/20837 pipe(0x18252780 sd=24 :46533 s=1 pgs=4821 cs=4 l=0).connect claims to be 192.168.21.134:6828/20733 not 192.168.21.134:6828/20837 - wrong node!

Hope this helps

2013/3/6 Greg Farnum <g...@inktank.com>

> On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> > Hi,
> > I've collected an osd log with these parameters:
> >
> > debug osd = 20
> > debug ms = 1
> > debug filestore = 20
> >
> > You can download it from here:
> > https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
> >
> > I have also captured a video to show the behavior in realtime:
> > http://youtu.be/708AI8PGy7k
>
> Ah, this is interesting — the ceph-osd processes are using up the time,
> not the filesystem or something. However, I don't see any reason for that
> in a brief look at the OSD log here — can you describe what you did to the
> OSD during that logging period? (In particular I see a lot of pg_log
> messages, but not the sub op messages that would be associated with this
> OSD doing a deep scrub, nor the internal heartbeat timeouts that the other
> OSDs were generating.)
> -Greg
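P.S. For reference, this is roughly the remove/re-add sequence I followed for osd.0, written from memory and based on the standard manual procedure; the exact syntax (especially the "crush set" and auth lines) may differ a bit on 0.56/0.57, and the hostname/weight below are just placeholders:

    # stop and remove the old osd.0
    service ceph stop osd.0
    ceph osd out 0
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0

    # wipe and remount the device
    umount /var/lib/ceph/osd/ceph-0
    mkfs.xfs -f /dev/sda1
    mount /dev/sda1 /var/lib/ceph/osd/ceph-0

    # re-create and re-register the osd, then start it
    ceph osd create
    ceph-osd -i 0 --mkfs --mkkey
    ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
    ceph osd crush set 0 osd.0 1.0 pool=default host=<myhostname>
    service ceph start osd.0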