Greg, I didn't do anything during that logging period, and no clients were connected to the cluster. That's the log generated from the "starting deep scrub" moment up to the "wrongly marked out" one.
Yesterday I tried upgrading the osd to 0.57, but nothing changed. So I decided to delete osd.0 entirely to force a rebuild from scratch: I waited for the cluster to reach a fully "active+clean" state (I mean just before the "starting deep scrub" moment), stopped all the osds, ran mkfs.xfs and went through all the steps to re-add the osd to the cluster (roughly the sequence sketched after the quoted message below), then restarted all the osds.

The cluster started repopulating the device and I thought I was on the right track, but... this morning I found the ceph-osd process crashed and the device *full* (note that the other devices are at about 70%):

/dev/sda1  1.9T  1.8T  11G  100%  /var/lib/ceph/osd/ceph-0

The log of the crash is here:
https://docs.google.com/file/d/0B1lZcgrNMBAJaEpWT2hLemRwNEE/edit?usp=sharing

Then I increased the log verbosity and restarted the osd again. The log is here:
https://docs.google.com/file/d/0B1lZcgrNMBAJSU1Nc0NSMjdnYU0/edit?usp=sharing

Immediately I noticed that the used space had dropped to 64%:

/dev/sda1  1.9T  1.2T  673G  64%  /var/lib/ceph/osd/ceph-0

So the osd is still getting full [of what?], and after 40 minutes it starts logging only this line every 5 seconds:

2013-03-06 15:15:47.311447 7f0d55185700 0 -- 192.168.21.134:6808/16648 >> 192.168.21.134:6828/20837 pipe(0x18252780 sd=24 :46533 s=1 pgs=4821 cs=4 l=0).connect claims to be 192.168.21.134:6828/20733 not 192.168.21.134:6828/20837 - wrong node!

Hope this helps

2013/3/6 Greg Farnum <g...@inktank.com>

> On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> > Hi,
> > I've collected an osd log with these parameters:
> >
> > debug osd = 20
> > debug ms = 1
> > debug filestore = 20
> >
> > You can download it from here:
> > https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
> >
> > I have also captured a video to show the behavior in realtime:
> > http://youtu.be/708AI8PGy7k
>
> Ah, this is interesting — the ceph-osd processes are using up the time,
> not the filesystem or something. However, I don't see any reason for that
> in a brief look at the OSD log here — can you describe what you did to the
> OSD during that logging period? (In particular I see a lot of pg_log
> messages, but not the sub op messages that would be associated with this
> OSD doing a deep scrub, nor the internal heartbeat timeouts that the other
> OSDs were generating.)
> -Greg
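P.S. For reference, this is roughly the remove/re-add sequence I followed for osd.0, written from memory and based on the standard manual procedure; the exact syntax (especially the "crush set" and auth lines) may differ a bit on 0.56/0.57, and the hostname/weight below are just placeholders:

    # stop and remove the old osd.0
    service ceph stop osd.0
    ceph osd out 0
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0

    # wipe and remount the device
    umount /var/lib/ceph/osd/ceph-0
    mkfs.xfs -f /dev/sda1
    mount /dev/sda1 /var/lib/ceph/osd/ceph-0

    # re-create and re-register the osd, then start it
    ceph osd create
    ceph-osd -i 0 --mkfs --mkkey
    ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
    ceph osd crush set 0 osd.0 1.0 pool=default host=<myhostname>
    service ceph start osd.0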