On Tue, 4 Jun 2013, Nigel Williams wrote:
> On 4/06/2013 9:16 AM, Chen, Xiaoxi wrote:
> > My $0.02: you really don't need to wait for HEALTH_OK between your
> > recovery steps, just go ahead. Every time a new map is generated and
> > broadcast, the old map and any in-progress recovery will be canceled.
> thanks Xiaoxi, that is helpful to know.
> It seems to me that there might be a failure mode (or race condition?)
> here though: the cluster is now struggling to recover because the
> replacement OSD caused it to go into backfill_toofull.
> The failure sequence might be:
> 1. From HEALTH_OK, crash an OSD
> 2. Wait for recovery
> 3. Remove OSD using usual procedures
> 4. Wait for recovery
> 5. Add back OSD using usual procedures
> 6. Wait for recovery
> 7. Cluster is unable to recover due to toofull conditions
> Perhaps this is a needed test case to round-trip a cluster through a
> known failure/recovery scenario.
> Note this is using a simplistically configured test-cluster with CephFS
> in the mix and about 2.5 million files.
> Something else I noticed: I restarted the cluster (and set the leveldb
> compact option since I'd run out of space on the roots) and now I see it
> is again making progress on the backfill. It seems odd that the cluster
> pauses but a restart clears the pause. Is that by design?

Does the monitor data directory share a disk with an OSD?  If so, that 
makes sense: compaction freed enough space to drop below the threshold...
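
A quick way to check for a shared disk is sketched below in Python. This is only an illustration, not a Ceph tool: the helper names are made up, and the paths /var/lib/ceph/mon and /var/lib/ceph/osd are assumed default data locations that may differ on your cluster. It compares the device IDs of the directories and prints how full each filesystem is.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both paths are on the same device (equal st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    # Assumed default locations; adjust to your "mon data" / "osd data" settings.
    mon_root = "/var/lib/ceph/mon"
    osd_root = "/var/lib/ceph/osd"
    if os.path.isdir(mon_root) and os.path.isdir(osd_root):
        print(f"mon root is {used_fraction(mon_root):.0%} full")
        # Each OSD is usually its own subdirectory (often its own mount point).
        for name in sorted(os.listdir(osd_root)):
            osd_dir = os.path.join(osd_root, name)
            shared = same_filesystem(mon_root, osd_dir)
            print(f"{osd_dir}: {used_fraction(osd_dir):.0%} full, "
                  f"shares a filesystem with mon root: {shared}")
    else:
        print("Ceph data directories not found at the assumed paths")
```

If an OSD directory reports the same filesystem as the monitor root, growth of the monitor's store eats into that OSD's free space, and compacting it would give some of that space back.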
