On Sun, Nov 25, 2018 at 2:41 PM Stefan Kooman <ste...@bit.nl> wrote:
> Hi list,
>
> During cluster expansion (adding extra disks to existing hosts) some
> OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
> error (39) Directory not empty not handled on operation 21 (op 1,
> counting from 0), full details: https://8n1.org/14078/c534). We had
> "norebalance", "nobackfill", and "norecover" flags set. After we unset
> nobackfill and norecover (to let Ceph fix the degraded PGs) it would
> recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
> were supposed to have a copy of them, and they were already "probed". A
> day later (~24 hours) it would still not have recovered the degraded
> objects. After we unset the "norebalance" flag it would start
> rebalancing, backfilling and recovering. The 12 degraded objects were
> recovered.
>
> Is this expected behaviour? I would expect Ceph to always try to fix
> degraded things first and foremost. Even "pg force-recover" and "pg
> force-backfill" could not force recovery.
>
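For reference, the flags and per-PG commands being discussed map onto the
CLI roughly like this (a sketch only; the PG ID below is made up, and on
recent releases the per-PG commands are spelled force-recovery /
force-backfill):

  ceph osd set norebalance        # likewise nobackfill, norecover before expansion
  ceph osd unset nobackfill       # allow recovery of degraded PGs
  ceph osd unset norecover
  ceph pg 2.1a query              # inspect a stuck PG (example PG ID)
  ceph pg force-recovery 2.1a     # bump recovery priority for that PG
  ceph osd unset norebalance      # this is what got the last objects moving
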
I haven't dug into how the norebalance flag works, but I think this is
expected: it presumably prevents OSDs from creating new copies of PGs,
which is what needed to happen here (a quick way to check for this is
sketched below).
-Greg

> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
> | GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl
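A quick way to see whether a cluster-wide flag is what is holding recovery
back (a sketch, assuming a Luminous-or-later cluster):

  ceph osd dump | grep flags        # lists cluster-wide flags, e.g. norebalance
  ceph health detail                # shows which PGs are degraded/undersized
  ceph pg dump pgs_brief | grep -E 'degraded|remapped'   # PGs waiting to move or recover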