So several events unfolded that may have led to this situation. Some of
them, in hindsight, were probably not the smartest decisions, in
particular adjusting the EC pool and restarting the OSDs several times
during these migrations.


   1. Added a new 6th OSD with ceph-ansible
      1. The run hung during the restart of the OSDs because they were set
      to noup, and one of the original OSDs wouldn't come back online while
      the flag was set. Manually unset noup (first command sketch after
      this list) and all 6 OSDs went up/in.
   2. Objects started showing as degraded/misplaced.
   3. Saw strange behavior when restarting one OSD at a time and waiting
   for it to stabilize: depending on which OSD was restarted last,
   different backfill or move operations resulted.
   4. Adjusted the recovery/backfill sleep and concurrent-move settings to
   speed up relocation (sketch below).
   5. Decided that if all the data was going to move anyway, I should
   adjust my jerasure EC profile from k=4, m=1 to k=5, m=1 with --force
   (is this even recommended vs. just creating new pools??? see the
   sketch after this list)
      1. Initially this reset crush-device-class from hdd to blank.
      2. Re-set crush-device-class to hdd.
      3. Couldn't determine whether this had any effect on the move
      operations.
      4. Changed back to k=4.
   6. Let some of the backfill work through; ran into toofull situations
   even though the OSDs had plenty of space.
      1. Decided to increase the EC pool's PG count from 64 to 150
      (sketch below).
   7. Restarted one OSD at a time again, waiting for each to be healthy
   before moving on (probably should have been setting noout first;
   sketch below).
   8. Eventually one of the old OSDs refused to start due to a thread
   abort relating to stripe size (see the gist linked below).
   9. Tried restarting the other OSDs; they all came back online fine.
   10. Some time passed, and then the new OSD crashed and wouldn't start
   back up, failing with the same stripe-size abort.
      1. Now 2 OSDs are down and won't start back up due to that same
      condition, and data is no longer available.
      2. 149 PGs are showing as incomplete due to min_size 5 (shouldn't it
      be 1, given the original/new EC profile settings? see the min_size
      check below).
      3. 1 PG is down.
      4. 21 PGs are unknown.
      5. Some of the PGs were still "new" PGs from increasing the pool's
      PG count.
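
For reference, here are rough sketches of the commands involved at each
step; names, IDs, and values are illustrative rather than exactly what I
ran.

Step 1.1, clearing the noup flag so the booting OSDs could be marked up:

   # clear the cluster-wide noup flag, then confirm all OSDs are up/in
   ceph osd unset noup
   ceph osd tree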
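
Step 4, the recovery/backfill tuning was along these lines (values
illustrative; injectargs only changes them at runtime):

   # lower the recovery sleep and raise concurrent backfills on all OSDs
   ceph tell osd.* injectargs '--osd_recovery_sleep 0 --osd_max_backfills 2 --osd_recovery_max_active 4'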
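
Step 5, overwriting the existing jerasure profile in place (profile name
myprofile is illustrative):

   # --force is required to modify an existing profile; as far as I can
   # tell this does NOT re-stripe data already written with the old k/m
   ceph osd erasure-code-profile set myprofile k=5 m=1 crush-device-class=hdd --force
   ceph osd erasure-code-profile get myprofile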
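
Step 6.1, bumping the PG count on the EC pool (pool name ecpool is
illustrative):

   # raise the placement group count and the effective placement count
   ceph osd pool set ecpool pg_num 150
   ceph osd pool set ecpool pgp_num 150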
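
Step 7, what I probably should have been doing around each planned
restart (OSD id illustrative, assuming systemd units):

   # keep CRUSH from marking the OSD out and rebalancing while it is
   # deliberately down
   ceph osd set noout
   systemctl restart ceph-osd@3
   ceph osd unset noout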
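
Step 10.2, checking where the min_size of 5 comes from (pool name
illustrative); my understanding is that for an EC pool min_size is tied
to k, since at least k shards are needed to serve reads, so it couldn't
meaningfully be 1:

   ceph osd pool get ecpool min_size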

So yeah, somewhat of a cluster of changing too many things at once here,
but I didn't realize the things I was doing could potentially have this
result.

The two OSDs that won't start should still have all of their data. They
seem to be having trouble with at least one PG in particular from the EC
pool that was adjusted, but presumably the rest of the data is fine, and
hopefully there is a way to get them to start up again. I saw a similar
issue posted on the list a few years ago, but there was never any
follow-up from the user having it.
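
If it comes to salvaging, I assume something like ceph-objectstore-tool
could at least export the affected PG shards from the down OSDs while
they are stopped (data path, pgid, and shard here are illustrative):

   # with the OSD process stopped, export one PG shard to a file as a
   # backup before attempting anything more invasive
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
       --pgid 9.1as0 --op export --file /tmp/9.1as0.export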

https://gist.github.com/arodd/c95355a7b55f3e4a94f21bc5e801943d