Hello everyone,

So: we have a Mimic cluster (on the most recent Mimic release) with 3 mons and
8 data nodes (160 OSDs in total).

Recently, we had to physically migrate the cluster to a different location,
and had to do this in one go (partly because the new location does not
currently have direct network routes to the old one, so doing this server
by server would not have been possible).
The setup at the new site preserved the IP addresses and hostnames of all of
the servers.

We followed the instructions here:
https://ceph.io/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/
to bring the system into a stable state for migration.
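If I remember the guide correctly, the relevant part is setting the
recovery/rebalance flags before the shutdown and clearing them again
afterwards, roughly:

    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause

with the matching "ceph osd unset ..." commands (in reverse order) when
bringing the cluster back up.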

When we brought the system up again (again following the above instructions),
it came back in a weird state:

ceph health detail gives:


HEALTH_ERR 8862594/10690030 objects misplaced (82.905%); Degraded data redundancy: 571553/10690030 objects degraded (5.347%), 518 pgs degraded, 66 pgs undersized; Degraded data redundancy (low space): 30 pgs backfill_toofull; application not enabled on 3 pool(s)
OBJECT_MISPLACED 8862594/10690030 objects misplaced (82.905%)
PG_DEGRADED Degraded data redundancy: 571553/10690030 objects degraded (5.347%), 518 pgs degraded, 66 pgs undersized
    pg 11.70e is active+recovery_wait+degraded, acting [143,27,50,87,45,84,98,88,140,144]
    pg 11.711 is active+recovery_wait+degraded, acting [124,152,71,146,116,158,118,138,84,137]
    pg 11.712 is active+recovery_wait+degraded, acting [37,115,1,70,47,148,116,12,23,51]

    (snip a lot more pg 11.xxx entries which are in this state)

PG_DEGRADED_FULL Degraded data redundancy (low space): 30 pgs backfill_toofull
    pg 12.4e is active+remapped+backfill_wait+backfill_toofull, acting [103,49,81,111,86,33,7,109,65,60]
    pg 12.6b is active+remapped+backfill_wait+backfill_toofull, acting [130,101,5,45,40,9,93,119,128,145]
    pg 12.6f is active+remapped+backfill_wait+backfill_toofull, acting [99,69,18,86,28,3,100,159,127,80]
    pg 12.88 is active+remapped+backfill_wait+backfill_toofull, acting [102,20,37,150,12,135,149,18,159,10]
    pg 12.8a is active+remapped+backfill_wait+backfill_toofull, acting [144,39,157,145,4,153,129,100,150,131]

    (snip a lot more pg 12.xxx entries in this state)


Confusingly, on the surface the cluster seems perfectly happy with the OSDs:

ceph osd status
+-----+-----------------------+-------+-------+--------+---------+--------+---------+-----------+
|  id |          host         |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+-----------------------+-------+-------+--------+---------+--------+---------+-----------+
|  0  | localhost.localdomain |  518G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  1  | localhost.localdomain |  519G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  2  | localhost.localdomain |  513G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  3  | localhost.localdomain |  520G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  4  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  5  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  6  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  7  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  8  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  9  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  10 | localhost.localdomain |  518G | 10.1T |    0   |     0   |    0   |     0   | exists,up |

and all of the OSDs are between 513G and 526G used, so barely full, and all
are marked as "exists,up"; none of them are reporting any issues.
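If it would help, I can also post the per-OSD utilisation across the CRUSH
tree and a query of one of the affected pgs, e.g.:

    ceph osd df tree
    ceph pg 12.4e query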

So: what has happened to the cluster, and how do I fix it? (How can pgs think
their backfill target is too full, when all the OSDs are more than 90% empty?)
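One thing I still plan to check for the backfill_toofull part is whether the
full/backfillfull/nearfull ratios have ended up set to something odd, e.g.
with:

    ceph osd dump | grep ratio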

Any help understanding this would be appreciated.

Sam