Hi,
We run 3 production clusters in a multi-site setup. They were deployed with
ceph-ansible, but we recently switched to cephadm while on the Pacific release,
and shortly after migrating to cephadm they were upgraded to Quincy. Since
moving to Quincy, recovery on one of the replica sites has tanked quite
severely. We use 8 TB SAS HDDs for OSDs, with the WAL and DB on NVMe drives.
Before the upgrade it would take 1-2 days to resilver an OSD, but we recently
replaced a drive and it took 6 days to resilver. Nothing else has changed in
the cluster's configuration, and as far as we can see the other two clusters
are performing recovery fine. Does anyone have any ideas what the issue could
be, or where we can check what is going on?
Also, while recovering, we've noticed inaccurate recovery information in
ceph -s. We've seen the recovery section of ceph -s report that it's running at
less than 10 keys/s, but watching the Degraded data redundancy counter we see
it drop by multiple hundreds of keys per second. Does anyone have any advice
they can offer on this too, please?
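For reference, here is the rough sanity check behind that claim: diffing the
degraded-object counter between two `ceph -s` samples taken ~3 s apart (the
counts are copied from the first two samples pasted below; the 3 s interval
ignores the runtime of `ceph -s` itself, so the real rate may differ a bit):

```shell
# Rough drain-rate cross-check from two `ceph -s` samples ~3 s apart.
# Degraded-object counts copied from the output pasted below.
d1=59610     # sample 1: objects degraded
d2=59116     # sample 2: objects degraded
interval=3   # seconds between samples (ignoring command runtime)

rate=$(( (d1 - d2) / interval ))
echo "observed drain rate: ~${rate} objects/s"
# ...versus the 2-3 objects/s and <10 keys/s that `ceph -s` itself reports.
```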
Cheers
Iain
[ceph: root@gb4-li-cephgw-001 /]# ceph -s; sleep 3; ceph -s
  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 59610/371260626 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     59610/371260626 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   249 MiB/s rd, 611 KiB/s wr, 355 op/s rd, 402 op/s wr
    recovery: 13 KiB/s, 3 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 59116/371260644 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     59116/371260644 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   258 MiB/s rd, 595 KiB/s wr, 346 op/s rd, 387 op/s wr
    recovery: 15 KiB/s, 2 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

[ceph: root@gb4-li-cephgw-001 /]# ceph -s; sleep 3; ceph -s
  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 58503/371260638 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     58503/371260638 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   245 MiB/s rd, 278 KiB/s wr, 247 op/s rd, 183 op/s wr
    recovery: 16 KiB/s, 2 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 58157/371260644 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     58157/371260644 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   243 MiB/s rd, 285 KiB/s wr, 252 op/s rd, 197 op/s wr
    recovery: 13 KiB/s, 0 keys/s, 1 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

[ceph: root@gb4-li-cephgw-001 /]#
Iain Stott
OpenStack Engineer
[email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]