Hi!

We've found ourselves in a state with our ceph cluster that we haven't seen 
before, and are looking for a bit of expertise to chime in. We're running a 
(potentially unusually laid-out) moderately large luminous-based ceph cluster 
in a public cloud, with 234 x 8TB OSDs and a single osd per cloud instance. 
Here's a snippet of our ceph status:

services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon4
    mgr: ceph-mon2(active), standbys: ceph-mon1, ceph-mon4
    osd: 234 osds: 234 up, 231 in

data:
    pools: 5 pools, 7968 pgs
    objects: 136.35M objects, 382TiB
    usage: 1.09PiB used, 713TiB / 1.79PiB avail
    pgs: 7924 active+clean
            44 active+clean+scrubbing+deep

Our layout is spread across 6 availability zones, with the majority of osds in 
three (us-east-1a, us-east-1b, and us-east-1e). We've recently decided that a 
spread across six azs is unnecessary and potentially a detriment to 
performance, so we are working towards shifting our workload out of 3 of the 6 
azs, so that we end up evenly placed in the remaining 3.

Relevant recent events:
1. We expanded by 24 osds to handle additional capacity needs as well as to 
provide the capacity necessary to remove all osds in us-east-1f.
2. After the re-balance related to #1 finished, we expanded by an additional 12 
osds as a follow-on to the #1 change.
3. On the same day as #2, we also issued `ceph osd crush move` commands to move 
the location of the oldest 20 osds, which had previously not been configured 
with a "datacenter" bucket denoting their availability zone (a sketch of those 
commands is below).

The re-balancing related to #3 caused quite a change in our cluster, resulting 
in hours of degraded pgs and waves of "slow requests" from many of the osds as 
data shifted. Three of the osds whose location was moved are also wildly more 
full than the others (they are in a nearfull, >85% utilized state, while the 
other 231 osds sit in the range of 58-63% utilized). A `ceph balancer 
optimize` plan shows no changes to be made by the balancer to rectify this. 
Because of their nearfull status, we have marked those three osds as "out". 
During the resulting re-balancing we experienced more "slow requests" piling 
up, with some pgs dipping into a peering or activating state (which obviously 
causes some user-visible trouble).
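
In case it's relevant to the questions below: during these windows the knobs 
we would reach for to slow things down look roughly like this (illustrative 
values, not settings we currently run):

    # throttle backfill/recovery concurrency at runtime (values are illustrative)
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
    # or pause data movement entirely while staging changes...
    ceph osd set norebalance
    ceph osd set nobackfill
    # ...and re-enable it afterwards
    ceph osd unset nobackfill
    ceph osd unset norebalance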

Where we are headed:
* We have yet to set the crush map to use a replicated_datacenter rule now that 
our osds conform to these buckets (see the sketch after this list).
* We need to take action to remove/replace the three osds which were 
over-utilized.
* We need to remove the us-east-1f osds that provoked the event in #1 (of which 
there are 16).
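
The rule change and osd removals we have in mind look roughly like this (pool 
names and osd ids are placeholders; this is a sketch, not a finalized plan):

    # create a replicated rule with datacenter as the failure domain, then point pools at it
    ceph osd crush rule create-replicated replicated_datacenter default datacenter
    ceph osd pool set <pool> crush_rule replicated_datacenter
    # drain an osd (one of the over-utilized three, or a us-east-1f osd)
    ceph osd out <id>
    # ...wait for backfill to finish, stop the daemon on that instance, then:
    ceph osd purge <id> --yes-i-really-mean-it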

What I'm hoping for a bit of direction on:
* A recommendation on the order of these operations that would result in the 
least user-visible impact.
* Some hints or thoughts on why we are seeing flapping pg availability since 
the change in #3.
* General thoughts on our layout. It has been rather useful for unbounded rapid 
growth, but it seems very likely that the sheer count of osds/instances is no 
longer optimal at this point.

Obviously I'm happy to provide any additional information I've overlooked that 
might be helpful. Thanks so much for looking!
Steve