On 26.02.2018 at 23:15, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de> wrote:
>
> > > > The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
> > > > So necessarily, there is one shard for each host. If one host goes down for a
> > > > prolonged time, there's no "logical" advantage of redistributing things -
> > > > since whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.
> > > >
> > > > However, I noticed Ceph is remapping all PGs, and actively moving data.
> > > > I presume now this is done for two reasons:
> > > > - The remapping is needed since the primary OSD might be the one which went
> > > >   down. But for remapping (I guess) there's no need to actually move data,
> > > >   or is there?
> > > > - The data movement is done to have the "k" shards available.
> > > > If it's really the case that "all shards are equal", then data movement
> > > > should not occur - or is this a bug / bad feature?
> > >
> > > If you lose one OSD out of a host, Ceph is going to try and re-replicate the
> > > data onto the other OSDs in that host. Your PG size and the CRUSH rule
> > > instructs it that the PG needs 6 different OSDs, and those OSDs need to be
> > > placed on different hosts.
> > >
> > > You're right that gets very funny if your PG size is equal to the number of
> > > hosts. We generally discourage people from running configurations like that.
> >
> > Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would be our
> > starting point - since we may add more hosts later (not too soon-ish, but it's
> > not excluded more may come in a year or so), and migrating large EC pools to
> > different settings still seems a bit messy.
> > We can't really afford to reduce available storage significantly more in the
> > current setup, and would like to have the possibility to lose one host (for
> > example for an OS upgrade), and then still lose a few disks in case they fail
> > with bad timing.
> >
> > > Or if you mean that you are losing a host, and the data is shuffling around
> > > on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools'
> > > "indep" rather than "firstn" crush rules?)
> >
> > They are indep, which I think is the default (no manual editing done). I
> > thought the main goal of indep was exactly to reduce data movement.
> > Indeed, it's very funny that data is moved, it certainly does not help to
> > increase redundancy ;-).
> >
> > <snip>
> >
> > > Can you also share the output of "ceph osd crush dump"?
> >
> > Attached.
>
> Yep, that all looks simple enough.
>
> Do you have any "ceph -s" or other records from when this was occurring? Is it
> actually deleting or migrating any of the existing shards, or is it just that
> the shards which were previously on the out'ed OSDs are now getting copied
> onto the remaining ones?
>
> I think I finally understand what's happening here but would like to be sure. :)
> -Greg
>
> (In short: certain straws were previously mapping onto osd.[outed], but now
> they map onto the remaining OSDs. Because everything's independent, the actual
> CRUSH mapping for any shard other than the last is now going to map onto a
> remaining OSD, which would displace the shard it already holds. But the
> previously-present shard is going to remain "remapped" there because it can't
> map successfully. So if you lose osd.5, you'll go from a CRUSH mapping like
> [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 will both
> be on OSD 4.)
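Side note, in case someone wants to look at this effect on a single PG: a rough
sketch of how one could compare CRUSH's current mapping with the set actually
serving the PG (the PG id here is just borrowed from the health output further
down in this mail):

    # "up" is what CRUSH maps right now (an unmapped slot is rendered as
    # 2147483647), "acting" is the OSD set currently serving the PG while
    # recovery/backfill is in progress.
    ceph pg map 2.7cd

    # The full per-PG query additionally shows recovery/backfill state, which
    # helps to tell a shard that is merely "remapped" from one being rebuilt.
    ceph pg 2.7cd query | less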
Interesting! This would also mean that space usage on the remaining active OSDs
would increase noticeably in our setup - the out'ed host's sixth of the data
gets respread across the remaining five hosts, i.e. roughly 20 % more per OSD -
which is significant. So that's another good reason to use
mon_osd_down_out_subtree_limit=host, or to just set "ceph osd set noout" when
actively reinstalling a host (see the sketch at the end of this mail).

I reproduced this just now. Here's what I see (ignore the inconsistent PG -
that's unrelated and most likely a leftover of earlier OSD OOM issues):

# ceph -s
  cluster:
    id:     69b1fbe5-f084-4410-a99a-ab57417e7846
    health: HEALTH_ERR
            41569430/513248666 objects misplaced (8.099%)
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 105575103/513248666 objects degraded (20.570%), 2176 pgs degraded, 985 pgs undersized

  services:
    mon: 3 daemons, quorum mon003,mon001,mon002
    mgr: mon002(active), standbys: mon001, mon003
    mds: cephfs_baf-1/1/1 up {0=mon002=up:active}, 1 up:standby-replay, 1 up:standby
    osd: 196 osds: 164 up, 164 in; 1166 remapped pgs

  data:
    pools:   2 pools, 2176 pgs
    objects: 89370k objects, 4488 GB
    usage:   29546 GB used, 555 TB / 584 TB avail
    pgs:     105575103/513248666 objects degraded (20.570%)
             41569430/513248666 objects misplaced (8.099%)
             1166 active+undersized+degraded+remapped+backfilling
             1009 active+undersized+degraded
                1 active+undersized+degraded+inconsistent

  io:
    client:   6784 kB/s rd, 6820 kB/s wr, 804 op/s rd, 1174 op/s wr
    recovery: 79333 kB/s, 27 keys/s, 1080 objects/s

In "ceph health detail", I see:

pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting [91,63,33,163,2147483647,103]
pg 2.7ce is stuck undersized for 114.063431, current state active+undersized+degraded+remapped+backfilling, last acting [31,121,157,2147483647,61,87]
pg 2.7cf is stuck undersized for 110.842287, current state active+undersized+degraded+remapped+backfilling, last acting [163,36,2147483647,21,124,69]
pg 2.7d0 is stuck undersized for 118.876276, current state active+undersized+degraded+remapped+backfilling, last acting [140,91,66,22,2147483647,112]
pg 2.7d1 is stuck undersized for 388.377010, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
pg 2.7d2 is stuck undersized for 111.265718, current state active+undersized+degraded+remapped+backfilling, last acting [54,125,2147483647,157,88,21]
pg 2.7d3 is stuck undersized for 105.885607, current state active+undersized+degraded+remapped+backfilling, last acting [20,117,96,2147483647,144,54]
pg 2.7d4 is stuck undersized for 112.693680, current state active+undersized+degraded+remapped+backfilling, last acting [105,145,71,60,2147483647,13]
pg 2.7d5 is stuck undersized for 388.337919, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
[...]
While the host's OSDs were only down, but still in, I instead saw:

pg 2.7cd is active+undersized+degraded, acting [91,63,33,163,2147483647,103]
pg 2.7ce is stuck undersized for 145.507311, current state active+undersized+degraded, last acting [31,121,157,2147483647,61,87]
pg 2.7cf is stuck undersized for 143.293067, current state active+undersized+degraded, last acting [163,36,2147483647,21,124,69]
pg 2.7d0 is stuck undersized for 145.461503, current state active+undersized+degraded, last acting [140,91,66,22,2147483647,112]
pg 2.7d1 is stuck undersized for 145.496089, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
pg 2.7d2 is stuck undersized for 145.513296, current state active+undersized+degraded, last acting [54,125,2147483647,157,88,21]
pg 2.7d3 is stuck undersized for 145.503361, current state active+undersized+degraded, last acting [20,117,96,2147483647,144,54]
pg 2.7d4 is stuck undersized for 145.484259, current state active+undersized+degraded, last acting [105,145,71,60,2147483647,13]
pg 2.7d5 is stuck undersized for 145.456998, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]

Does this match expectations?

Cheers and many thanks!
Oliver
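P.S.: For the record, the workaround mentioned above would look roughly like
this - a sketch, not a tested recipe, so double-check against the docs of the
release in use:

    # Before taking a whole host down for reinstallation: stop the cluster
    # from marking down OSDs "out" (and thus from backfilling their shards
    # onto the remaining hosts).
    ceph osd set noout

    # ... reinstall / reboot the host, bring its OSDs back up ...

    # Re-enable normal down -> out handling afterwards.
    ceph osd unset noout

    # Alternatively (permanent): never mark OSDs out automatically when an
    # entire host goes down, by setting this in the [mon] section of ceph.conf
    # on the monitors:
    #
    #   mon_osd_down_out_subtree_limit = host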