On 26.02.2018 at 23:15, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de> wrote:
>
> > > > The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
> > > > So necessarily, there is one shard for each host. If one host goes down for a
> > > > prolonged time, there's no "logical" advantage of redistributing things -
> > > > since whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.
> > > >
> > > > However, I noticed Ceph is remapping all PGs, and actively moving data.
> > > > I presume now this is done for two reasons:
> > > > - The remapping is needed since the primary OSD might be the one which went
> > > >   down. But for remapping (I guess) there's no need to actually move data,
> > > >   or is there?
> > > > - The data movement is done to have the "k" shards available.
> > > > If it's really the case that "all shards are equal", then data movement
> > > > should not occur - or is this a bug / bad feature?
> > >
> > > If you lose one OSD out of a host, Ceph is going to try and re-replicate the
> > > data onto the other OSDs in that host. Your PG size and the CRUSH rule
> > > instructs it that the PG needs 6 different OSDs, and those OSDs need to be
> > > placed on different hosts.
> > >
> > > You're right that gets very funny if your PG size is equal to the number of
> > > hosts. We generally discourage people from running configurations like that.
> >
> > Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would be our
> > starting point - since we may add more hosts later (not too soon-ish, but it's
> > not excluded more may come in a year or so), and migrating large EC pools to
> > different settings still seems a bit messy.
> > We can't really afford to reduce available storage significantly more in the
> > current setup, and would like to have the possibility to lose one host (for
> > example for an OS upgrade), and then still lose a few disks in case they fail
> > with bad timing.
> >
> > > Or if you mean that you are losing a host, and the data is shuffling around
> > > on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools'
> > > "indep" rather than "firstn" crush rules?)
> >
> > They are indep, which I think is the default (no manual editing done). I
> > thought the main goal of indep was exactly to reduce data movement.
> > Indeed, it's very funny that data is moved, it certainly does not help to
> > increase redundancy ;-).
> >
> > <snip>
> >
> > > Can you also share the output of "ceph osd crush dump"?
> >
> > Attached.
>
> Yep, that all looks simple enough.
>
> Do you have any "ceph -s" or other records from when this was occurring? Is it
> actually deleting or migrating any of the existing shards, or is it just that
> the shards which were previously on the out'ed OSDs are now getting copied
> onto the remaining ones?
>
> I think I finally understand what's happening here but would like to be sure. :)
> -Greg
>
> (In short: certain straws were previously mapping onto osd.[outed], but now
> they map onto the remaining OSDs. Because everything's independent, the actual
> CRUSH mapping for any shard other than the last is now going to map onto a
> remaining OSD, which would displace the shard it already holds. But the
> previously-present shard is going to remain "remapped" there because it can't
> map successfully. So if you lose osd.5, you'll go from a CRUSH mapping like
> [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 will both
> be on OSD 4.)
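Side note, in case someone wants to look at this effect on a single PG: a rough
sketch of how one could compare CRUSH's current mapping with the set actually
serving the PG (the PG id here is just borrowed from the health output further
down in this mail):

    # "up" is what CRUSH maps right now (an unmapped slot is rendered as
    # 2147483647), "acting" is the OSD set currently serving the PG while
    # recovery/backfill is in progress.
    ceph pg map 2.7cd

    # The full per-PG query additionally shows recovery/backfill state, which
    # helps to tell a shard that is merely "remapped" from one being rebuilt.
    ceph pg 2.7cd query | less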
Interesting! This would also mean that space usage on the remaining active OSDs
would increase noticeably in our setup - the out'ed host's sixth of the data
gets respread across the remaining five hosts, i.e. roughly 20 % more per OSD -
which is significant. So that's another good reason to use
mon_osd_down_out_subtree_limit=host, or to just set "ceph osd set noout" when
actively reinstalling a host (see the sketch at the end of this mail).

I reproduced this just now. Here's what I see (ignore the inconsistent PG -
that's unrelated and most likely a leftover of earlier OSD OOM issues):

# ceph -s
  cluster:
    id:     69b1fbe5-f084-4410-a99a-ab57417e7846
    health: HEALTH_ERR
            41569430/513248666 objects misplaced (8.099%)
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 105575103/513248666 objects degraded (20.570%), 2176 pgs degraded, 985 pgs undersized

  services:
    mon: 3 daemons, quorum mon003,mon001,mon002
    mgr: mon002(active), standbys: mon001, mon003
    mds: cephfs_baf-1/1/1 up {0=mon002=up:active}, 1 up:standby-replay, 1 up:standby
    osd: 196 osds: 164 up, 164 in; 1166 remapped pgs

  data:
    pools:   2 pools, 2176 pgs
    objects: 89370k objects, 4488 GB
    usage:   29546 GB used, 555 TB / 584 TB avail
    pgs:     105575103/513248666 objects degraded (20.570%)
             41569430/513248666 objects misplaced (8.099%)
             1166 active+undersized+degraded+remapped+backfilling
             1009 active+undersized+degraded
                1 active+undersized+degraded+inconsistent

  io:
    client:   6784 kB/s rd, 6820 kB/s wr, 804 op/s rd, 1174 op/s wr
    recovery: 79333 kB/s, 27 keys/s, 1080 objects/s

In "ceph health detail", I see:

pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting [91,63,33,163,2147483647,103]
pg 2.7ce is stuck undersized for 114.063431, current state active+undersized+degraded+remapped+backfilling, last acting [31,121,157,2147483647,61,87]
pg 2.7cf is stuck undersized for 110.842287, current state active+undersized+degraded+remapped+backfilling, last acting [163,36,2147483647,21,124,69]
pg 2.7d0 is stuck undersized for 118.876276, current state active+undersized+degraded+remapped+backfilling, last acting [140,91,66,22,2147483647,112]
pg 2.7d1 is stuck undersized for 388.377010, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
pg 2.7d2 is stuck undersized for 111.265718, current state active+undersized+degraded+remapped+backfilling, last acting [54,125,2147483647,157,88,21]
pg 2.7d3 is stuck undersized for 105.885607, current state active+undersized+degraded+remapped+backfilling, last acting [20,117,96,2147483647,144,54]
pg 2.7d4 is stuck undersized for 112.693680, current state active+undersized+degraded+remapped+backfilling, last acting [105,145,71,60,2147483647,13]
pg 2.7d5 is stuck undersized for 388.337919, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
[...]
While the host's OSDs were only down, but still in, I instead saw:

pg 2.7cd is active+undersized+degraded, acting [91,63,33,163,2147483647,103]
pg 2.7ce is stuck undersized for 145.507311, current state active+undersized+degraded, last acting [31,121,157,2147483647,61,87]
pg 2.7cf is stuck undersized for 143.293067, current state active+undersized+degraded, last acting [163,36,2147483647,21,124,69]
pg 2.7d0 is stuck undersized for 145.461503, current state active+undersized+degraded, last acting [140,91,66,22,2147483647,112]
pg 2.7d1 is stuck undersized for 145.496089, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
pg 2.7d2 is stuck undersized for 145.513296, current state active+undersized+degraded, last acting [54,125,2147483647,157,88,21]
pg 2.7d3 is stuck undersized for 145.503361, current state active+undersized+degraded, last acting [20,117,96,2147483647,144,54]
pg 2.7d4 is stuck undersized for 145.484259, current state active+undersized+degraded, last acting [105,145,71,60,2147483647,13]
pg 2.7d5 is stuck undersized for 145.456998, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]

Does this match expectations?

Cheers and many thanks!
Oliver
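P.S.: For the record, the workaround mentioned above would look roughly like
this - a sketch, not a tested recipe, so double-check against the docs of the
release in use:

    # Before taking a whole host down for reinstallation: stop the cluster
    # from marking down OSDs "out" (and thus from backfilling their shards
    # onto the remaining hosts).
    ceph osd set noout

    # ... reinstall / reboot the host, bring its OSDs back up ...

    # Re-enable normal down -> out handling afterwards.
    ceph osd unset noout

    # Alternatively (permanent): never mark OSDs out automatically when an
    # entire host goes down, by setting this in the [mon] section of ceph.conf
    # on the monitors:
    #
    #   mon_osd_down_out_subtree_limit = host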