Good data point on the monitors not trimming while non-active+clean PGs are present. So am I reading this correctly? It grew to ~32 GB? Did it end up growing beyond that, and what was the max? Also, is only ~18 PGs per OSD a reasonable number of PGs per OSD? I would think roughly quadruple that would be ideal. Is this an artifact of a steadily growing cluster or a design choice? A rough back-of-the-envelope for where I got that figure is below, summing the PG counts from the status excerpt quoted further down and assuming size=3 replicated pools (the pool size isn't stated in the thread); the pgs_per_osd helper is just illustrative, not anything from Wido's setup:
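# Rough check of the "~18 PGs per OSD" figure, using numbers from the
# quoted `ceph status` output. Assumption: replicated pools with size=3.

def pgs_per_osd(total_pgs, num_osds, replica_size=3):
    """Return (raw PGs / OSD, PG replicas / OSD)."""
    raw = total_pgs / num_osds
    with_replicas = total_pgs * replica_size / num_osds
    return raw, with_replicas

# ~40,569 PGs summed from the PG-state listing below, 2175 OSDs.
raw, per_osd = pgs_per_osd(40_569, 2_175, replica_size=3)
print(f"{raw:.1f} PGs/OSD ignoring replicas, ~{per_osd:.0f} PG replicas per OSD")
# -> roughly 18.7 raw and ~56 counting replicas; "about quadruple" the raw
#    figure would land near 75, closer to the often-quoted sizing target of
#    on the order of 100 PG replicas per OSD.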
On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander <w...@42on.com> wrote:
> Hi,
>
> I just wanted to inform people about the fact that Monitor databases can
> grow quite big when you have a large cluster which is performing a very
> long rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both, but
> you'll see this sooner on a cluster with a lot of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and this is
> causing a lot of PGs to backfill at the moment:
>
>  33488 active+clean
>   4802 active+undersized+degraded+remapped+backfill_wait
>   1670 active+remapped+backfill_wait
>    263 active+undersized+degraded+remapped+backfilling
>    250 active+recovery_wait+degraded
>     54 active+recovery_wait+degraded+remapped
>     27 active+remapped+backfilling
>     13 active+recovery_wait+undersized+degraded+remapped
>      2 active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a lot of disk space
>   mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>   mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>   mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>   mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>   mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected as MONs do not trim their store if one or more PGs
> is not active+clean.
>
> In this case we expected this and the MONs are each running on a 1TB Intel
> DC-series SSD to make sure we do not run out of space before the backfill
> finishes.
>
> The cluster is spread out over racks and in CRUSH we replicate over racks.
> Rack by rack we are wiping/destroying the OSDs and bringing them back as
> BlueStore OSDs and letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we start
> with the next rack.
>
> I just want to warn and inform people about this. Under normal
> circumstances a MON database isn't that big, but if you have a very long
> period of backfills/recoveries and also have a large number of OSDs you'll
> see the DB grow quite big.
>
> This has improved significantly going to Jewel and Luminous, but it is
> still something to watch out for.
>
> Make sure your MONs have enough free space to handle this!
>
> Wido
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
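In case it is useful to anyone doing the same rack-by-rack conversion, here is a minimal sketch of the "wait until everything is active+clean before the next rack" step Wido describes. The JSON field names are what Luminous-era `ceph status --format json` emits and the mon data path is the default one; treat both as assumptions and verify against your own cluster, and run it on (or adapt the store-size check for) one of the mon hosts.

#!/usr/bin/env python3
"""Sketch: poll the cluster until every PG is active+clean, so the mons
can trim their stores before the next rack of OSDs is converted."""
import json
import subprocess
import time

def pg_summary():
    # Parse `ceph status --format json`; pgmap carries per-state PG counts.
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    return clean, pgmap["num_pgs"]

def mon_store_mb(path="/var/lib/ceph/mon"):
    # Default mon data dir; the store.db lives underneath it. Only works
    # when run on a mon host.
    out = subprocess.check_output(["du", "-sm", path])
    return int(out.split()[0].decode())

while True:
    clean, total = pg_summary()
    print(f"{clean}/{total} PGs active+clean, local mon store ~{mon_store_mb()} MB")
    if clean == total:
        print("All PGs active+clean; mon stores can trim before the next rack.")
        break
    time.sleep(60)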