My first guess would be that PG overdose protection kicked in [1][2].

You can try fixing it by increasing the allowed number of PGs per OSD with

    ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
    ceph tell osd.* injectargs '--mon_max_pg_per_osd 500'

and then triggering a CRUSH algorithm update, for example by restarting an OSD.
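To see why the limit is a plausible suspect, here is a rough back-of-the-envelope sketch in Python. The pool has 512 PGs and the acting sets in the quoted output contain three OSDs each, so I assume a replica size of 3. The limit of 200 and the hard ratio of 2.0 are what I believe the Luminous defaults for mon_max_pg_per_osd and osd_max_pg_per_osd_hard_ratio to be; treat them as assumptions and check the actual values on your cluster. The function names are made up for illustration.

```python
def pg_replicas_per_osd(pg_num: int, size: int, num_osds: int) -> float:
    """Average number of PG replicas per OSD, assuming an even spread."""
    return pg_num * size / num_osds


def hard_limit(max_pg_per_osd: int, hard_ratio: float) -> float:
    """PG count above which an OSD is expected to refuse new PGs."""
    return max_pg_per_osd * hard_ratio


PG_NUM = 512                 # from the quoted message
SIZE = 3                     # assumption: inferred from 3-OSD acting sets
ASSUMED_MAX_PG_PER_OSD = 200  # assumption: believed Luminous default
ASSUMED_HARD_RATIO = 2.0      # assumption: believed Luminous default

if __name__ == "__main__":
    for num_osds in (5, 4):
        avg = pg_replicas_per_osd(PG_NUM, SIZE, num_osds)
        print(f"{num_osds} ssd OSDs: about {avg:.1f} PG replicas per OSD on average")

    for max_pg in (ASSUMED_MAX_PG_PER_OSD, 500):
        limit = hard_limit(max_pg, ASSUMED_HARD_RATIO)
        print(f"mon_max_pg_per_osd={max_pg}: hard limit {limit:.0f} PGs per OSD")
```

This is only an average. In the quoted health detail, osd.23 and osd.25 appear in every listed acting set, so if that holds for the whole pool those two OSDs would each carry up to all 512 PGs of it, well above the average.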
[1] https://ceph.com/community/new-luminous-pg-overdose-protection/
[2] https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/

2018-03-17 12:15 GMT+03:00 Nico Schottelius <nico.schottel...@ungleich.ch>:

>
> Good morning,
>
> some days ago we created a new pool with 512 pgs, and originally 5 osds.
> We use the device class "ssd" and a crush rule that maps all data for
> the pool "ssd" to the ssd device class osds.
>
> While creating, one of the ssds failed and we are left with 4 osds:
>
> [10:00:22] server2.place6:/var/log/ceph# ceph osd tree
> ID CLASS     WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
> -1           135.12505 root default
> -7            51.36911     host server2
> 15   hdd-big   9.09511         osd.15       up  1.00000 1.00000
> 20   hdd-big   9.09511         osd.20       up  1.00000 1.00000
> 21   hdd-big   9.09511         osd.21       up  1.00000 1.00000
>  7 hdd-small   4.54776         osd.7        up  1.00000 1.00000
>  8 hdd-small   4.54776         osd.8        up  1.00000 1.00000
> 10 hdd-small   4.54776         osd.10       up  1.00000 1.00000
> 26 hdd-small   4.54776         osd.26       up  1.00000 1.00000
> 14  notinuse   5.45741         osd.14       up  1.00000 1.00000
> 12       ssd   0.21767         osd.12       up  1.00000 1.00000
> 24       ssd   0.21767         osd.24       up  1.00000 1.00000
> -5            42.50967     host server3
>  9   hdd-big   9.09511         osd.9        up  1.00000 1.00000
> 16   hdd-big   9.09511         osd.16       up  1.00000 1.00000
> 19   hdd-big   9.09511         osd.19       up  1.00000 1.00000
>  3 hdd-small   4.54776         osd.3        up  1.00000 1.00000
>  5 hdd-small   4.54776         osd.5        up  1.00000 1.00000
>  6 hdd-small   4.54776         osd.6        up  1.00000 1.00000
> 11  notinuse   0.45424         osd.11       up  1.00000 1.00000
> 13  notinuse   0.90907         osd.13       up  1.00000 1.00000
> 25       ssd   0.21776         osd.25       up  1.00000 1.00000
> -2            41.24626     host server4
>  2   hdd-big   9.09511         osd.2        up  1.00000 1.00000
> 17   hdd-big   9.09511         osd.17       up  1.00000 1.00000
> 18   hdd-big   9.09511         osd.18       up  1.00000 1.00000
>  0 hdd-small   4.54776         osd.0        up  1.00000 1.00000
>  1 hdd-small   4.54776         osd.1        up  1.00000 1.00000
> 22 hdd-small   4.54776         osd.22       up  1.00000 1.00000
>  4  notinuse   0.09999         osd.4        up  1.00000 1.00000
> 23       ssd   0.21767         osd.23       up  1.00000 1.00000
> [10:04:27] server2.place6:/var/log/ceph#
>
> We first had about 160 pgs stuck in creating+activating. After
> restarting all osds in the ssd class one by one, it shifted to
> 100 activating and 60 creating+activating:
>
>
> [10:00:18] server2.place6:/var/log/ceph# ceph -s
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_ERR
>             1803200/13770981 objects misplaced (13.094%)
>             Reduced data availability: 175 pgs inactive
>             Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
>             39 slow requests are blocked > 32 sec
>             40 stuck requests are blocked > 4096 sec
>
>   services:
>     mon: 3 daemons, quorum black1,black2,black3
>     mgr: black3(active), standbys: black2, black1
>     osd: 27 osds: 27 up, 27 in; 156 remapped pgs
>
>   data:
>     pools:   2 pools, 1024 pgs
>     objects: 4482k objects, 17725 GB
>     usage:   55542 GB used, 83188 GB / 135 TB avail
>     pgs:     17.090% pgs not active
>              857547/13770981 objects degraded (6.227%)
>              1803200/13770981 objects misplaced (13.094%)
>              640 active+clean
>              105 active+undersized+degraded+remapped+backfill_wait
>              100 activating
>              60  creating+activating
>              50  active+recovery_wait+degraded
>              21  active+remapped+backfill_wait
>              16  active+recovery_wait+undersized+degraded+remapped
>              15  activating+degraded
>              9   active+recovery_wait+degraded+remapped
>              3   active+recovery_wait+remapped
>              3   active+recovery_wait
>              2   active+undersized+degraded+remapped+backfilling
>
>   io:
>     client:   519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
>     recovery: 1694 kB/s, 0 objects/s
>
> I looked into the archives, but did not find anything that directly
> related to our situation. We are using ceph 12.2.4.
>
> An excerpt from our ceph health detail looks like this:
>
> HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data
> availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981
> objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow
> requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
> OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
> PG_AVAILABILITY Reduced data availability: 175 pgs inactive
>     pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
>     pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
>     pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
>     pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
>     pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
>     pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
>     pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
>     pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
>
> where pool 7 is the ssd pool.
> > The pg query of 7.118 looks as follows: > > { > "state": "creating+activating", > "snap_trimq": "[1~5]", > "snap_trimq_len": 5, > "epoch": 5016, > "up": [ > 12, > 23, > 25 > ], > "acting": [ > 12, > 23, > 25 > ], > "actingbackfill": [ > "12", > "23", > "25" > ], > "info": { > "pgid": "7.118", > "last_update": "0'0", > "last_complete": "0'0", > "log_tail": "0'0", > "last_user_version": 0, > "last_backfill": "MAX", > "last_backfill_bitwise": 0, > "purged_snaps": [], > "history": { > "epoch_created": 4620, > "epoch_pool_created": 4620, > "last_epoch_started": 0, > "last_interval_started": 0, > "last_epoch_clean": 0, > "last_interval_clean": 0, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 4967, > "same_interval_since": 4967, > "same_primary_since": 4967, > "last_scrub": "0'0", > "last_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892" > }, > "stats": { > "version": "0'0", > "reported_seq": "406", > "reported_epoch": "5016", > "state": "creating+activating", > "last_fresh": "2018-03-17 10:12:58.380048", > "last_change": "2018-03-17 10:10:24.335405", > "last_active": "2018-03-15 07:18:46.197892", > "last_peered": "2018-03-15 07:18:46.197892", > "last_clean": "2018-03-15 07:18:46.197892", > "last_became_active": "0.000000", > "last_became_peered": "0.000000", > "last_unstale": "2018-03-17 10:12:58.380048", > "last_undegraded": "2018-03-17 10:12:58.380048", > "last_fullsized": "2018-03-17 10:12:58.380048", > "mapping_epoch": 4967, > "log_start": "0'0", > "ondisk_log_start": "0'0", > "created": 4620, > "last_epoch_clean": 0, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "0'0", > "last_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892", > "log_size": 0, > 
"ondisk_log_size": 0, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": false, > "snaptrimq_len": 5, > "stat_sum": { > "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 0, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 0, > "num_whiteouts": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, > "num_keys_recovered": 0, > "num_objects_omap": 0, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0, > "num_legacy_snapsets": 0 > }, > "up": [ > 12, > 23, > 25 > ], > "acting": [ > 12, > 23, > 25 > ], > "blocked_by": [], > "up_primary": 12, > "acting_primary": 12 > }, > "empty": 1, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 4968, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > }, > "peer_info": [ > { > "peer": "23", > "pgid": "7.118", > "last_update": "0'0", > "last_complete": "0'0", > "log_tail": "0'0", > "last_user_version": 0, > "last_backfill": "MAX", > "last_backfill_bitwise": 0, > "purged_snaps": [], > "history": { > "epoch_created": 0, > "epoch_pool_created": 0, > "last_epoch_started": 0, > "last_interval_started": 0, > "last_epoch_clean": 0, > "last_interval_clean": 0, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 0, > "same_interval_since": 0, > "same_primary_since": 0, > "last_scrub": 
"0'0", > "last_scrub_stamp": "0.000000", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "0.000000", > "last_clean_scrub_stamp": "0.000000" > }, > "stats": { > "version": "0'0", > "reported_seq": "0", > "reported_epoch": "0", > "state": "unknown", > "last_fresh": "0.000000", > "last_change": "0.000000", > "last_active": "0.000000", > "last_peered": "0.000000", > "last_clean": "0.000000", > "last_became_active": "0.000000", > "last_became_peered": "0.000000", > "last_unstale": "0.000000", > "last_undegraded": "0.000000", > "last_fullsized": "0.000000", > "mapping_epoch": 0, > "log_start": "0'0", > "ondisk_log_start": "0'0", > "created": 0, > "last_epoch_clean": 0, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "0'0", > "last_scrub_stamp": "0.000000", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "0.000000", > "last_clean_scrub_stamp": "0.000000", > "log_size": 0, > "ondisk_log_size": 0, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": false, > "snaptrimq_len": 0, > "stat_sum": { > "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 0, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 0, > "num_whiteouts": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, > "num_keys_recovered": 0, > "num_objects_omap": 0, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 
0, > "num_objects_pinned": 0, > "num_legacy_snapsets": 0 > }, > "up": [], > "acting": [], > "blocked_by": [], > "up_primary": -1, > "acting_primary": -1 > }, > "empty": 1, > "dne": 1, > "incomplete": 0, > "last_epoch_started": 0, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > }, > { > "peer": "24", > "pgid": "7.118", > "last_update": "0'0", > "last_complete": "0'0", > "log_tail": "0'0", > "last_user_version": 0, > "last_backfill": "MAX", > "last_backfill_bitwise": 0, > "purged_snaps": [], > "history": { > "epoch_created": 4620, > "epoch_pool_created": 4620, > "last_epoch_started": 0, > "last_interval_started": 0, > "last_epoch_clean": 0, > "last_interval_clean": 0, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 4967, > "same_interval_since": 4967, > "same_primary_since": 4967, > "last_scrub": "0'0", > "last_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892" > }, > "stats": { > "version": "0'0", > "reported_seq": "164", > "reported_epoch": "4769", > "state": "creating+remapped+peering", > "last_fresh": "2018-03-16 23:49:04.258780", > "last_change": "2018-03-16 23:49:03.296077", > "last_active": "2018-03-15 07:18:46.197892", > "last_peered": "2018-03-15 07:18:46.197892", > "last_clean": "2018-03-15 07:18:46.197892", > "last_became_active": "0.000000", > "last_became_peered": "0.000000", > "last_unstale": "2018-03-16 23:49:04.258780", > "last_undegraded": "2018-03-16 23:49:04.258780", > "last_fullsized": "2018-03-16 23:49:04.258780", > "mapping_epoch": 4967, > "log_start": "0'0", > "ondisk_log_start": "0'0", > "created": 4620, > "last_epoch_clean": 0, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "0'0", > "last_scrub_stamp": "2018-03-15 07:18:46.197892", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", > 
"last_clean_scrub_stamp": "2018-03-15 07:18:46.197892", > "log_size": 0, > "ondisk_log_size": 0, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": false, > "snaptrimq_len": 0, > "stat_sum": { > "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 0, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 0, > "num_whiteouts": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, > "num_keys_recovered": 0, > "num_objects_omap": 0, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > "num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0, > "num_legacy_snapsets": 0 > }, > "up": [ > 12, > 23, > 25 > ], > "acting": [ > 12, > 23, > 25 > ], > "blocked_by": [], > "up_primary": 12, > "acting_primary": 12 > }, > "empty": 1, > "dne": 0, > "incomplete": 0, > "last_epoch_started": 4769, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > }, > { > "peer": "25", > "pgid": "7.118", > "last_update": "0'0", > "last_complete": "0'0", > "log_tail": "0'0", > "last_user_version": 0, > "last_backfill": "MAX", > "last_backfill_bitwise": 0, > "purged_snaps": [], > "history": { > "epoch_created": 0, > "epoch_pool_created": 0, > "last_epoch_started": 0, > "last_interval_started": 0, > "last_epoch_clean": 0, > "last_interval_clean": 0, > "last_epoch_split": 0, > "last_epoch_marked_full": 0, > "same_up_since": 0, > 
"same_interval_since": 0, > "same_primary_since": 0, > "last_scrub": "0'0", > "last_scrub_stamp": "0.000000", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "0.000000", > "last_clean_scrub_stamp": "0.000000" > }, > "stats": { > "version": "0'0", > "reported_seq": "0", > "reported_epoch": "0", > "state": "unknown", > "last_fresh": "0.000000", > "last_change": "0.000000", > "last_active": "0.000000", > "last_peered": "0.000000", > "last_clean": "0.000000", > "last_became_active": "0.000000", > "last_became_peered": "0.000000", > "last_unstale": "0.000000", > "last_undegraded": "0.000000", > "last_fullsized": "0.000000", > "mapping_epoch": 0, > "log_start": "0'0", > "ondisk_log_start": "0'0", > "created": 0, > "last_epoch_clean": 0, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "0'0", > "last_scrub_stamp": "0.000000", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "0.000000", > "last_clean_scrub_stamp": "0.000000", > "log_size": 0, > "ondisk_log_size": 0, > "stats_invalid": false, > "dirty_stats_invalid": false, > "omap_stats_invalid": false, > "hitset_stats_invalid": false, > "hitset_bytes_stats_invalid": false, > "pin_stats_invalid": false, > "snaptrimq_len": 0, > "stat_sum": { > "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_missing": 0, > "num_objects_degraded": 0, > "num_objects_misplaced": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 0, > "num_whiteouts": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, > "num_keys_recovered": 0, > "num_objects_omap": 0, > "num_objects_hit_set_archive": 0, > "num_bytes_hit_set_archive": 0, > "num_flush": 0, > "num_flush_kb": 0, > "num_evict": 0, > "num_evict_kb": 0, > "num_promote": 0, > "num_flush_mode_high": 0, > 
"num_flush_mode_low": 0, > "num_evict_mode_some": 0, > "num_evict_mode_full": 0, > "num_objects_pinned": 0, > "num_legacy_snapsets": 0 > }, > "up": [], > "acting": [], > "blocked_by": [], > "up_primary": -1, > "acting_primary": -1 > }, > "empty": 1, > "dne": 1, > "incomplete": 0, > "last_epoch_started": 0, > "hit_set_history": { > "current_last_update": "0'0", > "history": [] > } > } > ], > "recovery_state": [ > { > "name": "Started/Primary/Active", > "enter_time": "2018-03-17 10:10:24.335124", > "might_have_unfound": [ > { > "osd": "24", > "status": "not queried" > } > ], > "recovery_progress": { > "backfill_targets": [], > "waiting_on_backfill": [], > "last_backfill_started": "MIN", > "backfill_info": { > "begin": "MIN", > "end": "MIN", > "objects": [] > }, > "peer_backfill_info": [], > "backfills_in_flight": [], > "recovering": [], > "pg_backend": { > "pull_from_peer": [], > "pushing": [] > } > }, > "scrub": { > "scrubber.epoch_start": "0", > "scrubber.active": false, > "scrubber.state": "INACTIVE", > "scrubber.start": "MIN", > "scrubber.end": "MIN", > "scrubber.subset_last_update": "0'0", > "scrubber.deep": false, > "scrubber.seed": 0, > "scrubber.waiting_on": 0, > "scrubber.waiting_on_whom": [] > } > }, > { > "name": "Started", > "enter_time": "2018-03-17 10:10:23.373097" > } > ], > "agent_state": {} > } > > > If anyone has a hint on why it is stuck in creation, it would be very > much appreciated. > > Best, > > Nico > > -- > Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >