Hello all,

I have a Ceph Luminous cluster with a mix of filestore and bluestore OSDs. It was initially deployed as Hammer, then upgraded to Jewel and eventually to Luminous. The cluster is heterogeneous: it contains SSDs, 15K SAS drives and 7.2K HDDs (see the attached crush map). Earlier I converted the 7.2K HDDs from filestore to bluestore without any problem, but after converting two SSDs the same way I ended up with the warning shown below.
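For reference, the per-OSD conversion was done roughly along these lines (the osd id 12 and /dev/sdX are placeholders; this is a sketch from memory, not the exact command history):

# ceph osd out 12
# systemctl stop ceph-osd@12
# ceph osd destroy 12 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/sdX
# ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 12

After each OSD I waited for recovery to finish before moving on to the next one.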
The cluster is now in this state:

  cluster:
    id:     089d3673-5607-404d-9351-2d4004043966
    health: HEALTH_WARN
            Degraded data redundancy: 12566/4361616 objects degraded (0.288%), 6 pgs unclean, 6 pgs degraded, 6 pgs undersized
            10 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum 2,1,0
    mgr: tw-dwt-prx-03(active), standbys: tw-dwt-prx-05, tw-dwt-prx-07
    osd: 92 osds: 92 up, 92 in; 6 remapped pgs

  data:
    pools:   3 pools, 1024 pgs
    objects: 1419k objects, 5676 GB
    usage:   17077 GB used, 264 TB / 280 TB avail
    pgs:     12566/4361616 objects degraded (0.288%)
             1018 active+clean
             4    active+undersized+degraded+remapped+backfill_wait
             2    active+undersized+degraded+remapped+backfilling

  io:
    client: 1567 kB/s rd, 2274 kB/s wr, 67 op/s rd, 186 op/s wr

# rados df
POOL_NAME  USED  OBJECTS CLONES COPIES  MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS    WR
sas_sata   556G  142574  0      427722  0                  0       0        48972431 478G  207803733 3035G
sata_only  1939M 491     0      1473    0                  0       0        3302     5003k 17170     2108M
ssd_sata   5119G 1311028 0      3933084 0                  0       12549    46982011 2474G 620926839 24962G

total_objects    1454093
total_used       17080G
total_avail      264T
total_space      280T

# ceph pg dump_stuck
ok
PG_STAT STATE                                              UP        UP_PRIMARY ACTING  ACTING_PRIMARY
22.ac   active+undersized+degraded+remapped+backfilling    [6,28,62] 6          [28,62] 28
22.85   active+undersized+degraded+remapped+backfilling    [7,43,62] 7          [43,62] 43
22.146  active+undersized+degraded+remapped+backfill_wait  [7,48,46] 7          [46,48] 46
22.4f   active+undersized+degraded+remapped+backfill_wait  [7,59,58] 7          [58,59] 58
22.d8   active+undersized+degraded+remapped+backfill_wait  [7,48,46] 7          [46,48] 46
22.60   active+undersized+degraded+remapped+backfill_wait  [7,50,34] 7          [34,50] 34

The pool I have a problem with keeps its replicas on SSDs and 7.2K HDDs, with primary affinity set to 1 for the SSDs and 0 for the HDDs (see the P.S. at the end of this mail for how that was set). All clients have effectively ceased to operate, and the recovery speed is 1-2 objects per minute, at which rate it would take more than a week to recover the ~12,500 degraded objects. The other pools work fine. How can I speed up the recovery process?

Thank you,
Ignaqui
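P.S. Regarding the primary affinity mentioned above: it was set per OSD with commands roughly like the following (the osd ids here are placeholders; the real values follow the crush map). For the SSD OSDs:

# ceph osd primary-affinity osd.<ssd-id> 1.0

and for the 7.2K HDD OSDs:

# ceph osd primary-affinity osd.<hdd-id> 0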