The problem is that `ceph-osd --flush-journal` never ran successfully against the old SSD journal drive. All of the OSDs that used the dead journal need to be removed from the cluster, wiped, and added back in. The data on them is not 100% consistent because the old journal died. Any write that made it to the journal but not to the disk is lost.
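For reference, the removal half of that (out, stop, remove from CRUSH, delete the auth key, remove the OSD) can be sketched as a dry run that only prints the commands for review. This is a minimal sketch, not a tested procedure: the OSD ids are hypothetical placeholders (substitute the ids that actually used the dead journal), the function name is made up for illustration, and the `service ceph stop` form is an assumption based on the `service ceph start osd.20` command quoted further down the thread.

```python
def osd_removal_commands(osd_ids):
    """Build, in order, the commands to remove each OSD from the cluster.

    Nothing is executed here; the caller prints the list and reviews it.
    """
    commands = []
    for osd_id in osd_ids:
        commands.extend([
            f"ceph osd out {osd_id}",               # stop mapping data to the OSD
            f"service ceph stop osd.{osd_id}",      # assumed sysvinit-style stop
            f"ceph osd crush remove osd.{osd_id}",  # drop it from the CRUSH map
            f"ceph auth del osd.{osd_id}",          # delete its cephx key
            f"ceph osd rm {osd_id}",                # remove the OSD id itself
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical ids, for illustration only.
    for line in osd_removal_commands([20, 21]):
        print(line)
```

Wiping the disks and re-adding the OSDs is deliberately left out, since how you do that depends on the deployment tooling in use.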
On top of that, you chose to run with replica size = 2 and min_size = 1, which makes anything that happens in your cluster very dangerous for data loss. Seeing as you had 2 nodes fail so near each other, there is a very real possibility that you will have some data loss from this. Regardless, your first step is to remove the OSDs that were on the failed journal. They are poison in your cluster.

On Sun, Sep 2, 2018, 10:51 AM Lee <lqui...@gmail.com> wrote:

> I followed:
>
> $ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
> $ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' \
>     --partition-guid=1:$journal_uuid \
>     --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
>
> Then
>
> $ sudo ceph-osd --mkjournal -i 20
> $ sudo service ceph start osd.20
>
> From
> https://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
>
> After which they all started without a problem.
>
>
> On Sun, 2 Sep 2018 at 15:43, David Turner <drakonst...@gmail.com> wrote:
>
>> It looks like osds on the first failed node are having problems. What
>> commands did you run to bring it back online?
>>
>> On Sun, Sep 2, 2018, 10:27 AM Lee <lqui...@gmail.com> wrote:
>>
>>> Ok I have a lot in the health detail...
>>> >>> root@node31-a4:~# ceph health detail >>> HEALTH_ERR 64 pgs backfill; 27 pgs backfill_toofull; 39 pgs backfilling; >>> 26 pgs degraded; 4 pgs down; 31 pgs incomplete; 1 pgs inconsistent; 12 pgs >>> recovery_wait; 1 pgs stale; 26 pgs stuck degraded; 31 pgs stuck inactive; 1 >>> pgs stuck stale; 161 pgs stuck unclean; 9 pgs stuck undersized; 9 pgs >>> undersized; 726 requests are blocked > 32 sec; 9 osds have slow requests; >>> recovery 59636/5032695 objects degraded (1.185%); recovery 1280976/5032695 >>> objects misplaced (25.453%); 1 scrub errors; noscrub,nodeep-scrub flag(s) >>> set >>> pg 2.2a is stuck inactive for 97629.478505, current state incomplete, >>> last acting [24,5] >>> pg 2.b0 is stuck inactive for 98000.688979, current state incomplete, >>> last acting [24,7] >>> pg 9.42 is stuck inactive for 108836.103738, current state incomplete, >>> last acting [31,12] >>> pg 9.de is stuck inactive since forever, current state incomplete, last >>> acting [6,5] >>> pg 2.75 is stuck inactive since forever, current state down+incomplete, >>> last acting [7,15] >>> pg 9.dc is stuck inactive for 113491.800208, current state incomplete, >>> last acting [6,7] >>> pg 2.74 is stuck inactive for 97658.382960, current state incomplete, >>> last acting [13,5] >>> pg 9.1e is stuck inactive since forever, current state incomplete, last >>> acting [7,15] >>> pg 2.15 is stuck inactive since forever, current state incomplete, last >>> acting [7,31] >>> pg 11.1c is stuck inactive since forever, current state down+incomplete, >>> last acting [6,7] >>> pg 2.a1 is stuck inactive for 98785.888826, current state incomplete, >>> last acting [14,12] >>> pg 9.d8 is stuck inactive for 115082.575098, current state >>> down+incomplete, last acting [21,5] >>> pg 9.a8 is stuck inactive for 118575.035210, current state incomplete, >>> last acting [14,7] >>> pg 9.78 is stuck inactive since forever, current state incomplete, last >>> acting [5,24] >>> pg 2.a2 is stuck inactive since 
forever, current state incomplete, last >>> acting [5,13] >>> pg 7.16 is stuck inactive since forever, current state incomplete, last >>> acting [6,7] >>> pg 2.13 is stuck inactive since forever, current state incomplete, last >>> acting [7,10] >>> pg 9.f5 is stuck inactive for 103009.439003, current state incomplete, >>> last acting [18,5] >>> pg 2.d is stuck inactive since forever, current state incomplete, last >>> acting [5,10] >>> pg 9.5 is stuck inactive since forever, current state incomplete, last >>> acting [5,18] >>> pg 9.3 is stuck inactive since forever, current state incomplete, last >>> acting [7,15] >>> pg 9.fc is stuck inactive for 201476.092908, current state incomplete, >>> last acting [13,5] >>> pg 11.33 is stuck inactive since forever, current state down+incomplete, >>> last acting [7,6] >>> pg 9.3f is stuck inactive since forever, current state incomplete, last >>> acting [5,14] >>> pg 9.a is stuck inactive for 113328.467457, current state incomplete, >>> last acting [18,7] >>> pg 2.63 is stuck inactive for 97665.176520, current state incomplete, >>> last acting [31,7] >>> pg 2.3 is stuck inactive for 97655.279670, current state incomplete, >>> last acting [14,5] >>> pg 2.32 is stuck inactive since forever, current state incomplete, last >>> acting [5,13] >>> pg 2.bf is stuck inactive for 99913.875808, current state incomplete, >>> last acting [15,7] >>> pg 9.26 is stuck inactive since forever, current state incomplete, last >>> acting [5,24] >>> pg 9.22 is stuck inactive since forever, current state incomplete, last >>> acting [7,24] >>> pg 9.25 is stuck unclean for 20091.777921, current state >>> active+degraded+remapped+wait_backfill, last acting [15,2] >>> pg 7.2b is stuck unclean for 98830.660179, current state >>> stale+active+undersized+degraded, last acting [5] >>> pg 11.27 is stuck unclean for 1777813.502308, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [4,36] >>> pg 2.f1 is stuck unclean for 
26585.481715, current state >>> active+recovery_wait+degraded, last acting [13,8] >>> pg 9.22 is stuck unclean since forever, current state incomplete, last >>> acting [7,24] >>> pg 2.29 is stuck unclean for 5629.190514, current state >>> active+remapped+wait_backfill, last acting [24,40] >>> pg 9.fb is stuck unclean for 3640.777545, current state >>> active+remapped+wait_backfill, last acting [8,39] >>> pg 9.23 is stuck unclean for 3595.306511, current state >>> active+remapped+wait_backfill, last acting [35,9] >>> pg 2.f3 is stuck unclean for 4993.558900, current state >>> active+remapped+wait_backfill, last acting [6,9] >>> pg 2.f2 is stuck unclean for 8871.835444, current state >>> active+recovery_wait+degraded, last acting [6,4] >>> pg 2.2a is stuck unclean for 97629.478922, current state incomplete, >>> last acting [24,5] >>> pg 2.ed is stuck unclean for 3595.395657, current state >>> active+remapped+backfilling, last acting [9,40] >>> pg 2.24 is stuck unclean for 6391.873856, current state >>> active+remapped+wait_backfill, last acting [13,40] >>> pg 2.27 is stuck unclean for 6814.809178, current state >>> active+recovery_wait+degraded, last acting [13,3] >>> pg 2.e8 is stuck unclean for 11759.373756, current state >>> active+remapped+wait_backfill, last acting [15,36] >>> pg 11.29 is stuck unclean for 6907.684021, current state >>> active+remapped+wait_backfill, last acting [14,40] >>> pg 2.eb is stuck unclean for 14474.951608, current state >>> active+remapped+backfilling, last acting [0,31] >>> pg 2.ea is stuck unclean for 3595.396597, current state >>> active+remapped+backfilling, last acting [9,34] >>> pg 12.13 is stuck unclean for 5629.177184, current state >>> active+remapped, last acting [8,31] >>> pg 2.1d is stuck unclean for 12245.891518, current state >>> active+remapped+backfilling, last acting [3,6] >>> pg 11.15 is stuck unclean for 14683.173113, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [34,9] >>> pg 2.1c is 
stuck unclean for 14683.755228, current state >>> active+degraded+remapped+backfilling, last acting [14,11] >>> pg 11.16 is stuck unclean for 5629.180301, current state >>> active+remapped+wait_backfill, last acting [15,40] >>> pg 2.1f is stuck unclean for 11858.149360, current state >>> active+remapped+wait_backfill, last acting [15,3] >>> pg 0.1c is stuck unclean for 6907.683196, current state >>> active+remapped+wait_backfill, last acting [12,3] >>> pg 2.1e is stuck unclean for 102531.318993, current state >>> active+undersized+degraded+remapped+backfilling, last acting [13] >>> pg 2.e0 is stuck unclean for 3571.898995, current state >>> active+remapped+inconsistent+wait_backfill, last acting [6,9] >>> pg 2.18 is stuck unclean for 3502.358091, current state >>> active+remapped+backfilling, last acting [18,9] >>> pg 2.e3 is stuck unclean for 12047.716242, current state >>> active+remapped+backfilling, last acting [4,41] >>> pg 11.13 is stuck unclean for 6907.682681, current state >>> active+remapped+wait_backfill, last acting [14,8] >>> pg 9.d6 is stuck unclean for 7416.596559, current state >>> active+remapped+wait_backfill, last acting [1,9] >>> pg 9.1e is stuck unclean since forever, current state incomplete, last >>> acting [7,15] >>> pg 11.1c is stuck unclean since forever, current state down+incomplete, >>> last acting [6,7] >>> pg 2.15 is stuck unclean since forever, current state incomplete, last >>> acting [7,31] >>> pg 2.dc is stuck unclean for 11709.774640, current state >>> active+remapped+backfilling, last acting [40,4] >>> pg 2.14 is stuck unclean for 3504.589025, current state >>> active+remapped+backfilling, last acting [18,9] >>> pg 2.df is stuck unclean for 5047.489499, current state >>> active+remapped+wait_backfill, last acting [0,13] >>> pg 11.1e is stuck unclean for 1968924.322629, current state >>> active+remapped+wait_backfill, last acting [3,38] >>> pg 2.de is stuck unclean for 97621.617826, current state >>> 
active+undersized+degraded+remapped+backfilling, last acting [3] >>> pg 9.1d is stuck unclean for 48349.818420, current state >>> active+remapped+backfill_toofull, last acting [12,36] >>> pg 3.17 is stuck unclean for 5629.187939, current state active+remapped, >>> last acting [5,13] >>> pg 2.d8 is stuck unclean for 7418.583365, current state >>> active+remapped+backfilling, last acting [21,41] >>> pg 7.15 is stuck unclean for 98830.449502, current state >>> active+remapped+wait_backfill, last acting [13,2] >>> pg 11.19 is stuck unclean for 3925.828027, current state >>> active+remapped+wait_backfill, last acting [15,38] >>> pg 2.db is stuck unclean for 3595.396853, current state >>> active+remapped+backfilling, last acting [9,40] >>> pg 9.18 is stuck unclean for 27500.110917, current state >>> active+remapped+backfill_toofull, last acting [18,13] >>> pg 7.16 is stuck unclean since forever, current state incomplete, last >>> acting [6,7] >>> pg 2.13 is stuck unclean since forever, current state incomplete, last >>> acting [7,10] >>> pg 9.de is stuck unclean since forever, current state incomplete, last >>> acting [6,5] >>> pg 9.6 is stuck unclean for 219342.087677, current state >>> active+remapped+backfill_toofull, last acting [2,41] >>> pg 2.d is stuck unclean since forever, current state incomplete, last >>> acting [5,10] >>> pg 9.df is stuck unclean for 48360.843924, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [35,2] >>> pg 8.6 is stuck unclean for 5629.183555, current state active+remapped, >>> last acting [12,13] >>> pg 2.d7 is stuck unclean for 83782.680541, current state >>> active+undersized+degraded+remapped+backfilling, last acting [36] >>> pg 9.dc is stuck unclean for 113491.800754, current state incomplete, >>> last acting [6,7] >>> pg 7.a is stuck unclean for 3844.286529, current state >>> active+remapped+wait_backfill, last acting [38,2] >>> pg 9.5 is stuck unclean since forever, current state incomplete, last >>> 
acting [5,18] >>> pg 4.8 is stuck unclean for 3893.186289, current state >>> active+recovery_wait+degraded, last acting [15,2] >>> pg 3.d0 is stuck unclean for 7418.584435, current state >>> active+remapped+wait_backfill, last acting [12,2] >>> pg 2.d1 is stuck unclean for 83769.259615, current state >>> active+undersized+degraded+remapped+backfill_toofull, last acting [36] >>> pg 9.3 is stuck unclean since forever, current state incomplete, last >>> acting [7,15] >>> pg 9.d8 is stuck unclean for 115082.575647, current state >>> down+incomplete, last acting [21,5] >>> pg 2.b is stuck unclean for 7418.564413, current state >>> active+remapped+backfilling, last acting [40,24] >>> pg 9.d9 is stuck unclean for 14681.601684, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [39,4] >>> pg 9.1 is stuck unclean for 3930.973909, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [39,3] >>> pg 2.cc is stuck unclean for 5078.643356, current state active+remapped, >>> last acting [40,24] >>> pg 11.d is stuck unclean for 14592.297817, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [36,4] >>> pg 9.c5 is stuck unclean for 3844.281162, current state >>> active+remapped+wait_backfill, last acting [5,38] >>> pg 9.a is stuck unclean for 113328.467988, current state incomplete, >>> last acting [18,7] >>> pg 11.9 is stuck unclean for 7418.578072, current state >>> active+remapped+wait_backfill, last acting [21,39] >>> pg 2.0 is stuck unclean for 97873.488751, current state >>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last >>> acting [1] >>> pg 2.cb is stuck unclean for 25031.035830, current state >>> active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4] >>> pg 9.8 is stuck unclean for 24341.317696, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [5,24] >>> pg 2.3 is stuck unclean for 97655.280232, current state incomplete, 
last >>> acting [14,5] >>> pg 2.2 is stuck unclean for 97734.492834, current state >>> active+recovery_wait+degraded+remapped, last acting [13,9] >>> pg 2.c4 is stuck unclean for 3595.525931, current state >>> active+remapped+backfilling, last acting [34,9] >>> pg 2.c7 is stuck unclean for 8871.729496, current state >>> active+recovery_wait+degraded, last acting [13,2] >>> pg 9.cb is stuck unclean for 5629.175300, current state active+remapped, >>> last acting [11,31] >>> pg 9.c9 is stuck unclean for 14683.752701, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [5,34] >>> pg 2.c2 is stuck unclean for 3504.738005, current state >>> active+remapped+wait_backfill, last acting [9,15] >>> pg 2.bd is stuck unclean for 3571.325492, current state >>> active+remapped+backfilling, last acting [39,9] >>> pg 2.bf is stuck unclean for 99913.876400, current state incomplete, >>> last acting [15,7] >>> pg 9.b3 is stuck unclean for 3925.828356, current state >>> active+remapped+wait_backfill, last acting [15,35] >>> pg 2.b5 is stuck unclean for 28026.340079, current state >>> active+remapped, last acting [2,40] >>> pg 2.b6 is stuck unclean for 11859.834286, current state >>> active+remapped+backfilling, last acting [1,31] >>> pg 2.b0 is stuck unclean for 98000.689674, current state incomplete, >>> last acting [24,7] >>> pg 2.b3 is stuck unclean for 5629.182841, current state >>> active+remapped+backfilling, last acting [3,0] >>> pg 2.ad is stuck unclean for 6907.677050, current state >>> active+remapped+backfilling, last acting [2,39] >>> pg 2.ae is stuck unclean for 11862.967346, current state >>> active+remapped+backfilling, last acting [34,13] >>> pg 9.a0 is stuck unclean for 14683.746136, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [1,3] >>> pg 2.aa is stuck unclean for 3571.307756, current state >>> active+remapped+backfilling, last acting [40,9] >>> pg 2.a7 is stuck unclean for 25030.658836, current state >>> 
active+remapped+wait_backfill, last acting [2,1] >>> pg 2.a6 is stuck unclean for 3930.913873, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [2,35] >>> pg 9.ad is stuck unclean for 8871.819919, current state >>> active+recovery_wait+degraded, last acting [6,8] >>> pg 2.a1 is stuck unclean for 98785.889529, current state incomplete, >>> last acting [14,12] >>> pg 1.a0 is stuck unclean for 5629.186426, current state active+remapped, >>> last acting [5,40] >>> pg 9.a8 is stuck unclean for 118575.035913, current state incomplete, >>> last acting [14,7] >>> pg 2.a2 is stuck unclean since forever, current state incomplete, last >>> acting [5,13] >>> pg 2.9d is stuck unclean for 11861.496234, current state >>> active+remapped+backfilling, last acting [6,38] >>> pg 2.9c is stuck unclean for 3506.888979, current state >>> active+remapped+wait_backfill, last acting [35,11] >>> pg 2.9b is stuck unclean for 5629.183979, current state >>> active+remapped+wait_backfill, last acting [6,0] >>> pg 9.91 is stuck unclean for 85752.028652, current state >>> active+remapped+wait_backfill, last acting [31,9] >>> pg 2.97 is stuck unclean for 9736.783735, current state >>> active+remapped+backfilling, last acting [35,24] >>> pg 2.91 is stuck unclean for 28553.979772, current state >>> active+remapped+backfilling, last acting [0,24] >>> pg 2.90 is stuck unclean for 30364.623932, current state >>> active+degraded+remapped+backfill_toofull, last acting [41,24] >>> pg 2.92 is stuck unclean for 25031.211566, current state >>> active+undersized+degraded+remapped+backfilling, last acting [8] >>> pg 9.99 is stuck unclean for 11862.827419, current state >>> active+remapped+wait_backfill, last acting [13,4] >>> pg 2.8f is stuck unclean for 17426.148382, current state >>> active+remapped+wait_backfill, last acting [15,9] >>> pg 2.88 is stuck unclean for 3591.054564, current state >>> active+remapped+wait_backfill, last acting [14,9] >>> pg 9.8f is stuck unclean for 
3595.395794, current state >>> active+remapped+wait_backfill, last acting [9,15] >>> pg 2.87 is stuck unclean for 3844.271547, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [1,2] >>> pg 2.81 is stuck unclean for 83759.347793, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 9.8a is stuck unclean for 27697.026446, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [12,1] >>> pg 2.79 is stuck unclean for 12137.676488, current state >>> active+remapped+backfilling, last acting [7,40] >>> pg 2.78 is stuck unclean for 29127.120125, current state >>> active+remapped+backfilling, last acting [0,6] >>> pg 2.75 is stuck unclean since forever, current state down+incomplete, >>> last acting [7,15] >>> pg 2.74 is stuck unclean for 97658.383751, current state incomplete, >>> last acting [13,5] >>> pg 9.7c is stuck unclean for 114170.469704, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 9.7d is stuck unclean for 14077.123326, current state >>> active+remapped+backfilling, last acting [5,24] >>> pg 2.71 is stuck unclean for 11859.344208, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [21,3] >>> pg 2.73 is stuck unclean for 11859.417605, current state >>> active+remapped+backfilling, last acting [39,15] >>> pg 9.78 is stuck unclean since forever, current state incomplete, last >>> acting [5,24] >>> pg 9.79 is stuck unclean for 14595.569162, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [39,3] >>> pg 2.6d is stuck unclean for 27802.265038, current state >>> active+remapped+backfilling, last acting [4,13] >>> pg 9.62 is stuck unclean for 25030.488507, current state >>> active+remapped+backfill_toofull, last acting [36,2] >>> pg 2.6a is stuck unclean for 20323.517565, current state >>> active+remapped+wait_backfill, last acting [6,40] >>> pg 9.6c is stuck unclean for 
14234.077824, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [41,2] >>> pg 9.6a is stuck unclean for 27035.043476, current state >>> active+remapped+backfill_toofull, last acting [36,4] >>> pg 2.63 is stuck unclean for 97665.177288, current state incomplete, >>> last acting [31,7] >>> pg 2.5d is stuck unclean for 3549.763078, current state >>> active+remapped+wait_backfill, last acting [9,34] >>> pg 2.5e is stuck unclean for 97736.064280, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [35,36] >>> pg 2.52 is stuck unclean for 8871.832670, current state >>> active+recovery_wait+degraded, last acting [6,4] >>> pg 9.59 is stuck unclean for 26868.986032, current state >>> active+remapped+wait_backfill, last acting [31,34] >>> pg 2.4f is stuck unclean for 12108.325792, current state >>> active+remapped+backfilling, last acting [11,40] >>> pg 2.49 is stuck unclean for 30446.302835, current state >>> active+remapped+wait_backfill, last acting [9,24] >>> pg 9.42 is stuck unclean for 108836.104626, current state incomplete, >>> last acting [31,12] >>> pg 2.45 is stuck unclean for 11284.580305, current state >>> active+degraded+remapped+backfilling, last acting [24,2] >>> pg 9.4f is stuck unclean for 3893.672356, current state >>> active+remapped+wait_backfill, last acting [0,21] >>> pg 2.44 is stuck unclean for 27623.439527, current state >>> active+recovery_wait+degraded+remapped, last acting [6,11] >>> pg 9.4c is stuck unclean for 6907.681859, current state >>> active+remapped+wait_backfill, last acting [15,36] >>> pg 2.46 is stuck unclean for 6907.682263, current state >>> active+remapped+backfilling, last acting [11,24] >>> pg 9.49 is stuck unclean for 14683.624639, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [2,31] >>> pg 11.35 is stuck unclean for 5872394.444913, current state >>> active+remapped+wait_backfill, last acting [40,36] >>> pg 2.3e is stuck unclean for 
6907.683506, current state >>> active+remapped+backfilling, last acting [4,41] >>> pg 2.38 is stuck unclean for 5140.320861, current state >>> active+remapped+wait_backfill, last acting [0,5] >>> pg 2.3b is stuck unclean for 14456.624593, current state >>> active+remapped+wait_backfill+backfill_toofull, last acting [18,2] >>> pg 11.33 is stuck unclean since forever, current state down+incomplete, >>> last acting [7,6] >>> pg 10.3d is stuck unclean for 3595.395921, current state >>> active+remapped+wait_backfill, last acting [9,36] >>> pg 2.35 is stuck unclean for 8872.226171, current state >>> active+recovery_wait+degraded, last acting [6,11] >>> pg 2.fc is stuck unclean for 5820.330202, current state >>> active+remapped+backfilling, last acting [31,0] >>> pg 9.3f is stuck unclean since forever, current state incomplete, last >>> acting [5,14] >>> pg 2.ff is stuck unclean for 3595.396088, current state >>> active+remapped+backfilling, last acting [9,39] >>> pg 2.fe is stuck unclean for 6904.439076, current state >>> active+remapped+backfilling, last acting [21,0] >>> pg 9.f5 is stuck unclean for 103009.439909, current state incomplete, >>> last acting [18,5] >>> pg 7.34 is stuck unclean for 3886.510000, current state >>> active+remapped+wait_backfill, last acting [13,39] >>> pg 2.fb is stuck unclean for 57173.985429, current state >>> active+recovery_wait+degraded+remapped, last acting [6,8] >>> pg 2.32 is stuck unclean since forever, current state incomplete, last >>> acting [5,13] >>> pg 9.fe is stuck unclean for 7418.564930, current state >>> active+recovery_wait+degraded+remapped, last acting [6,3] >>> pg 9.26 is stuck unclean since forever, current state incomplete, last >>> acting [5,24] >>> pg 2.f7 is stuck unclean for 6915.532617, current state >>> active+remapped+backfilling, last acting [4,15] >>> pg 9.fc is stuck unclean for 201476.093824, current state incomplete, >>> last acting [13,5] >>> pg 7.2b is stuck undersized for 64282.169836, current state >>> 
stale+active+undersized+degraded, last acting [5] >>> pg 2.1e is stuck undersized for 3895.207475, current state >>> active+undersized+degraded+remapped+backfilling, last acting [13] >>> pg 2.de is stuck undersized for 3886.529396, current state >>> active+undersized+degraded+remapped+backfilling, last acting [3] >>> pg 2.d7 is stuck undersized for 7417.316099, current state >>> active+undersized+degraded+remapped+backfilling, last acting [36] >>> pg 2.d1 is stuck undersized for 6903.297196, current state >>> active+undersized+degraded+remapped+backfill_toofull, last acting [36] >>> pg 2.0 is stuck undersized for 4999.401505, current state >>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last >>> acting [1] >>> pg 2.92 is stuck undersized for 4999.406547, current state >>> active+undersized+degraded+remapped+backfilling, last acting [8] >>> pg 2.81 is stuck undersized for 7417.378668, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 9.7c is stuck undersized for 3894.953894, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 9.25 is stuck degraded for 7413.083043, current state >>> active+degraded+remapped+wait_backfill, last acting [15,2] >>> pg 7.2b is stuck degraded for 64282.169913, current state >>> stale+active+undersized+degraded, last acting [5] >>> pg 2.f1 is stuck degraded for 3848.032008, current state >>> active+recovery_wait+degraded, last acting [13,8] >>> pg 2.f2 is stuck degraded for 7411.108195, current state >>> active+recovery_wait+degraded, last acting [6,4] >>> pg 2.27 is stuck degraded for 3893.230317, current state >>> active+recovery_wait+degraded, last acting [13,3] >>> pg 2.1c is stuck degraded for 7414.316299, current state >>> active+degraded+remapped+backfilling, last acting [14,11] >>> pg 2.1e is stuck degraded for 3895.207564, current state >>> active+undersized+degraded+remapped+backfilling, last acting [13] >>> pg 2.de is 
stuck degraded for 3886.529484, current state >>> active+undersized+degraded+remapped+backfilling, last acting [3] >>> pg 2.d7 is stuck degraded for 7417.316187, current state >>> active+undersized+degraded+remapped+backfilling, last acting [36] >>> pg 4.8 is stuck degraded for 3490.406821, current state >>> active+recovery_wait+degraded, last acting [15,2] >>> pg 2.d1 is stuck degraded for 6903.297288, current state >>> active+undersized+degraded+remapped+backfill_toofull, last acting [36] >>> pg 2.0 is stuck degraded for 4999.401597, current state >>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last >>> acting [1] >>> pg 2.cb is stuck degraded for 7413.316930, current state >>> active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4] >>> pg 2.2 is stuck degraded for 3894.930841, current state >>> active+recovery_wait+degraded+remapped, last acting [13,9] >>> pg 2.c7 is stuck degraded for 3886.500328, current state >>> active+recovery_wait+degraded, last acting [13,2] >>> pg 9.ad is stuck degraded for 7411.181412, current state >>> active+recovery_wait+degraded, last acting [6,8] >>> pg 2.90 is stuck degraded for 3893.715235, current state >>> active+degraded+remapped+backfill_toofull, last acting [41,24] >>> pg 2.92 is stuck degraded for 4999.406655, current state >>> active+undersized+degraded+remapped+backfilling, last acting [8] >>> pg 2.81 is stuck degraded for 7417.378776, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 9.7c is stuck degraded for 3894.954001, current state >>> active+undersized+degraded+remapped+wait_backfill, last acting [39] >>> pg 2.52 is stuck degraded for 7411.108431, current state >>> active+recovery_wait+degraded, last acting [6,4] >>> pg 2.45 is stuck degraded for 3892.755878, current state >>> active+degraded+remapped+backfilling, last acting [24,2] >>> pg 2.44 is stuck degraded for 7411.213966, current state >>> 
active+recovery_wait+degraded+remapped, last acting [6,11] >>> pg 2.35 is stuck degraded for 7411.295348, current state >>> active+recovery_wait+degraded, last acting [6,11] >>> pg 2.fb is stuck degraded for 6903.301076, current state >>> active+recovery_wait+degraded+remapped, last acting [6,8] >>> pg 9.fe is stuck degraded for 7413.453955, current state >>> active+recovery_wait+degraded+remapped, last acting [6,3] >>> pg 7.2b is stuck stale for 64232.262041, current state >>> stale+active+undersized+degraded, last acting [5] >>> pg 2.fc is active+remapped+backfilling, acting [31,0] >>> pg 2.ff is active+remapped+backfilling, acting [9,39] >>> pg 9.f5 is incomplete, acting [18,5] >>> pg 2.fe is active+remapped+backfilling, acting [21,0] >>> pg 2.fb is active+recovery_wait+degraded+remapped, acting [6,8] >>> pg 9.fe is active+recovery_wait+degraded+remapped, acting [6,3] >>> pg 9.fc is incomplete, acting [13,5] >>> pg 2.f7 is active+remapped+backfilling, acting [4,15] >>> pg 2.f1 is active+recovery_wait+degraded, acting [13,8] >>> pg 9.fb is active+remapped+wait_backfill, acting [8,39] >>> pg 2.f3 is active+remapped+wait_backfill, acting [6,9] >>> pg 2.f2 is active+recovery_wait+degraded, acting [6,4] >>> pg 2.ed is active+remapped+backfilling, acting [9,40] >>> pg 2.e8 is active+remapped+wait_backfill, acting [15,36] >>> pg 2.eb is active+remapped+backfilling, acting [0,31] >>> pg 2.ea is active+remapped+backfilling, acting [9,34] >>> pg 2.e0 is active+remapped+inconsistent+wait_backfill, acting [6,9] >>> pg 2.e3 is active+remapped+backfilling, acting [4,41] >>> pg 9.d6 is active+remapped+wait_backfill, acting [1,9] >>> pg 2.dc is active+remapped+backfilling, acting [40,4] >>> pg 2.df is active+remapped+wait_backfill, acting [0,13] >>> pg 2.de is active+undersized+degraded+remapped+backfilling, acting [3] >>> pg 2.d8 is active+remapped+backfilling, acting [21,41] >>> pg 2.db is active+remapped+backfilling, acting [9,40] >>> pg 9.de is incomplete, acting [6,5] >>> 
pg 9.df is active+remapped+wait_backfill+backfill_toofull, acting [35,2] >>> pg 9.dc is incomplete, acting [6,7] >>> pg 2.d7 is active+undersized+degraded+remapped+backfilling, acting [36] >>> pg 2.d1 is active+undersized+degraded+remapped+backfill_toofull, acting >>> [36] >>> pg 3.d0 is active+remapped+wait_backfill, acting [12,2] >>> pg 9.d8 is down+incomplete, acting [21,5] >>> pg 9.d9 is active+remapped+wait_backfill+backfill_toofull, acting [39,4] >>> pg 9.c5 is active+remapped+wait_backfill, acting [5,38] >>> pg 2.cb is active+degraded+remapped+wait_backfill+backfill_toofull, >>> acting [1,4] >>> pg 2.c4 is active+remapped+backfilling, acting [34,9] >>> pg 2.c7 is active+recovery_wait+degraded, acting [13,2] >>> pg 2.c2 is active+remapped+wait_backfill, acting [9,15] >>> pg 9.c9 is active+remapped+wait_backfill+backfill_toofull, acting [5,34] >>> pg 2.bd is active+remapped+backfilling, acting [39,9] >>> pg 2.bf is incomplete, acting [15,7] >>> pg 9.b3 is active+remapped+wait_backfill, acting [15,35] >>> pg 2.b6 is active+remapped+backfilling, acting [1,31] >>> pg 2.b0 is incomplete, acting [24,7] >>> pg 2.b3 is active+remapped+backfilling, acting [3,0] >>> pg 2.ad is active+remapped+backfilling, acting [2,39] >>> pg 2.ae is active+remapped+backfilling, acting [34,13] >>> pg 9.a0 is active+remapped+wait_backfill+backfill_toofull, acting [1,3] >>> pg 2.aa is active+remapped+backfilling, acting [40,9] >>> pg 2.a7 is active+remapped+wait_backfill, acting [2,1] >>> pg 2.a6 is active+remapped+wait_backfill+backfill_toofull, acting [2,35] >>> pg 9.ad is active+recovery_wait+degraded, acting [6,8] >>> pg 2.a1 is incomplete, acting [14,12] >>> pg 9.a8 is incomplete, acting [14,7] >>> pg 2.a2 is incomplete, acting [5,13] >>> pg 2.9d is active+remapped+backfilling, acting [6,38] >>> pg 2.9c is active+remapped+wait_backfill, acting [35,11] >>> pg 2.9b is active+remapped+wait_backfill, acting [6,0] >>> pg 9.91 is active+remapped+wait_backfill, acting [31,9] >>> pg 2.97 is 
active+remapped+backfilling, acting [35,24] >>> pg 2.91 is active+remapped+backfilling, acting [0,24] >>> pg 2.90 is active+degraded+remapped+backfill_toofull, acting [41,24] >>> pg 2.92 is active+undersized+degraded+remapped+backfilling, acting [8] >>> pg 9.99 is active+remapped+wait_backfill, acting [13,4] >>> pg 2.8f is active+remapped+wait_backfill, acting [15,9] >>> pg 2.88 is active+remapped+wait_backfill, acting [14,9] >>> pg 9.8f is active+remapped+wait_backfill, acting [9,15] >>> pg 2.87 is active+remapped+wait_backfill+backfill_toofull, acting [1,2] >>> pg 2.81 is active+undersized+degraded+remapped+wait_backfill, acting [39] >>> pg 9.8a is active+remapped+wait_backfill+backfill_toofull, acting [12,1] >>> pg 2.79 is active+remapped+backfilling, acting [7,40] >>> pg 2.78 is active+remapped+backfilling, acting [0,6] >>> pg 2.75 is down+incomplete, acting [7,15] >>> pg 2.74 is incomplete, acting [13,5] >>> pg 9.7c is active+undersized+degraded+remapped+wait_backfill, acting [39] >>> pg 9.7d is active+remapped+backfilling, acting [5,24] >>> pg 2.71 is active+remapped+wait_backfill+backfill_toofull, acting [21,3] >>> pg 2.73 is active+remapped+backfilling, acting [39,15] >>> pg 9.78 is incomplete, acting [5,24] >>> pg 9.79 is active+remapped+wait_backfill+backfill_toofull, acting [39,3] >>> pg 2.6d is active+remapped+backfilling, acting [4,13] >>> pg 9.62 is active+remapped+backfill_toofull, acting [36,2] >>> pg 2.6a is active+remapped+wait_backfill, acting [6,40] >>> pg 9.6c is active+remapped+wait_backfill+backfill_toofull, acting [41,2] >>> pg 9.6a is active+remapped+backfill_toofull, acting [36,4] >>> pg 2.63 is incomplete, acting [31,7] >>> pg 2.5d is active+remapped+wait_backfill, acting [9,34] >>> pg 2.5e is active+remapped+wait_backfill+backfill_toofull, acting [35,36] >>> pg 2.52 is active+recovery_wait+degraded, acting [6,4] >>> pg 9.59 is active+remapped+wait_backfill, acting [31,34] >>> pg 2.4f is active+remapped+backfilling, acting [11,40] >>> pg 
2.49 is active+remapped+wait_backfill, acting [9,24]
>>> pg 9.42 is incomplete, acting [31,12]
>>> pg 2.45 is active+degraded+remapped+backfilling, acting [24,2]
>>> pg 2.44 is active+recovery_wait+degraded+remapped, acting [6,11]
>>> pg 9.4f is active+remapped+wait_backfill, acting [0,21]
>>> pg 9.4c is active+remapped+wait_backfill, acting [15,36]
>>> pg 2.46 is active+remapped+backfilling, acting [11,24]
>>> pg 9.49 is active+remapped+wait_backfill+backfill_toofull, acting [2,31]
>>> pg 11.35 is active+remapped+wait_backfill, acting [40,36]
>>> pg 2.3e is active+remapped+backfilling, acting [4,41]
>>> pg 2.38 is active+remapped+wait_backfill, acting [0,5]
>>> pg 2.3b is active+remapped+wait_backfill+backfill_toofull, acting [18,2]
>>> pg 11.33 is down+incomplete, acting [7,6]
>>> pg 2.35 is active+recovery_wait+degraded, acting [6,11]
>>> pg 10.3d is active+remapped+wait_backfill, acting [9,36]
>>> pg 9.3f is incomplete, acting [5,14]
>>> pg 7.34 is active+remapped+wait_backfill, acting [13,39]
>>> pg 2.32 is incomplete, acting [5,13]
>>> pg 9.26 is incomplete, acting [5,24]
>>> pg 11.27 is active+remapped+wait_backfill+backfill_toofull, acting [4,36]
>>> pg 9.25 is active+degraded+remapped+wait_backfill, acting [15,2]
>>> pg 2.29 is active+remapped+wait_backfill, acting [24,40]
>>> pg 9.22 is incomplete, acting [7,24]
>>> pg 9.23 is active+remapped+wait_backfill, acting [35,9]
>>> pg 2.2a is incomplete, acting [24,5]
>>> pg 2.24 is active+remapped+wait_backfill, acting [13,40]
>>> pg 2.27 is active+recovery_wait+degraded, acting [13,3]
>>> pg 11.29 is active+remapped+wait_backfill, acting [14,40]
>>> pg 2.1d is active+remapped+backfilling, acting [3,6]
>>> pg 2.1c is active+degraded+remapped+backfilling, acting [14,11]
>>> pg 11.15 is active+remapped+wait_backfill+backfill_toofull, acting [34,9]
>>> pg 2.1f is active+remapped+wait_backfill, acting [15,3]
>>> pg 11.16 is active+remapped+wait_backfill, acting [15,40]
>>> pg 2.1e is
active+undersized+degraded+remapped+backfilling, acting [13]
>>> pg 0.1c is active+remapped+wait_backfill, acting [12,3]
>>> pg 2.18 is active+remapped+backfilling, acting [18,9]
>>> pg 11.13 is active+remapped+wait_backfill, acting [14,8]
>>> pg 2.15 is incomplete, acting [7,31]
>>> pg 11.1c is down+incomplete, acting [6,7]
>>> pg 9.1e is incomplete, acting [7,15]
>>> pg 2.14 is active+remapped+backfilling, acting [18,9]
>>> pg 11.1e is active+remapped+wait_backfill, acting [3,38]
>>> pg 9.1d is active+remapped+backfill_toofull, acting [12,36]
>>> pg 11.19 is active+remapped+wait_backfill, acting [15,38]
>>> pg 7.15 is active+remapped+wait_backfill, acting [13,2]
>>> pg 2.13 is incomplete, acting [7,10]
>>> pg 7.16 is incomplete, acting [6,7]
>>> pg 9.18 is active+remapped+backfill_toofull, acting [18,13]
>>> pg 2.d is incomplete, acting [5,10]
>>> pg 9.6 is active+remapped+backfill_toofull, acting [2,41]
>>> pg 7.a is active+remapped+wait_backfill, acting [38,2]
>>> pg 4.8 is active+recovery_wait+degraded, acting [15,2]
>>> pg 9.5 is incomplete, acting [5,18]
>>> pg 9.3 is incomplete, acting [7,15]
>>> pg 2.b is active+remapped+backfilling, acting [40,24]
>>> pg 9.1 is active+remapped+wait_backfill+backfill_toofull, acting [39,3]
>>> pg 11.d is active+remapped+wait_backfill+backfill_toofull, acting [36,4]
>>> pg 9.a is incomplete, acting [18,7]
>>> pg 2.0 is active+undersized+degraded+remapped+wait_backfill+backfill_toofull, acting [1]
>>> pg 11.9 is active+remapped+wait_backfill, acting [21,39]
>>> pg 2.3 is incomplete, acting [14,5]
>>> pg 9.8 is active+remapped+wait_backfill+backfill_toofull, acting [5,24]
>>> pg 2.2 is active+recovery_wait+degraded+remapped, acting [13,9]
>>> 33 ops are blocked > 16777.2 sec
>>> 368 ops are blocked > 8388.61 sec
>>> 238 ops are blocked > 4194.3 sec
>>> 87 ops are blocked > 1048.58 sec
>>> 2 ops are blocked > 8388.61 sec on osd.5
>>> 98 ops are blocked > 4194.3 sec on osd.5
>>> 98 ops are blocked > 8388.61 sec on osd.6
>>> 1 ops are blocked > 8388.61 sec on osd.7
>>> 27 ops are blocked > 4194.3 sec on osd.7
>>> 12 ops are blocked > 4194.3 sec on osd.13
>>> 87 ops are blocked > 1048.58 sec on osd.13
>>> 2 ops are blocked > 16777.2 sec on osd.14
>>> 98 ops are blocked > 8388.61 sec on osd.14
>>> 3 ops are blocked > 16777.2 sec on osd.15
>>> 97 ops are blocked > 8388.61 sec on osd.15
>>> 1 ops are blocked > 4194.3 sec on osd.18
>>> 100 ops are blocked > 4194.3 sec on osd.24
>>> 28 ops are blocked > 16777.2 sec on osd.31
>>> 72 ops are blocked > 8388.61 sec on osd.31
>>> 9 osds have slow requests
>>> recovery 59636/5032695 objects degraded (1.185%)
>>> recovery 1280976/5032695 objects misplaced (25.453%)
>>> 1 scrub errors
>>> noscrub,nodeep-scrub flag(s) set
>>>
>>> The OSDs on the first failed host are 6, 13, 14, 15, 18, 24, 31
>>>
>>> The OSDs on the second host that went down are 5 and 7
>>>
>>> On Sun, 2 Sep 2018 at 15:15, David Turner <drakonst...@gmail.com> wrote:
>>>
>>>> When the first node went offline with a dead SSD journal, all of the
>>>> data on the OSDs was useless. Unless you could flush the journals, you
>>>> can't guarantee that a write the cluster thinks happened actually made
>>>> it to the disk. The proper procedure here is to remove those OSDs and
>>>> add them again as new OSDs.
>>>>
>>>> `ceph health detail` will give you some more information on the blocked
>>>> requests. Depending on what that shows you can often find the OSD that is
>>>> causing the problems. But your biggest problem is that you have disks
>>>> with potentially inconsistent data in your cluster.
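To make the "find the OSD that is causing the problems" step concrete, here is a rough Python sketch that tallies which OSDs appear in the acting set of incomplete PGs, and how many blocked ops each OSD reports. It assumes the `ceph health detail` text has been saved with one entry per line in the same format as the output quoted above; the function name `tally` and the sample lines (copied from that output) are only for illustration.

```python
import re
from collections import Counter

# Matches lines such as "pg 2.75 is down+incomplete, acting [7,15]".
PG_LINE = re.compile(r"pg (\S+) is (\S+), acting \[([\d,]+)\]")
# Matches lines such as "98 ops are blocked > 8388.61 sec on osd.6".
OPS_LINE = re.compile(r"(\d+) ops are blocked > [\d.]+ sec on osd\.(\d+)")


def tally(health_detail):
    """Return (incomplete_pgs_per_osd, blocked_ops_per_osd) as Counters."""
    incomplete = Counter()
    blocked = Counter()
    for line in health_detail.splitlines():
        pg = PG_LINE.search(line)
        # Count an OSD once for every incomplete PG it is acting for.
        if pg and "incomplete" in pg.group(2).split("+"):
            for osd in pg.group(3).split(","):
                incomplete[int(osd)] += 1
        ops = OPS_LINE.search(line)
        # Sum blocked ops over all age buckets reported for an OSD.
        if ops:
            blocked[int(ops.group(2))] += int(ops.group(1))
    return incomplete, blocked


# A few lines copied from the health detail output in this thread.
sample = """\
pg 2.75 is down+incomplete, acting [7,15]
pg 2.74 is incomplete, acting [13,5]
pg 9.78 is incomplete, acting [5,24]
pg 2.79 is active+remapped+backfilling, acting [7,40]
2 ops are blocked > 8388.61 sec on osd.5
98 ops are blocked > 4194.3 sec on osd.5
98 ops are blocked > 8388.61 sec on osd.6
"""

incomplete, blocked = tally(sample)
print("incomplete PGs per OSD:", incomplete.most_common())
print("blocked ops per OSD:", blocked.most_common())
```

Entries that the mail client wrapped across several lines would need to be rejoined first, and the "stuck inactive ... last acting" lines are deliberately not matched so that each PG is counted only once.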
>>>>
>>>> On Sun, Sep 2, 2018, 4:42 AM Lee <lqui...@gmail.com> wrote:
>>>>
>>>>> We are running 0.94.5 as part of an OpenStack environment. Our Ceph
>>>>> setup is 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon
>>>>> outage in our hosting environment. One OSD node failed (offline with
>>>>> the journal SSD dead), leaving 2 nodes running correctly. Two hours
>>>>> later a second OSD node failed, complaining of read/write errors to
>>>>> the physical drives. I assume this was a heat issue, as when rebooted
>>>>> it came back online OK and Ceph started to repair itself. We have
>>>>> since brought the first failed node back online by replacing the SSD
>>>>> and recreating the journals, hoping it would all repair.
>>>>> Our pools are min 2 repl.
>>>>>
>>>>> The problem we have is that client IO (read) is totally blocked, and
>>>>> when I query the stuck PGs it just hangs.
>>>>>
>>>>> For example, the check version command just errors with:
>>>>>
>>>>> Error EINTR: problem getting command descriptions from
>>>>>
>>>>> on various OSDs, so I cannot even query the inactive PGs.
>>>>>
>>>>> root@node31-a4:~# ceph -s
>>>>> cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
>>>>> health HEALTH_WARN
>>>>> 83 pgs backfill
>>>>> 2 pgs backfill_toofull
>>>>> 3 pgs backfilling
>>>>> 48 pgs degraded
>>>>> 1 pgs down
>>>>> 31 pgs incomplete
>>>>> 1 pgs recovering
>>>>> 29 pgs recovery_wait
>>>>> 1 pgs stale
>>>>> 48 pgs stuck degraded
>>>>> 31 pgs stuck inactive
>>>>> 1 pgs stuck stale
>>>>> 148 pgs stuck unclean
>>>>> 17 pgs stuck undersized
>>>>> 17 pgs undersized
>>>>> 599 requests are blocked > 32 sec
>>>>> recovery 111489/4697618 objects degraded (2.373%)
>>>>> recovery 772268/4697618 objects misplaced (16.440%)
>>>>> recovery 1/2171314 unfound (0.000%)
>>>>> monmap e5: 3 mons at {bc07s12-a7=172.27.16.11:6789/0,bc07s13-a7=172.27.16.21:6789/0,bc07s14-a7=172.27.16.15:6789/0}
>>>>> election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
>>>>> osdmap e18727: 25 osds: 25 up, 25
in; 90 remapped pgs
>>>>> pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
>>>>> 16783 GB used, 6487 GB / 23270 GB avail
>>>>> 111489/4697618 objects degraded (2.373%)
>>>>> 772268/4697618 objects misplaced (16.440%)
>>>>> 1/2171314 unfound (0.000%)
>>>>> 1639 active+clean
>>>>> 66 active+remapped+wait_backfill
>>>>> 30 incomplete
>>>>> 25 active+recovery_wait+degraded
>>>>> 15 active+undersized+degraded+remapped+wait_backfill
>>>>> 4 active+recovery_wait+degraded+remapped
>>>>> 4 active+clean+scrubbing
>>>>> 2 active+remapped+wait_backfill+backfill_toofull
>>>>> 1 down+incomplete
>>>>> 1 active+remapped+backfilling
>>>>> 1 active+clean+scrubbing+deep
>>>>> 1 stale+active+undersized+degraded
>>>>> 1 active+undersized+degraded+remapped+backfilling
>>>>> 1 active+degraded+remapped+backfilling
>>>>> 1 active+recovering+degraded
>>>>> recovery io 29385 kB/s, 7 objects/s
>>>>> client io 5877 B/s wr, 1 op/s
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
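For the advice earlier in the thread to remove the OSDs that sat on the failed journal, a small dry-run sketch in Python may help to lay out the sequence per OSD before touching anything. It only prints commands and runs nothing. The `ceph osd out`, `ceph osd crush remove`, `ceph auth del` and `ceph osd rm` steps are the usual manual removal steps for a release of this age; the `sudo service ceph stop osd.N` line is an assumption that mirrors the `service ceph start osd.20` form used in this thread, and the helper name `removal_commands` is made up for the example. The OSD ids are the ones reported for the first failed host.

```python
from typing import List


def removal_commands(osd_id: int) -> List[str]:
    """Return the manual removal sequence for one OSD as shell command strings."""
    return [
        f"ceph osd out {osd_id}",               # stop mapping data to the OSD
        f"sudo service ceph stop osd.{osd_id}",  # assumption: sysvinit-style service
        f"ceph osd crush remove osd.{osd_id}",  # drop it from the CRUSH map
        f"ceph auth del osd.{osd_id}",          # delete its cephx key
        f"ceph osd rm {osd_id}",                # remove the OSD id from the cluster
    ]


# OSD ids on the first failed host, as listed in the thread.
first_failed_host = [6, 13, 14, 15, 18, 24, 31]

for osd in first_failed_host:
    print(f"# osd.{osd}")
    for cmd in removal_commands(osd):
        print(cmd)
```

Printing instead of executing keeps the decision about ordering, and about how long to wait for recovery between OSDs, with whoever is at the keyboard. Check the steps against the documentation for the release in use before running any of them.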