Should I just mark the OSDs out first, or completely zap and recreate them?
Or delete them and let the cluster repair itself?

When the second node started back up, I had problems with the journals for
OSDs 5 and 7, so those were also recreated. All the rest are still the
originals.

I know that some PGs are on 24 together with 5 or 7, e.g.:

root@data33-a4:/var/log/ceph# ceph pg dump_stuck inactive
ok
pg_stat state   up      up_primary      acting  acting_primary
2.2a    incomplete      [24,5]  24      [24,5]  24
2.b0    incomplete      [24,7]  24      [24,7]  24
9.42    incomplete      [31,12] 31      [31,12] 31
9.de    incomplete      [6,5]   6       [6,5]   6
2.75    incomplete      [7,15]  7       [7,15]  7
9.dc    incomplete      [6,7]   6       [6,7]   6
2.74    incomplete      [13,5]  13      [13,5]  13
9.1e    incomplete      [7,15]  7       [7,15]  7
2.15    incomplete      [7,31]  7       [7,31]  7
11.1c   incomplete      [6,7]   6       [6,7]   6
2.a1    incomplete      [14,12] 14      [14,12] 14
9.d8    incomplete      [21,5]  21      [21,5]  21
9.a8    incomplete      [14,7]  14      [14,7]  14
9.78    incomplete      [5,24]  5       [5,24]  5
2.a2    incomplete      [5,13]  5       [5,13]  5
7.16    incomplete      [6,7]   6       [6,7]   6
2.13    incomplete      [7,10]  7       [7,10]  7
9.f5    incomplete      [18,5]  18      [18,5]  18
2.d     incomplete      [5,10]  5       [5,10]  5
9.5     incomplete      [5,18]  5       [5,18]  5
9.3     incomplete      [7,15]  7       [7,15]  7
9.fc    incomplete      [13,5]  13      [13,5]  13
11.33   down+incomplete [7,6]   7       [7,6]   7
9.3f    incomplete      [5,14]  5       [5,14]  5
9.a     incomplete      [18,7]  18      [18,7]  18
2.63    incomplete      [31,7]  31      [31,7]  31
2.3     incomplete      [14,5]  14      [14,5]  14
2.32    incomplete      [5,13]  5       [5,13]  5
2.bf    incomplete      [15,7]  15      [15,7]  15
9.26    incomplete      [5,24]  5       [5,24]  5
9.22    incomplete      [31,7]  31      [31,7]  31
root@data33-a4:/var/log/ceph#
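
To see at a glance which OSDs turn up most often in that list, something like the following could be run over the pasted output. It is only a rough sketch: the function name `count_osds` is made up for illustration, and it assumes the six-column layout shown above (pg_stat, state, up, up_primary, acting, acting_primary).

```python
import re
from collections import Counter


def count_osds(dump_text):
    """Count how often each OSD id appears in the acting set of incomplete PGs.

    Assumes the six whitespace-separated columns printed by
    `ceph pg dump_stuck inactive` in the output above.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip the "ok" line, the header, shell prompts and anything else
        # that is not a six-column row for an incomplete PG.
        if len(fields) != 6 or "incomplete" not in fields[1]:
            continue
        acting = fields[4]  # e.g. "[24,5]"
        counts.update(int(osd) for osd in re.findall(r"\d+", acting))
    return counts


# Three rows copied from the output above, as a small sample.
sample = """\
pg_stat state   up      up_primary      acting  acting_primary
2.2a    incomplete      [24,5]  24      [24,5]  24
2.b0    incomplete      [24,7]  24      [24,7]  24
11.33   down+incomplete [7,6]   7       [7,6]   7
"""

for osd, n in count_osds(sample).most_common():
    print(f"osd.{osd}: {n} incomplete PGs")
```

Fed the full listing, this gives a per-OSD tally, which makes it easier to judge how many of the incomplete PGs involve 5 and 7.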




On Sun, 2 Sep 2018 at 16:02, David Turner <drakonst...@gmail.com> wrote:

> The problem is that `ceph-osd --flush-journal` never ran successfully on
> the old SSD journal drive. All of the OSDs that used the dead journal need
> to be removed from the cluster, wiped, and added back in. The data on them
> is not 100% consistent because the old journal died. Any write that made it
> to the journal but not to the disk is lost.
>
> On top of that, because you chose to run with replica size = 2 and
> min_size = 1, anything that goes wrong in your cluster carries a high risk
> of data loss. Seeing as you had 2 nodes fail close to each other, there is
> a very real possibility that you will have some data loss from this.
>
> Regardless, your first step is to remove the OSDs that were on the failed
> journal. They are poison in your cluster.
>
> On Sun, Sep 2, 2018, 10:51 AM Lee <lqui...@gmail.com> wrote:
>
>> I followed:
>>
>> $ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
>> $ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' \
>>     --partition-guid=1:$journal_uuid \
>>     --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
>>
>> Then
>>
>> $ sudo ceph-osd --mkjournal -i 20
>> $ sudo service ceph start osd.20
>>
>> From 
>> https://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
>>
>> They all started without a problem.
>>
>>
>> On Sun, 2 Sep 2018 at 15:43, David Turner <drakonst...@gmail.com> wrote:
>>
>>> It looks like the OSDs on the first failed node are having problems. What
>>> commands did you run to bring it back online?
>>>
>>> On Sun, Sep 2, 2018, 10:27 AM Lee <lqui...@gmail.com> wrote:
>>>
>>>> Ok I have a lot in the health detail...
>>>>
>>>> root@node31-a4:~# ceph health detail
>>>> HEALTH_ERR 64 pgs backfill; 27 pgs backfill_toofull; 39 pgs
>>>> backfilling; 26 pgs degraded; 4 pgs down; 31 pgs incomplete; 1 pgs
>>>> inconsistent; 12 pgs recovery_wait; 1 pgs stale; 26 pgs stuck degraded; 31
>>>> pgs stuck inactive; 1 pgs stuck stale; 161 pgs stuck unclean; 9 pgs stuck
>>>> undersized; 9 pgs undersized; 726 requests are blocked > 32 sec; 9 osds
>>>> have slow requests; recovery 59636/5032695 objects degraded (1.185%);
>>>> recovery 1280976/5032695 objects misplaced (25.453%); 1 scrub errors;
>>>> noscrub,nodeep-scrub flag(s) set
>>>> pg 2.2a is stuck inactive for 97629.478505, current state incomplete,
>>>> last acting [24,5]
>>>> pg 2.b0 is stuck inactive for 98000.688979, current state incomplete,
>>>> last acting [24,7]
>>>> pg 9.42 is stuck inactive for 108836.103738, current state incomplete,
>>>> last acting [31,12]
>>>> pg 9.de is stuck inactive since forever, current state incomplete,
>>>> last acting [6,5]
>>>> pg 2.75 is stuck inactive since forever, current state down+incomplete,
>>>> last acting [7,15]
>>>> pg 9.dc is stuck inactive for 113491.800208, current state incomplete,
>>>> last acting [6,7]
>>>> pg 2.74 is stuck inactive for 97658.382960, current state incomplete,
>>>> last acting [13,5]
>>>> pg 9.1e is stuck inactive since forever, current state incomplete, last
>>>> acting [7,15]
>>>> pg 2.15 is stuck inactive since forever, current state incomplete, last
>>>> acting [7,31]
>>>> pg 11.1c is stuck inactive since forever, current state
>>>> down+incomplete, last acting [6,7]
>>>> pg 2.a1 is stuck inactive for 98785.888826, current state incomplete,
>>>> last acting [14,12]
>>>> pg 9.d8 is stuck inactive for 115082.575098, current state
>>>> down+incomplete, last acting [21,5]
>>>> pg 9.a8 is stuck inactive for 118575.035210, current state incomplete,
>>>> last acting [14,7]
>>>> pg 9.78 is stuck inactive since forever, current state incomplete, last
>>>> acting [5,24]
>>>> pg 2.a2 is stuck inactive since forever, current state incomplete, last
>>>> acting [5,13]
>>>> pg 7.16 is stuck inactive since forever, current state incomplete, last
>>>> acting [6,7]
>>>> pg 2.13 is stuck inactive since forever, current state incomplete, last
>>>> acting [7,10]
>>>> pg 9.f5 is stuck inactive for 103009.439003, current state incomplete,
>>>> last acting [18,5]
>>>> pg 2.d is stuck inactive since forever, current state incomplete, last
>>>> acting [5,10]
>>>> pg 9.5 is stuck inactive since forever, current state incomplete, last
>>>> acting [5,18]
>>>> pg 9.3 is stuck inactive since forever, current state incomplete, last
>>>> acting [7,15]
>>>> pg 9.fc is stuck inactive for 201476.092908, current state incomplete,
>>>> last acting [13,5]
>>>> pg 11.33 is stuck inactive since forever, current state
>>>> down+incomplete, last acting [7,6]
>>>> pg 9.3f is stuck inactive since forever, current state incomplete, last
>>>> acting [5,14]
>>>> pg 9.a is stuck inactive for 113328.467457, current state incomplete,
>>>> last acting [18,7]
>>>> pg 2.63 is stuck inactive for 97665.176520, current state incomplete,
>>>> last acting [31,7]
>>>> pg 2.3 is stuck inactive for 97655.279670, current state incomplete,
>>>> last acting [14,5]
>>>> pg 2.32 is stuck inactive since forever, current state incomplete, last
>>>> acting [5,13]
>>>> pg 2.bf is stuck inactive for 99913.875808, current state incomplete,
>>>> last acting [15,7]
>>>> pg 9.26 is stuck inactive since forever, current state incomplete, last
>>>> acting [5,24]
>>>> pg 9.22 is stuck inactive since forever, current state incomplete, last
>>>> acting [7,24]
>>>> pg 9.25 is stuck unclean for 20091.777921, current state
>>>> active+degraded+remapped+wait_backfill, last acting [15,2]
>>>> pg 7.2b is stuck unclean for 98830.660179, current state
>>>> stale+active+undersized+degraded, last acting [5]
>>>> pg 11.27 is stuck unclean for 1777813.502308, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [4,36]
>>>> pg 2.f1 is stuck unclean for 26585.481715, current state
>>>> active+recovery_wait+degraded, last acting [13,8]
>>>> pg 9.22 is stuck unclean since forever, current state incomplete, last
>>>> acting [7,24]
>>>> pg 2.29 is stuck unclean for 5629.190514, current state
>>>> active+remapped+wait_backfill, last acting [24,40]
>>>> pg 9.fb is stuck unclean for 3640.777545, current state
>>>> active+remapped+wait_backfill, last acting [8,39]
>>>> pg 9.23 is stuck unclean for 3595.306511, current state
>>>> active+remapped+wait_backfill, last acting [35,9]
>>>> pg 2.f3 is stuck unclean for 4993.558900, current state
>>>> active+remapped+wait_backfill, last acting [6,9]
>>>> pg 2.f2 is stuck unclean for 8871.835444, current state
>>>> active+recovery_wait+degraded, last acting [6,4]
>>>> pg 2.2a is stuck unclean for 97629.478922, current state incomplete,
>>>> last acting [24,5]
>>>> pg 2.ed is stuck unclean for 3595.395657, current state
>>>> active+remapped+backfilling, last acting [9,40]
>>>> pg 2.24 is stuck unclean for 6391.873856, current state
>>>> active+remapped+wait_backfill, last acting [13,40]
>>>> pg 2.27 is stuck unclean for 6814.809178, current state
>>>> active+recovery_wait+degraded, last acting [13,3]
>>>> pg 2.e8 is stuck unclean for 11759.373756, current state
>>>> active+remapped+wait_backfill, last acting [15,36]
>>>> pg 11.29 is stuck unclean for 6907.684021, current state
>>>> active+remapped+wait_backfill, last acting [14,40]
>>>> pg 2.eb is stuck unclean for 14474.951608, current state
>>>> active+remapped+backfilling, last acting [0,31]
>>>> pg 2.ea is stuck unclean for 3595.396597, current state
>>>> active+remapped+backfilling, last acting [9,34]
>>>> pg 12.13 is stuck unclean for 5629.177184, current state
>>>> active+remapped, last acting [8,31]
>>>> pg 2.1d is stuck unclean for 12245.891518, current state
>>>> active+remapped+backfilling, last acting [3,6]
>>>> pg 11.15 is stuck unclean for 14683.173113, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [34,9]
>>>> pg 2.1c is stuck unclean for 14683.755228, current state
>>>> active+degraded+remapped+backfilling, last acting [14,11]
>>>> pg 11.16 is stuck unclean for 5629.180301, current state
>>>> active+remapped+wait_backfill, last acting [15,40]
>>>> pg 2.1f is stuck unclean for 11858.149360, current state
>>>> active+remapped+wait_backfill, last acting [15,3]
>>>> pg 0.1c is stuck unclean for 6907.683196, current state
>>>> active+remapped+wait_backfill, last acting [12,3]
>>>> pg 2.1e is stuck unclean for 102531.318993, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [13]
>>>> pg 2.e0 is stuck unclean for 3571.898995, current state
>>>> active+remapped+inconsistent+wait_backfill, last acting [6,9]
>>>> pg 2.18 is stuck unclean for 3502.358091, current state
>>>> active+remapped+backfilling, last acting [18,9]
>>>> pg 2.e3 is stuck unclean for 12047.716242, current state
>>>> active+remapped+backfilling, last acting [4,41]
>>>> pg 11.13 is stuck unclean for 6907.682681, current state
>>>> active+remapped+wait_backfill, last acting [14,8]
>>>> pg 9.d6 is stuck unclean for 7416.596559, current state
>>>> active+remapped+wait_backfill, last acting [1,9]
>>>> pg 9.1e is stuck unclean since forever, current state incomplete, last
>>>> acting [7,15]
>>>> pg 11.1c is stuck unclean since forever, current state down+incomplete,
>>>> last acting [6,7]
>>>> pg 2.15 is stuck unclean since forever, current state incomplete, last
>>>> acting [7,31]
>>>> pg 2.dc is stuck unclean for 11709.774640, current state
>>>> active+remapped+backfilling, last acting [40,4]
>>>> pg 2.14 is stuck unclean for 3504.589025, current state
>>>> active+remapped+backfilling, last acting [18,9]
>>>> pg 2.df is stuck unclean for 5047.489499, current state
>>>> active+remapped+wait_backfill, last acting [0,13]
>>>> pg 11.1e is stuck unclean for 1968924.322629, current state
>>>> active+remapped+wait_backfill, last acting [3,38]
>>>> pg 2.de is stuck unclean for 97621.617826, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [3]
>>>> pg 9.1d is stuck unclean for 48349.818420, current state
>>>> active+remapped+backfill_toofull, last acting [12,36]
>>>> pg 3.17 is stuck unclean for 5629.187939, current state
>>>> active+remapped, last acting [5,13]
>>>> pg 2.d8 is stuck unclean for 7418.583365, current state
>>>> active+remapped+backfilling, last acting [21,41]
>>>> pg 7.15 is stuck unclean for 98830.449502, current state
>>>> active+remapped+wait_backfill, last acting [13,2]
>>>> pg 11.19 is stuck unclean for 3925.828027, current state
>>>> active+remapped+wait_backfill, last acting [15,38]
>>>> pg 2.db is stuck unclean for 3595.396853, current state
>>>> active+remapped+backfilling, last acting [9,40]
>>>> pg 9.18 is stuck unclean for 27500.110917, current state
>>>> active+remapped+backfill_toofull, last acting [18,13]
>>>> pg 7.16 is stuck unclean since forever, current state incomplete, last
>>>> acting [6,7]
>>>> pg 2.13 is stuck unclean since forever, current state incomplete, last
>>>> acting [7,10]
>>>> pg 9.de is stuck unclean since forever, current state incomplete, last
>>>> acting [6,5]
>>>> pg 9.6 is stuck unclean for 219342.087677, current state
>>>> active+remapped+backfill_toofull, last acting [2,41]
>>>> pg 2.d is stuck unclean since forever, current state incomplete, last
>>>> acting [5,10]
>>>> pg 9.df is stuck unclean for 48360.843924, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [35,2]
>>>> pg 8.6 is stuck unclean for 5629.183555, current state active+remapped,
>>>> last acting [12,13]
>>>> pg 2.d7 is stuck unclean for 83782.680541, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [36]
>>>> pg 9.dc is stuck unclean for 113491.800754, current state incomplete,
>>>> last acting [6,7]
>>>> pg 7.a is stuck unclean for 3844.286529, current state
>>>> active+remapped+wait_backfill, last acting [38,2]
>>>> pg 9.5 is stuck unclean since forever, current state incomplete, last
>>>> acting [5,18]
>>>> pg 4.8 is stuck unclean for 3893.186289, current state
>>>> active+recovery_wait+degraded, last acting [15,2]
>>>> pg 3.d0 is stuck unclean for 7418.584435, current state
>>>> active+remapped+wait_backfill, last acting [12,2]
>>>> pg 2.d1 is stuck unclean for 83769.259615, current state
>>>> active+undersized+degraded+remapped+backfill_toofull, last acting [36]
>>>> pg 9.3 is stuck unclean since forever, current state incomplete, last
>>>> acting [7,15]
>>>> pg 9.d8 is stuck unclean for 115082.575647, current state
>>>> down+incomplete, last acting [21,5]
>>>> pg 2.b is stuck unclean for 7418.564413, current state
>>>> active+remapped+backfilling, last acting [40,24]
>>>> pg 9.d9 is stuck unclean for 14681.601684, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [39,4]
>>>> pg 9.1 is stuck unclean for 3930.973909, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [39,3]
>>>> pg 2.cc is stuck unclean for 5078.643356, current state
>>>> active+remapped, last acting [40,24]
>>>> pg 11.d is stuck unclean for 14592.297817, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [36,4]
>>>> pg 9.c5 is stuck unclean for 3844.281162, current state
>>>> active+remapped+wait_backfill, last acting [5,38]
>>>> pg 9.a is stuck unclean for 113328.467988, current state incomplete,
>>>> last acting [18,7]
>>>> pg 11.9 is stuck unclean for 7418.578072, current state
>>>> active+remapped+wait_backfill, last acting [21,39]
>>>> pg 2.0 is stuck unclean for 97873.488751, current state
>>>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last
>>>> acting [1]
>>>> pg 2.cb is stuck unclean for 25031.035830, current state
>>>> active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4]
>>>> pg 9.8 is stuck unclean for 24341.317696, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [5,24]
>>>> pg 2.3 is stuck unclean for 97655.280232, current state incomplete,
>>>> last acting [14,5]
>>>> pg 2.2 is stuck unclean for 97734.492834, current state
>>>> active+recovery_wait+degraded+remapped, last acting [13,9]
>>>> pg 2.c4 is stuck unclean for 3595.525931, current state
>>>> active+remapped+backfilling, last acting [34,9]
>>>> pg 2.c7 is stuck unclean for 8871.729496, current state
>>>> active+recovery_wait+degraded, last acting [13,2]
>>>> pg 9.cb is stuck unclean for 5629.175300, current state
>>>> active+remapped, last acting [11,31]
>>>> pg 9.c9 is stuck unclean for 14683.752701, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [5,34]
>>>> pg 2.c2 is stuck unclean for 3504.738005, current state
>>>> active+remapped+wait_backfill, last acting [9,15]
>>>> pg 2.bd is stuck unclean for 3571.325492, current state
>>>> active+remapped+backfilling, last acting [39,9]
>>>> pg 2.bf is stuck unclean for 99913.876400, current state incomplete,
>>>> last acting [15,7]
>>>> pg 9.b3 is stuck unclean for 3925.828356, current state
>>>> active+remapped+wait_backfill, last acting [15,35]
>>>> pg 2.b5 is stuck unclean for 28026.340079, current state
>>>> active+remapped, last acting [2,40]
>>>> pg 2.b6 is stuck unclean for 11859.834286, current state
>>>> active+remapped+backfilling, last acting [1,31]
>>>> pg 2.b0 is stuck unclean for 98000.689674, current state incomplete,
>>>> last acting [24,7]
>>>> pg 2.b3 is stuck unclean for 5629.182841, current state
>>>> active+remapped+backfilling, last acting [3,0]
>>>> pg 2.ad is stuck unclean for 6907.677050, current state
>>>> active+remapped+backfilling, last acting [2,39]
>>>> pg 2.ae is stuck unclean for 11862.967346, current state
>>>> active+remapped+backfilling, last acting [34,13]
>>>> pg 9.a0 is stuck unclean for 14683.746136, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [1,3]
>>>> pg 2.aa is stuck unclean for 3571.307756, current state
>>>> active+remapped+backfilling, last acting [40,9]
>>>> pg 2.a7 is stuck unclean for 25030.658836, current state
>>>> active+remapped+wait_backfill, last acting [2,1]
>>>> pg 2.a6 is stuck unclean for 3930.913873, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [2,35]
>>>> pg 9.ad is stuck unclean for 8871.819919, current state
>>>> active+recovery_wait+degraded, last acting [6,8]
>>>> pg 2.a1 is stuck unclean for 98785.889529, current state incomplete,
>>>> last acting [14,12]
>>>> pg 1.a0 is stuck unclean for 5629.186426, current state
>>>> active+remapped, last acting [5,40]
>>>> pg 9.a8 is stuck unclean for 118575.035913, current state incomplete,
>>>> last acting [14,7]
>>>> pg 2.a2 is stuck unclean since forever, current state incomplete, last
>>>> acting [5,13]
>>>> pg 2.9d is stuck unclean for 11861.496234, current state
>>>> active+remapped+backfilling, last acting [6,38]
>>>> pg 2.9c is stuck unclean for 3506.888979, current state
>>>> active+remapped+wait_backfill, last acting [35,11]
>>>> pg 2.9b is stuck unclean for 5629.183979, current state
>>>> active+remapped+wait_backfill, last acting [6,0]
>>>> pg 9.91 is stuck unclean for 85752.028652, current state
>>>> active+remapped+wait_backfill, last acting [31,9]
>>>> pg 2.97 is stuck unclean for 9736.783735, current state
>>>> active+remapped+backfilling, last acting [35,24]
>>>> pg 2.91 is stuck unclean for 28553.979772, current state
>>>> active+remapped+backfilling, last acting [0,24]
>>>> pg 2.90 is stuck unclean for 30364.623932, current state
>>>> active+degraded+remapped+backfill_toofull, last acting [41,24]
>>>> pg 2.92 is stuck unclean for 25031.211566, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [8]
>>>> pg 9.99 is stuck unclean for 11862.827419, current state
>>>> active+remapped+wait_backfill, last acting [13,4]
>>>> pg 2.8f is stuck unclean for 17426.148382, current state
>>>> active+remapped+wait_backfill, last acting [15,9]
>>>> pg 2.88 is stuck unclean for 3591.054564, current state
>>>> active+remapped+wait_backfill, last acting [14,9]
>>>> pg 9.8f is stuck unclean for 3595.395794, current state
>>>> active+remapped+wait_backfill, last acting [9,15]
>>>> pg 2.87 is stuck unclean for 3844.271547, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [1,2]
>>>> pg 2.81 is stuck unclean for 83759.347793, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 9.8a is stuck unclean for 27697.026446, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [12,1]
>>>> pg 2.79 is stuck unclean for 12137.676488, current state
>>>> active+remapped+backfilling, last acting [7,40]
>>>> pg 2.78 is stuck unclean for 29127.120125, current state
>>>> active+remapped+backfilling, last acting [0,6]
>>>> pg 2.75 is stuck unclean since forever, current state down+incomplete,
>>>> last acting [7,15]
>>>> pg 2.74 is stuck unclean for 97658.383751, current state incomplete,
>>>> last acting [13,5]
>>>> pg 9.7c is stuck unclean for 114170.469704, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 9.7d is stuck unclean for 14077.123326, current state
>>>> active+remapped+backfilling, last acting [5,24]
>>>> pg 2.71 is stuck unclean for 11859.344208, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [21,3]
>>>> pg 2.73 is stuck unclean for 11859.417605, current state
>>>> active+remapped+backfilling, last acting [39,15]
>>>> pg 9.78 is stuck unclean since forever, current state incomplete, last
>>>> acting [5,24]
>>>> pg 9.79 is stuck unclean for 14595.569162, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [39,3]
>>>> pg 2.6d is stuck unclean for 27802.265038, current state
>>>> active+remapped+backfilling, last acting [4,13]
>>>> pg 9.62 is stuck unclean for 25030.488507, current state
>>>> active+remapped+backfill_toofull, last acting [36,2]
>>>> pg 2.6a is stuck unclean for 20323.517565, current state
>>>> active+remapped+wait_backfill, last acting [6,40]
>>>> pg 9.6c is stuck unclean for 14234.077824, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [41,2]
>>>> pg 9.6a is stuck unclean for 27035.043476, current state
>>>> active+remapped+backfill_toofull, last acting [36,4]
>>>> pg 2.63 is stuck unclean for 97665.177288, current state incomplete,
>>>> last acting [31,7]
>>>> pg 2.5d is stuck unclean for 3549.763078, current state
>>>> active+remapped+wait_backfill, last acting [9,34]
>>>> pg 2.5e is stuck unclean for 97736.064280, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [35,36]
>>>> pg 2.52 is stuck unclean for 8871.832670, current state
>>>> active+recovery_wait+degraded, last acting [6,4]
>>>> pg 9.59 is stuck unclean for 26868.986032, current state
>>>> active+remapped+wait_backfill, last acting [31,34]
>>>> pg 2.4f is stuck unclean for 12108.325792, current state
>>>> active+remapped+backfilling, last acting [11,40]
>>>> pg 2.49 is stuck unclean for 30446.302835, current state
>>>> active+remapped+wait_backfill, last acting [9,24]
>>>> pg 9.42 is stuck unclean for 108836.104626, current state incomplete,
>>>> last acting [31,12]
>>>> pg 2.45 is stuck unclean for 11284.580305, current state
>>>> active+degraded+remapped+backfilling, last acting [24,2]
>>>> pg 9.4f is stuck unclean for 3893.672356, current state
>>>> active+remapped+wait_backfill, last acting [0,21]
>>>> pg 2.44 is stuck unclean for 27623.439527, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,11]
>>>> pg 9.4c is stuck unclean for 6907.681859, current state
>>>> active+remapped+wait_backfill, last acting [15,36]
>>>> pg 2.46 is stuck unclean for 6907.682263, current state
>>>> active+remapped+backfilling, last acting [11,24]
>>>> pg 9.49 is stuck unclean for 14683.624639, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [2,31]
>>>> pg 11.35 is stuck unclean for 5872394.444913, current state
>>>> active+remapped+wait_backfill, last acting [40,36]
>>>> pg 2.3e is stuck unclean for 6907.683506, current state
>>>> active+remapped+backfilling, last acting [4,41]
>>>> pg 2.38 is stuck unclean for 5140.320861, current state
>>>> active+remapped+wait_backfill, last acting [0,5]
>>>> pg 2.3b is stuck unclean for 14456.624593, current state
>>>> active+remapped+wait_backfill+backfill_toofull, last acting [18,2]
>>>> pg 11.33 is stuck unclean since forever, current state down+incomplete,
>>>> last acting [7,6]
>>>> pg 10.3d is stuck unclean for 3595.395921, current state
>>>> active+remapped+wait_backfill, last acting [9,36]
>>>> pg 2.35 is stuck unclean for 8872.226171, current state
>>>> active+recovery_wait+degraded, last acting [6,11]
>>>> pg 2.fc is stuck unclean for 5820.330202, current state
>>>> active+remapped+backfilling, last acting [31,0]
>>>> pg 9.3f is stuck unclean since forever, current state incomplete, last
>>>> acting [5,14]
>>>> pg 2.ff is stuck unclean for 3595.396088, current state
>>>> active+remapped+backfilling, last acting [9,39]
>>>> pg 2.fe is stuck unclean for 6904.439076, current state
>>>> active+remapped+backfilling, last acting [21,0]
>>>> pg 9.f5 is stuck unclean for 103009.439909, current state incomplete,
>>>> last acting [18,5]
>>>> pg 7.34 is stuck unclean for 3886.510000, current state
>>>> active+remapped+wait_backfill, last acting [13,39]
>>>> pg 2.fb is stuck unclean for 57173.985429, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,8]
>>>> pg 2.32 is stuck unclean since forever, current state incomplete, last
>>>> acting [5,13]
>>>> pg 9.fe is stuck unclean for 7418.564930, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,3]
>>>> pg 9.26 is stuck unclean since forever, current state incomplete, last
>>>> acting [5,24]
>>>> pg 2.f7 is stuck unclean for 6915.532617, current state
>>>> active+remapped+backfilling, last acting [4,15]
>>>> pg 9.fc is stuck unclean for 201476.093824, current state incomplete,
>>>> last acting [13,5]
>>>> pg 7.2b is stuck undersized for 64282.169836, current state
>>>> stale+active+undersized+degraded, last acting [5]
>>>> pg 2.1e is stuck undersized for 3895.207475, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [13]
>>>> pg 2.de is stuck undersized for 3886.529396, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [3]
>>>> pg 2.d7 is stuck undersized for 7417.316099, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [36]
>>>> pg 2.d1 is stuck undersized for 6903.297196, current state
>>>> active+undersized+degraded+remapped+backfill_toofull, last acting [36]
>>>> pg 2.0 is stuck undersized for 4999.401505, current state
>>>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last
>>>> acting [1]
>>>> pg 2.92 is stuck undersized for 4999.406547, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [8]
>>>> pg 2.81 is stuck undersized for 7417.378668, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 9.7c is stuck undersized for 3894.953894, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 9.25 is stuck degraded for 7413.083043, current state
>>>> active+degraded+remapped+wait_backfill, last acting [15,2]
>>>> pg 7.2b is stuck degraded for 64282.169913, current state
>>>> stale+active+undersized+degraded, last acting [5]
>>>> pg 2.f1 is stuck degraded for 3848.032008, current state
>>>> active+recovery_wait+degraded, last acting [13,8]
>>>> pg 2.f2 is stuck degraded for 7411.108195, current state
>>>> active+recovery_wait+degraded, last acting [6,4]
>>>> pg 2.27 is stuck degraded for 3893.230317, current state
>>>> active+recovery_wait+degraded, last acting [13,3]
>>>> pg 2.1c is stuck degraded for 7414.316299, current state
>>>> active+degraded+remapped+backfilling, last acting [14,11]
>>>> pg 2.1e is stuck degraded for 3895.207564, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [13]
>>>> pg 2.de is stuck degraded for 3886.529484, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [3]
>>>> pg 2.d7 is stuck degraded for 7417.316187, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [36]
>>>> pg 4.8 is stuck degraded for 3490.406821, current state
>>>> active+recovery_wait+degraded, last acting [15,2]
>>>> pg 2.d1 is stuck degraded for 6903.297288, current state
>>>> active+undersized+degraded+remapped+backfill_toofull, last acting [36]
>>>> pg 2.0 is stuck degraded for 4999.401597, current state
>>>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last
>>>> acting [1]
>>>> pg 2.cb is stuck degraded for 7413.316930, current state
>>>> active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4]
>>>> pg 2.2 is stuck degraded for 3894.930841, current state
>>>> active+recovery_wait+degraded+remapped, last acting [13,9]
>>>> pg 2.c7 is stuck degraded for 3886.500328, current state
>>>> active+recovery_wait+degraded, last acting [13,2]
>>>> pg 9.ad is stuck degraded for 7411.181412, current state
>>>> active+recovery_wait+degraded, last acting [6,8]
>>>> pg 2.90 is stuck degraded for 3893.715235, current state
>>>> active+degraded+remapped+backfill_toofull, last acting [41,24]
>>>> pg 2.92 is stuck degraded for 4999.406655, current state
>>>> active+undersized+degraded+remapped+backfilling, last acting [8]
>>>> pg 2.81 is stuck degraded for 7417.378776, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 9.7c is stuck degraded for 3894.954001, current state
>>>> active+undersized+degraded+remapped+wait_backfill, last acting [39]
>>>> pg 2.52 is stuck degraded for 7411.108431, current state
>>>> active+recovery_wait+degraded, last acting [6,4]
>>>> pg 2.45 is stuck degraded for 3892.755878, current state
>>>> active+degraded+remapped+backfilling, last acting [24,2]
>>>> pg 2.44 is stuck degraded for 7411.213966, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,11]
>>>> pg 2.35 is stuck degraded for 7411.295348, current state
>>>> active+recovery_wait+degraded, last acting [6,11]
>>>> pg 2.fb is stuck degraded for 6903.301076, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,8]
>>>> pg 9.fe is stuck degraded for 7413.453955, current state
>>>> active+recovery_wait+degraded+remapped, last acting [6,3]
>>>> pg 7.2b is stuck stale for 64232.262041, current state
>>>> stale+active+undersized+degraded, last acting [5]
>>>> pg 2.fc is active+remapped+backfilling, acting [31,0]
>>>> pg 2.ff is active+remapped+backfilling, acting [9,39]
>>>> pg 9.f5 is incomplete, acting [18,5]
>>>> pg 2.fe is active+remapped+backfilling, acting [21,0]
>>>> pg 2.fb is active+recovery_wait+degraded+remapped, acting [6,8]
>>>> pg 9.fe is active+recovery_wait+degraded+remapped, acting [6,3]
>>>> pg 9.fc is incomplete, acting [13,5]
>>>> pg 2.f7 is active+remapped+backfilling, acting [4,15]
>>>> pg 2.f1 is active+recovery_wait+degraded, acting [13,8]
>>>> pg 9.fb is active+remapped+wait_backfill, acting [8,39]
>>>> pg 2.f3 is active+remapped+wait_backfill, acting [6,9]
>>>> pg 2.f2 is active+recovery_wait+degraded, acting [6,4]
>>>> pg 2.ed is active+remapped+backfilling, acting [9,40]
>>>> pg 2.e8 is active+remapped+wait_backfill, acting [15,36]
>>>> pg 2.eb is active+remapped+backfilling, acting [0,31]
>>>> pg 2.ea is active+remapped+backfilling, acting [9,34]
>>>> pg 2.e0 is active+remapped+inconsistent+wait_backfill, acting [6,9]
>>>> pg 2.e3 is active+remapped+backfilling, acting [4,41]
>>>> pg 9.d6 is active+remapped+wait_backfill, acting [1,9]
>>>> pg 2.dc is active+remapped+backfilling, acting [40,4]
>>>> pg 2.df is active+remapped+wait_backfill, acting [0,13]
>>>> pg 2.de is active+undersized+degraded+remapped+backfilling, acting [3]
>>>> pg 2.d8 is active+remapped+backfilling, acting [21,41]
>>>> pg 2.db is active+remapped+backfilling, acting [9,40]
>>>> pg 9.de is incomplete, acting [6,5]
>>>> pg 9.df is active+remapped+wait_backfill+backfill_toofull, acting [35,2]
>>>> pg 9.dc is incomplete, acting [6,7]
>>>> pg 2.d7 is active+undersized+degraded+remapped+backfilling, acting [36]
>>>> pg 2.d1 is active+undersized+degraded+remapped+backfill_toofull, acting [36]
>>>> pg 3.d0 is active+remapped+wait_backfill, acting [12,2]
>>>> pg 9.d8 is down+incomplete, acting [21,5]
>>>> pg 9.d9 is active+remapped+wait_backfill+backfill_toofull, acting [39,4]
>>>> pg 9.c5 is active+remapped+wait_backfill, acting [5,38]
>>>> pg 2.cb is active+degraded+remapped+wait_backfill+backfill_toofull, acting [1,4]
>>>> pg 2.c4 is active+remapped+backfilling, acting [34,9]
>>>> pg 2.c7 is active+recovery_wait+degraded, acting [13,2]
>>>> pg 2.c2 is active+remapped+wait_backfill, acting [9,15]
>>>> pg 9.c9 is active+remapped+wait_backfill+backfill_toofull, acting [5,34]
>>>> pg 2.bd is active+remapped+backfilling, acting [39,9]
>>>> pg 2.bf is incomplete, acting [15,7]
>>>> pg 9.b3 is active+remapped+wait_backfill, acting [15,35]
>>>> pg 2.b6 is active+remapped+backfilling, acting [1,31]
>>>> pg 2.b0 is incomplete, acting [24,7]
>>>> pg 2.b3 is active+remapped+backfilling, acting [3,0]
>>>> pg 2.ad is active+remapped+backfilling, acting [2,39]
>>>> pg 2.ae is active+remapped+backfilling, acting [34,13]
>>>> pg 9.a0 is active+remapped+wait_backfill+backfill_toofull, acting [1,3]
>>>> pg 2.aa is active+remapped+backfilling, acting [40,9]
>>>> pg 2.a7 is active+remapped+wait_backfill, acting [2,1]
>>>> pg 2.a6 is active+remapped+wait_backfill+backfill_toofull, acting [2,35]
>>>> pg 9.ad is active+recovery_wait+degraded, acting [6,8]
>>>> pg 2.a1 is incomplete, acting [14,12]
>>>> pg 9.a8 is incomplete, acting [14,7]
>>>> pg 2.a2 is incomplete, acting [5,13]
>>>> pg 2.9d is active+remapped+backfilling, acting [6,38]
>>>> pg 2.9c is active+remapped+wait_backfill, acting [35,11]
>>>> pg 2.9b is active+remapped+wait_backfill, acting [6,0]
>>>> pg 9.91 is active+remapped+wait_backfill, acting [31,9]
>>>> pg 2.97 is active+remapped+backfilling, acting [35,24]
>>>> pg 2.91 is active+remapped+backfilling, acting [0,24]
>>>> pg 2.90 is active+degraded+remapped+backfill_toofull, acting [41,24]
>>>> pg 2.92 is active+undersized+degraded+remapped+backfilling, acting [8]
>>>> pg 9.99 is active+remapped+wait_backfill, acting [13,4]
>>>> pg 2.8f is active+remapped+wait_backfill, acting [15,9]
>>>> pg 2.88 is active+remapped+wait_backfill, acting [14,9]
>>>> pg 9.8f is active+remapped+wait_backfill, acting [9,15]
>>>> pg 2.87 is active+remapped+wait_backfill+backfill_toofull, acting [1,2]
>>>> pg 2.81 is active+undersized+degraded+remapped+wait_backfill, acting [39]
>>>> pg 9.8a is active+remapped+wait_backfill+backfill_toofull, acting [12,1]
>>>> pg 2.79 is active+remapped+backfilling, acting [7,40]
>>>> pg 2.78 is active+remapped+backfilling, acting [0,6]
>>>> pg 2.75 is down+incomplete, acting [7,15]
>>>> pg 2.74 is incomplete, acting [13,5]
>>>> pg 9.7c is active+undersized+degraded+remapped+wait_backfill, acting [39]
>>>> pg 9.7d is active+remapped+backfilling, acting [5,24]
>>>> pg 2.71 is active+remapped+wait_backfill+backfill_toofull, acting [21,3]
>>>> pg 2.73 is active+remapped+backfilling, acting [39,15]
>>>> pg 9.78 is incomplete, acting [5,24]
>>>> pg 9.79 is active+remapped+wait_backfill+backfill_toofull, acting [39,3]
>>>> pg 2.6d is active+remapped+backfilling, acting [4,13]
>>>> pg 9.62 is active+remapped+backfill_toofull, acting [36,2]
>>>> pg 2.6a is active+remapped+wait_backfill, acting [6,40]
>>>> pg 9.6c is active+remapped+wait_backfill+backfill_toofull, acting [41,2]
>>>> pg 9.6a is active+remapped+backfill_toofull, acting [36,4]
>>>> pg 2.63 is incomplete, acting [31,7]
>>>> pg 2.5d is active+remapped+wait_backfill, acting [9,34]
>>>> pg 2.5e is active+remapped+wait_backfill+backfill_toofull, acting [35,36]
>>>> pg 2.52 is active+recovery_wait+degraded, acting [6,4]
>>>> pg 9.59 is active+remapped+wait_backfill, acting [31,34]
>>>> pg 2.4f is active+remapped+backfilling, acting [11,40]
>>>> pg 2.49 is active+remapped+wait_backfill, acting [9,24]
>>>> pg 9.42 is incomplete, acting [31,12]
>>>> pg 2.45 is active+degraded+remapped+backfilling, acting [24,2]
>>>> pg 2.44 is active+recovery_wait+degraded+remapped, acting [6,11]
>>>> pg 9.4f is active+remapped+wait_backfill, acting [0,21]
>>>> pg 9.4c is active+remapped+wait_backfill, acting [15,36]
>>>> pg 2.46 is active+remapped+backfilling, acting [11,24]
>>>> pg 9.49 is active+remapped+wait_backfill+backfill_toofull, acting [2,31]
>>>> pg 11.35 is active+remapped+wait_backfill, acting [40,36]
>>>> pg 2.3e is active+remapped+backfilling, acting [4,41]
>>>> pg 2.38 is active+remapped+wait_backfill, acting [0,5]
>>>> pg 2.3b is active+remapped+wait_backfill+backfill_toofull, acting [18,2]
>>>> pg 11.33 is down+incomplete, acting [7,6]
>>>> pg 2.35 is active+recovery_wait+degraded, acting [6,11]
>>>> pg 10.3d is active+remapped+wait_backfill, acting [9,36]
>>>> pg 9.3f is incomplete, acting [5,14]
>>>> pg 7.34 is active+remapped+wait_backfill, acting [13,39]
>>>> pg 2.32 is incomplete, acting [5,13]
>>>> pg 9.26 is incomplete, acting [5,24]
>>>> pg 11.27 is active+remapped+wait_backfill+backfill_toofull, acting [4,36]
>>>> pg 9.25 is active+degraded+remapped+wait_backfill, acting [15,2]
>>>> pg 2.29 is active+remapped+wait_backfill, acting [24,40]
>>>> pg 9.22 is incomplete, acting [7,24]
>>>> pg 9.23 is active+remapped+wait_backfill, acting [35,9]
>>>> pg 2.2a is incomplete, acting [24,5]
>>>> pg 2.24 is active+remapped+wait_backfill, acting [13,40]
>>>> pg 2.27 is active+recovery_wait+degraded, acting [13,3]
>>>> pg 11.29 is active+remapped+wait_backfill, acting [14,40]
>>>> pg 2.1d is active+remapped+backfilling, acting [3,6]
>>>> pg 2.1c is active+degraded+remapped+backfilling, acting [14,11]
>>>> pg 11.15 is active+remapped+wait_backfill+backfill_toofull, acting [34,9]
>>>> pg 2.1f is active+remapped+wait_backfill, acting [15,3]
>>>> pg 11.16 is active+remapped+wait_backfill, acting [15,40]
>>>> pg 2.1e is active+undersized+degraded+remapped+backfilling, acting [13]
>>>> pg 0.1c is active+remapped+wait_backfill, acting [12,3]
>>>> pg 2.18 is active+remapped+backfilling, acting [18,9]
>>>> pg 11.13 is active+remapped+wait_backfill, acting [14,8]
>>>> pg 2.15 is incomplete, acting [7,31]
>>>> pg 11.1c is down+incomplete, acting [6,7]
>>>> pg 9.1e is incomplete, acting [7,15]
>>>> pg 2.14 is active+remapped+backfilling, acting [18,9]
>>>> pg 11.1e is active+remapped+wait_backfill, acting [3,38]
>>>> pg 9.1d is active+remapped+backfill_toofull, acting [12,36]
>>>> pg 11.19 is active+remapped+wait_backfill, acting [15,38]
>>>> pg 7.15 is active+remapped+wait_backfill, acting [13,2]
>>>> pg 2.13 is incomplete, acting [7,10]
>>>> pg 7.16 is incomplete, acting [6,7]
>>>> pg 9.18 is active+remapped+backfill_toofull, acting [18,13]
>>>> pg 2.d is incomplete, acting [5,10]
>>>> pg 9.6 is active+remapped+backfill_toofull, acting [2,41]
>>>> pg 7.a is active+remapped+wait_backfill, acting [38,2]
>>>> pg 4.8 is active+recovery_wait+degraded, acting [15,2]
>>>> pg 9.5 is incomplete, acting [5,18]
>>>> pg 9.3 is incomplete, acting [7,15]
>>>> pg 2.b is active+remapped+backfilling, acting [40,24]
>>>> pg 9.1 is active+remapped+wait_backfill+backfill_toofull, acting [39,3]
>>>> pg 11.d is active+remapped+wait_backfill+backfill_toofull, acting [36,4]
>>>> pg 9.a is incomplete, acting [18,7]
>>>> pg 2.0 is active+undersized+degraded+remapped+wait_backfill+backfill_toofull, acting [1]
>>>> pg 11.9 is active+remapped+wait_backfill, acting [21,39]
>>>> pg 2.3 is incomplete, acting [14,5]
>>>> pg 9.8 is active+remapped+wait_backfill+backfill_toofull, acting [5,24]
>>>> pg 2.2 is active+recovery_wait+degraded+remapped, acting [13,9]
>>>> 33 ops are blocked > 16777.2 sec
>>>> 368 ops are blocked > 8388.61 sec
>>>> 238 ops are blocked > 4194.3 sec
>>>> 87 ops are blocked > 1048.58 sec
>>>> 2 ops are blocked > 8388.61 sec on osd.5
>>>> 98 ops are blocked > 4194.3 sec on osd.5
>>>> 98 ops are blocked > 8388.61 sec on osd.6
>>>> 1 ops are blocked > 8388.61 sec on osd.7
>>>> 27 ops are blocked > 4194.3 sec on osd.7
>>>> 12 ops are blocked > 4194.3 sec on osd.13
>>>> 87 ops are blocked > 1048.58 sec on osd.13
>>>> 2 ops are blocked > 16777.2 sec on osd.14
>>>> 98 ops are blocked > 8388.61 sec on osd.14
>>>> 3 ops are blocked > 16777.2 sec on osd.15
>>>> 97 ops are blocked > 8388.61 sec on osd.15
>>>> 1 ops are blocked > 4194.3 sec on osd.18
>>>> 100 ops are blocked > 4194.3 sec on osd.24
>>>> 28 ops are blocked > 16777.2 sec on osd.31
>>>> 72 ops are blocked > 8388.61 sec on osd.31
>>>> 9 osds have slow requests
>>>> recovery 59636/5032695 objects degraded (1.185%)
>>>> recovery 1280976/5032695 objects misplaced (25.453%)
>>>> 1 scrub errors
>>>> noscrub,nodeep-scrub flag(s) set
>>>>
>>>>
>>>> The first failed host holds OSDs 6, 13, 14, 15, 18, 24 and 31.
>>>>
>>>> The second host that went down holds OSDs 5 and 7.
>>>>
>>>>
>>>>
>>>> On Sun, 2 Sep 2018 at 15:15, David Turner <drakonst...@gmail.com>
>>>> wrote:
>>>>
>>>>> When the first node went offline with a dead SSD journal, all of the
>>>>> data on the OSDs was useless. Unless you could flush the journals, you
>>>>> can't guarantee that a write the cluster thinks happened actually made it
>>>>> to the disk.  The proper procedure here is to remove those OSDs and add
>>>>> them again as new OSDs.
>>>>>
>>>>> `ceph health detail` will give you some more information on the
>>>>> blocked requests. Depending on what that shows you can often find the OSD
>>>>> that is causing the problems.  But your biggest problem is that you have
>>>>> disks with potentially inconsistent data in your cluster.
>>>>>
>>>>> On Sun, Sep 2, 2018, 4:42 AM Lee <lqui...@gmail.com> wrote:
>>>>>
>>>>>> Running 0.94.5 as part of an OpenStack environment, our Ceph setup is
>>>>>> 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our
>>>>>> hosting environment. 1 OSD node failed (offline with the journal SSD
>>>>>> dead), leaving 2 nodes running correctly. 2 hours later a second OSD
>>>>>> node failed, complaining of read/write errors to the physical drives. I
>>>>>> assume this was a heat issue, as when rebooted it came back online OK
>>>>>> and Ceph started to repair itself. We have since brought the first
>>>>>> failed node back online by replacing the SSD and recreating the
>>>>>> journals, hoping it would all repair.
>>>>>> Our pools are min 2 repl.
>>>>>>
>>>>>> The problem we have is that client IO (read) is totally blocked, and
>>>>>> when I query the stuck PGs it just hangs.
>>>>>>
>>>>>> For example the check version command just errors with:
>>>>>>
>>>>>> Error EINTR: problem getting command descriptions from on various
>>>>>> OSD's so I cannot even query the inactive PG's
>>>>>>
>>>>>> root@node31-a4:~# ceph -s
>>>>>>     cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
>>>>>>      health HEALTH_WARN
>>>>>>             83 pgs backfill
>>>>>>             2 pgs backfill_toofull
>>>>>>             3 pgs backfilling
>>>>>>             48 pgs degraded
>>>>>>             1 pgs down
>>>>>>             31 pgs incomplete
>>>>>>             1 pgs recovering
>>>>>>             29 pgs recovery_wait
>>>>>>             1 pgs stale
>>>>>>             48 pgs stuck degraded
>>>>>>             31 pgs stuck inactive
>>>>>>             1 pgs stuck stale
>>>>>>             148 pgs stuck unclean
>>>>>>             17 pgs stuck undersized
>>>>>>             17 pgs undersized
>>>>>>             599 requests are blocked > 32 sec
>>>>>>             recovery 111489/4697618 objects degraded (2.373%)
>>>>>>             recovery 772268/4697618 objects misplaced (16.440%)
>>>>>>             recovery 1/2171314 unfound (0.000%)
>>>>>>      monmap e5: 3 mons at {bc07s12-a7=172.27.16.11:6789/0,bc07s13-a7=172.27.16.21:6789/0,bc07s14-a7=172.27.16.15:6789/0}
>>>>>>             election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
>>>>>>      osdmap e18727: 25 osds: 25 up, 25 in; 90 remapped pgs
>>>>>>       pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
>>>>>>             16783 GB used, 6487 GB / 23270 GB avail
>>>>>>             111489/4697618 objects degraded (2.373%)
>>>>>>             772268/4697618 objects misplaced (16.440%)
>>>>>>             1/2171314 unfound (0.000%)
>>>>>>                 1639 active+clean
>>>>>>                   66 active+remapped+wait_backfill
>>>>>>                   30 incomplete
>>>>>>                   25 active+recovery_wait+degraded
>>>>>>                   15 active+undersized+degraded+remapped+wait_backfill
>>>>>>                    4 active+recovery_wait+degraded+remapped
>>>>>>                    4 active+clean+scrubbing
>>>>>>                    2 active+remapped+wait_backfill+backfill_toofull
>>>>>>                    1 down+incomplete
>>>>>>                    1 active+remapped+backfilling
>>>>>>                    1 active+clean+scrubbing+deep
>>>>>>                    1 stale+active+undersized+degraded
>>>>>>                    1 active+undersized+degraded+remapped+backfilling
>>>>>>                    1 active+degraded+remapped+backfilling
>>>>>>                    1 active+recovering+degraded
>>>>>> recovery io 29385 kB/s, 7 objects/s
>>>>>>   client io 5877 B/s wr, 1 op/s
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>
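
The suggestion quoted above, to use `ceph health detail` to find the OSDs
behind the blocked requests, can be sketched roughly as follows. This is only
an illustration, not part of any Ceph tooling: the function names
(`strip_quote`, `summarize`) and the regular expressions are assumptions based
on the line shapes pasted in this thread, and other Ceph releases may word
these lines differently. It tallies blocked ops per OSD and counts how many
incomplete PGs each OSD appears in. Status lines that the mail client wrapped
across two lines are simply skipped.

```python
import re
from collections import Counter

# Assumed line shape: "98 ops are blocked > 8388.61 sec on osd.6"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > [\d.]+ sec on osd\.(\d+)")
# Assumed line shape: "pg 9.f5 is incomplete, acting [18,5]"
PG_RE = re.compile(r"pg (\S+) is (\S+), acting \[([\d,]+)\]")


def strip_quote(line):
    """Drop leading '>' mail-quote markers and surrounding whitespace."""
    return line.lstrip("> \t").rstrip()


def summarize(text):
    """Return (blocked ops per OSD, incomplete-PG count per OSD)."""
    blocked = Counter()
    incomplete = Counter()
    for raw in text.splitlines():
        line = strip_quote(raw)
        m = BLOCKED_RE.search(line)
        if m:
            blocked[int(m.group(2))] += int(m.group(1))
            continue
        m = PG_RE.search(line)
        # The state is a '+'-joined list, e.g. "down+incomplete".
        if m and "incomplete" in m.group(2).split("+"):
            for osd in m.group(3).split(","):
                incomplete[int(osd)] += 1
    return blocked, incomplete


# A few lines copied from the health output in this thread.
SAMPLE = """\
pg 9.f5 is incomplete, acting [18,5]
pg 9.d8 is down+incomplete, acting [21,5]
pg 2.fc is active+remapped+backfilling, acting [31,0]
2 ops are blocked > 8388.61 sec on osd.5
98 ops are blocked > 4194.3 sec on osd.5
98 ops are blocked > 8388.61 sec on osd.6
"""

if __name__ == "__main__":
    blocked, incomplete = summarize(SAMPLE)
    for osd, ops in blocked.most_common():
        print(f"osd.{osd}: {ops} blocked ops")
    for osd, pgs in incomplete.most_common():
        print(f"osd.{osd}: in {pgs} incomplete PGs")
```

Feeding it the full saved output instead of `SAMPLE` (for example a text file
captured from `ceph health detail`) would give the per-OSD totals for the
whole cluster.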