Hi,

thanks for your answer. In fact I have several different problems, which
I tried to solve separately:

1) I lost 2 OSDs, and some pools have only 2 replicas, so some data was
lost.
2) One monitor refuses the Cuttlefish upgrade, so I only have 4 of 5
monitors running.
3) I have 4 old inconsistent PGs that I can't repair.
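
(For reference, "can't repair" refers to the standard scrub/repair
commands, for example on one of the four PGs listed in the health detail
further down; the pg id here is just that example, nothing more:

ceph pg deep-scrub 3.7c
ceph pg repair 3.7c
)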


So, the status:

   health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
   monmap e7: 5 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0}, election epoch 2584, quorum 0,1,2,3 a,b,c,e
   osdmap e82502: 50 osds: 48 up, 48 in
    pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean+scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean+scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail; 137KB/s rd, 1852KB/s wr, 199op/s
   mdsmap e1: 0/0/1 up



The tree:

# id    weight  type name       up/down reweight
-8      14.26   root SSDroot
-27     8               datacenter SSDrbx2
-26     8                       room SSDs25
-25     8                               net SSD188-165-12
-24     8                                       rack SSD25B09
-23     8                                               host lyll
46      2                                                       osd.46  up      1
47      2                                                       osd.47  up      1
48      2                                                       osd.48  up      1
49      2                                                       osd.49  up      1
-10     4.26            datacenter SSDrbx3
-12     2                       room SSDs43
-13     2                               net SSD178-33-122
-16     2                                       rack SSD43S01
-17     2                                               host kaino
42      1                                                       osd.42  up      1
43      1                                                       osd.43  up      1
-22     2.26                    room SSDs45
-21     2.26                            net SSD5-135-138
-20     2.26                                    rack SSD45F01
-19     2.26                                            host taman
44      1.13                                                    osd.44  up      1
45      1.13                                                    osd.45  up      1
-9      2               datacenter SSDrbx4
-11     2                       room SSDs52
-14     2                               net SSD176-31-226
-15     2                                       rack SSD52B09
-18     2                                               host dragan
40      1                                                       osd.40  up      1
41      1                                                       osd.41  up      1
-1      33.43   root SASroot
-100    15.9            datacenter SASrbx1
-90     15.9                    room SASs15
-72     15.9                            net SAS188-165-15
-40     8                                       rack SAS15B01
-3      8                                               host brontes
0       1                                                       osd.0   up      1
1       1                                                       osd.1   up      1
2       1                                                       osd.2   up      1
3       1                                                       osd.3   up      1
4       1                                                       osd.4   up      1
5       1                                                       osd.5   up      1
6       1                                                       osd.6   up      1
7       1                                                       osd.7   up      1
-41     7.9                                     rack SAS15B02
-6      7.9                                             host alim
24      1                                                       osd.24  up      1
25      1                                                       osd.25  down    0
26      1                                                       osd.26  up      1
27      1                                                       osd.27  up      1
28      1                                                       osd.28  up      1
29      1                                                       osd.29  up      1
30      1                                                       osd.30  up      1
31      0.9                                                     osd.31  up      1
-101    17.53           datacenter SASrbx2
-91     17.53                   room SASs27
-70     1.6                             net SAS188-165-13
-44     0                                       rack SAS27B04
-7      0                                               host bul
-45     1.6                                     rack SAS27B06
-4      1.6                                             host okko
32      0.2                                                     osd.32  up      1
33      0.2                                                     osd.33  up      1
34      0.2                                                     osd.34  up      1
35      0.2                                                     osd.35  up      1
36      0.2                                                     osd.36  up      1
37      0.2                                                     osd.37  up      1
38      0.2                                                     osd.38  up      1
39      0.2                                                     osd.39  up      1
-71     15.93                           net SAS188-165-14
-42     8                                       rack SAS27A03
-5      8                                               host noburo
8       1                                                       osd.8   up      1
9       1                                                       osd.9   up      1
18      1                                                       osd.18  up      1
19      1                                                       osd.19  up      1
20      1                                                       osd.20  up      1
21      1                                                       osd.21  up      1
22      1                                                       osd.22  up      1
23      1                                                       osd.23  up      1
-43     7.93                                    rack SAS27A04
-2      7.93                                            host keron
10      0.97                                                    osd.10  up      1
11      1                                                       osd.11  up      1
12      1                                                       osd.12  up      1
13      1                                                       osd.13  up      1
14      0.98                                                    osd.14  up      1
15      1                                                       osd.15  down    0
16      0.98                                                    osd.16  up      1
17      1                                                       osd.17  up      1


Here I have 2 roots: SSDroot and SASroot. All my OSD/PG problems are in
the SAS branch, and my CRUSH rules replicate per "net".
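
For reference, the SAS rule in the decompiled CRUSH map looks roughly like
this (the rule name and ruleset id below are invented for illustration; the
relevant part is the "step chooseleaf firstn 0 type net" line):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

rule sas_per_net {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take SASroot
        step chooseleaf firstn 0 type net
        step emit
}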

osd.15 has had a failing disk for a long time; its data was correctly
moved (the OSD stayed out until the cluster reached HEALTH_OK).
osd.25 is a buggy OSD that I can't remove or replace: if I rebalance its
PGs onto other OSDs, those OSDs crash. This problem started before I lost
osd.19: the OSD was unable to mark those PGs as inconsistent since it was
crashing during scrub. As far as I can tell, all the inconsistencies come
from this OSD.
osd.19 had a failing disk, which I have since replaced.
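
(For reference, by "moved" I mean the OSD was marked out and its data
drained off. The usual full removal sequence would be roughly the
following, and it is exactly this kind of rebalancing that osd.25 cannot
go through without crashing other OSDs:

ceph osd out 25
# wait for recovery/backfill to finish and the cluster to reach HEALTH_OK
ceph osd crush remove osd.25
ceph auth del osd.25
ceph osd rm 25
)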


And the health detail:

HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
pg 4.5c is stuck inactive since forever, current state incomplete, last acting [19,30]
pg 8.71d is stuck inactive since forever, current state incomplete, last acting [24,19]
pg 8.3fa is stuck inactive since forever, current state incomplete, last acting [19,31]
pg 8.3e0 is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.56c is stuck inactive since forever, current state incomplete, last acting [19,28]
pg 8.19f is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.792 is stuck inactive since forever, current state incomplete, last acting [19,28]
pg 4.0 is stuck inactive since forever, current state incomplete, last acting [28,19]
pg 8.78a is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.23e is stuck inactive since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck inactive since forever, current state incomplete, last acting [6,19]
pg 8.5e2 is stuck inactive since forever, current state incomplete, last acting [0,19]
pg 8.528 is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.20f is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.372 is stuck inactive since forever, current state incomplete, last acting [19,24]
pg 4.5c is stuck unclean since forever, current state incomplete, last acting [19,30]
pg 8.71d is stuck unclean since forever, current state incomplete, last acting [24,19]
pg 8.3fa is stuck unclean since forever, current state incomplete, last acting [19,31]
pg 8.3e0 is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.56c is stuck unclean since forever, current state incomplete, last acting [19,28]
pg 8.19f is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.792 is stuck unclean since forever, current state incomplete, last acting [19,28]
pg 4.0 is stuck unclean since forever, current state incomplete, last acting [28,19]
pg 8.78a is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.23e is stuck unclean since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck unclean since forever, current state incomplete, last acting [6,19]
pg 8.5e2 is stuck unclean since forever, current state incomplete, last acting [0,19]
pg 8.528 is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.20f is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.372 is stuck unclean since forever, current state incomplete, last acting [19,24]
pg 8.792 is incomplete, acting [19,28]
pg 8.78a is incomplete, acting [31,19]
pg 8.71d is incomplete, acting [24,19]
pg 8.5e2 is incomplete, acting [0,19]
pg 8.56c is incomplete, acting [19,28]
pg 8.528 is incomplete, acting [31,19]
pg 8.3fa is incomplete, acting [19,31]
pg 8.3e0 is incomplete, acting [31,19]
pg 8.372 is incomplete, acting [19,24]
pg 8.2ff is incomplete, acting [6,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [31,19]
pg 8.19f is incomplete, acting [31,19]
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 4.5c is incomplete, acting [19,30]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 4.0 is incomplete, acting [28,19]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]
osd.10 is near full at 85%
19 scrub errors
noout flag(s) set
mon.d (rank 4) addr 10.0.0.6:6789/0 is down (out of quorum)


Pools 4 and 8 have only 2 replicas, and pool 3 has 3 replicas but
inconsistent data.
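
(In case it helps, the replica counts can be checked with something like
the following; the pool name "rbd" is just an example here, I am going by
the pool ids 3, 4 and 8 from the health detail:

ceph osd dump | grep 'rep size'
ceph osd pool get rbd size
# and, once things are stable again: ceph osd pool set <pool> size 3
)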

Thanks in advance.

On Friday, May 17, 2013 at 00:14 -0700, John Wilkins wrote:
> If you can follow the documentation here:
> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/  and
> http://ceph.com/docs/master/rados/troubleshooting/  to provide some
> additional information, we may be better able to help you.
> 
> For example, "ceph osd tree" would help us understand the status of
> your cluster a bit better.
> 
> On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet <ceph.l...@daevel.fr> 
> wrote:
> > On Wednesday, May 15, 2013 at 00:15 +0200, Olivier Bonvalet wrote:
> >> Hi,
> >>
> >> I have some PGs in state down and/or incomplete on my cluster, because I
> >> lost 2 OSDs and a pool had only 2 replicas. So of course that
> >> data is lost.
> >>
> >> My problem now is that I can't get back to a "HEALTH_OK" status: if I try
> >> to remove, read or overwrite the corresponding RBD images, nearly all OSDs
> >> hang (well... they don't do anything and requests pile up in a growing
> >> queue until production is affected).
> >>
> >> So, what can I do to remove those corrupted images?
> >>
> >>
> >
> > Bump. Can nobody help me with this problem?
> >
> > Thanks,
> >
> > Olivier
> >
> 
> 
> 
> -- 
> John Wilkins
> Senior Technical Writer
> Inktank
> john.wilk...@inktank.com
> (415) 425-9599
> http://inktank.com
> 

