Okay, you're going to need to explain in very clear terms exactly what
happened to your cluster, and *exactly* what operations you performed
manually.

The PG shards seem to have different views of the PG in question. The
primary has a different log_tail, last_user_version, and last_epoch_clean
from the others. Plus different log sizes? It's not making a ton of sense
at first glance.
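
If you want to line those fields up yourself, something like this against the
query output should do it (just a rough sketch; I'm assuming the attachment is
a gzipped json file called query.json.gz and that you have jq handy):

  # query.json.gz = your attached (gzipped) "ceph pg ... query" output (name assumed)
  # the primary's view
  zcat query.json.gz | jq '.info
      | {log_tail, last_user_version,
         last_epoch_clean: .history.last_epoch_clean,
         log_size: .stats.log_size}'
  # each peer shard's view
  zcat query.json.gz | jq '.peer_info[]
      | {peer, log_tail, last_user_version,
         last_epoch_clean: .history.last_epoch_clean,
         log_size: .stats.log_size}'
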
-Greg

On Thu, Oct 19, 2017 at 1:08 AM Stijn De Weirdt <stijn.dewei...@ugent.be>
wrote:

> hi greg,
>
> i attached the gzip output of the query and some more info below. if you
> need more, let me know.
>
> stijn
>
> > [root@mds01 ~]# ceph -s
> >     cluster 92beef0a-1239-4000-bacf-4453ab630e47
> >      health HEALTH_ERR
> >             1 pgs inconsistent
> >             40 requests are blocked > 512 sec
> >             1 scrub errors
> >             mds0: Behind on trimming (2793/30)
> >      monmap e1: 3 mons at {mds01=
> 1.2.3.4:6789/0,mds02=1.2.3.5:6789/0,mds03=1.2.3.6:6789/0}
> >             election epoch 326, quorum 0,1,2 mds01,mds02,mds03
> >       fsmap e238677: 1/1/1 up {0=mds02=up:active}, 2 up:standby
> >      osdmap e79554: 156 osds: 156 up, 156 in
> >             flags sortbitwise,require_jewel_osds
> >       pgmap v51003893: 4096 pgs, 3 pools, 387 TB data, 243 Mobjects
> >             545 TB used, 329 TB / 874 TB avail
> >                 4091 active+clean
> >                    4 active+clean+scrubbing+deep
> >                    1 active+clean+inconsistent
> >   client io 284 kB/s rd, 146 MB/s wr, 145 op/s rd, 177 op/s wr
> >   cache io 115 MB/s flush, 153 MB/s evict, 14 op/s promote, 3 PG(s)
> flushing
>
> > [root@mds01 ~]# ceph health detail
> > HEALTH_ERR 1 pgs inconsistent; 52 requests are blocked > 512 sec; 5 osds
> have slow requests; 1 scrub errors; mds0: Behind on trimming (2782/30)
> > pg 5.5e3 is active+clean+inconsistent, acting
> [35,50,91,18,139,59,124,40,104,12,71]
> > 34 ops are blocked > 524.288 sec on osd.8
> > 6 ops are blocked > 524.288 sec on osd.67
> > 6 ops are blocked > 524.288 sec on osd.27
> > 1 ops are blocked > 524.288 sec on osd.107
> > 5 ops are blocked > 524.288 sec on osd.116
> > 5 osds have slow requests
> > 1 scrub errors
> > mds0: Behind on trimming (2782/30)(max_segments: 30, num_segments: 2782)
>
> > # zgrep -C 1 ERR ceph-osd.35.log.*.gz
> > ceph-osd.35.log.5.gz:2017-10-14 11:25:52.260668 7f34d6748700  0 --
> 10.141.16.13:6801/1001792 >> 1.2.3.11:6803/1951 pipe(0x56412da80800
> sd=273 :6801 s=2 pgs=3176 cs=31 l=0 c=0x564156e83b00).fault with nothing to
> send, going to standby
> > ceph-osd.35.log.5.gz:2017-10-14 11:26:06.071011 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3s0 shard 59(5) missing
> 5:c7ae919b:::10014d3184b.00000000:head
> > ceph-osd.35.log.5.gz:2017-10-14 11:28:36.465684 7f34ffdf5700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.21:6829/1834 pipe(0x56414e2a2000 sd=37
> :6801 s=0 pgs=0 cs=0 l=0 c=0x5641470d2a00).accept connect_seq 33 vs
> existing 33 state standby
> > ceph-osd.35.log.5.gz:--
> > ceph-osd.35.log.5.gz:2017-10-14 11:43:35.570711 7f3508efd700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.20:6825/1806 pipe(0x56413be34000 sd=138
> :6801 s=2 pgs=2763 cs=45 l=0 c=0x564132999480).fault with nothing to send,
> going to standby
> > ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235548 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3s0 deep-scrub 1 missing, 0
> inconsistent objects
> > ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235554 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3 deep-scrub 1 errors
> > ceph-osd.35.log.5.gz:2017-10-14 11:59:02.331454 7f34d6d4e700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.11:6817/1941 pipe(0x56414d370800 sd=227
> :42104 s=2 pgs=3238 cs=89 l=0 c=0x56413122d200).fault with nothing to send,
> going to standby
>
>
>
> On 10/18/2017 10:19 PM, Gregory Farnum wrote:
> > It would help if you can provide the exact output of "ceph -s", "pg query",
> > and any other relevant data. You shouldn't need to do manual repair of
> > erasure-coded pools, since they have checksums and can tell which bits are
> > bad. Following that article may not have done you any good (though I
> > wouldn't expect it to hurt, either...).
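> >
> > For reference, the non-manual path would be something like this (just a
> > sketch; list-inconsistent-obj only reports data from the most recent
> > deep-scrub, and <pgid> is a placeholder for your inconsistent pg):
> >
> >   # see which object/shard(s) the scrub actually flagged, per osd
> >   rados list-inconsistent-obj <pgid> --format=json-pretty
> >   # let the primary rebuild the bad/missing shard from the healthy ones
> >   ceph pg repair <pgid>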
> > -Greg
> >
> > On Wed, Oct 18, 2017 at 5:56 AM Stijn De Weirdt <stijn.dewei...@ugent.be>
> > wrote:
> >
> >> hi all,
> >>
> >> we have a ceph 10.2.7 cluster with an 8+3 EC pool.
> >> in that pool, there is a pg in inconsistent state.
> >>
> >> we followed http://ceph.com/geen-categorie/ceph-manually-repair-object/,
> >> however, we are unable to solve our issue.
> >>
> >> according to the primary osd logs, the reported pg had a missing object.
> >>
> >> we found a related object on the primary osd, and then looked for
> >> similar ones on the other osds in the same path (i guess it just has the
> >> index of the osd in the pg's list of osds suffixed)
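> >>
> >> (fwiw, on filestore those shard dirs live under something like
> >>   /var/lib/ceph/osd/ceph-<osdid>/current/<pgid>s<N>_head/
> >> where <N> seems to be the shard's position in the pg's acting set; the
> >> osd id, pgid and mount point are placeholders)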
> >>
> >> one osd did not have such a file (the 10 others did).
> >>
> >> so we did the "stop osd/flush/start osd/pg repair" on both the primary
> >> osd and on the osd with the missing EC part.
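> >>
> >> roughly, on each of those two osds (assuming systemd units; <id> and
> >> <pgid> are placeholders):
> >>
> >>   systemctl stop ceph-osd@<id>
> >>   ceph-osd -i <id> --flush-journal
> >>   systemctl start ceph-osd@<id>
> >>
> >> followed by
> >>
> >>   ceph pg repair <pgid>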
> >>
> >> however, the scrub error still exists.
> >>
> >> does anyone have any hints on what to do in this case?
> >>
> >> stijn
> >>
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
