Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

Jeffrey McDonald Mon, 07 Mar 2016 13:57:43 -0800

Do you want me to enable this for the PG that already has unfound objects, or
for the placement group that was just scrubbed and is now inconsistent?
Jeff


On Mon, Mar 7, 2016 at 3:54 PM, Samuel Just <sj...@redhat.com> wrote:

> Can you enable
>
> debug osd = 20
> debug filestore = 20
> debug ms = 1
>
> on all osds in that PG, rescrub, and convey to us the resulting logs?
> -Sam
>
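One way to apply the settings Sam lists without restarting the daemons is to inject them at runtime on each OSD in the acting set. The sketch below only prints the commands, it does not run them. It assumes the PG in question is 70.459 with the acting set reported further down the thread, and it assumes "rescrub" means a deep scrub; the helper name `debug_commands` is made up for illustration.

```python
# Sketch: build (but do not execute) the commands that raise the debug levels
# on every OSD of one PG and then trigger a new deep scrub.
ACTING = [307, 210, 273, 191, 132, 450]        # acting set of pg 70.459 (from the thread)
PGID = "70.459"                                # assumption: this is the PG Sam means
DEBUG = {"osd": 20, "filestore": 20, "ms": 1}  # levels requested in the message above


def debug_commands(pgid, acting, levels):
    """Return one injectargs command per OSD, followed by a deep-scrub command."""
    args = " ".join(f"--debug-{name} {level}" for name, level in levels.items())
    cmds = [f"ceph tell osd.{osd} injectargs '{args}'" for osd in acting]
    cmds.append(f"ceph pg deep-scrub {pgid}")
    return cmds


commands = debug_commands(PGID, ACTING, DEBUG)
for cmd in commands:
    print(cmd)
```

Settings injected this way last only until the OSD restarts, and level 20 logging is very verbose, so the levels would normally be turned back down once the logs have been collected.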
> On Mon, Mar 7, 2016 at 1:36 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> > Here is a PG which just went inconsistent:
> >
> > pg 70.459 is active+clean+inconsistent, acting [307,210,273,191,132,450]
> >
> > Attached is the result of a pg query on this.   I will wait for your
> > feedback before issuing a repair.
> >
> > From what I have read, inconsistencies like these are most likely the
> > result of NTP problems, but all nodes use the local NTP master and all
> > show as synced.
> >
> > Regards,
> > Jeff
> >
> > On Mon, Mar 7, 2016 at 3:15 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> >>
> >> [ Keeping this on the users list. ]
> >>
> >> Okay, so next time this happens you probably want to do a pg query on
> >> the PG which has been reported as dirty. I can't help much beyond
> >> that, but hopefully Kefu or David will chime in once there's a little
> >> more for them to look at.
> >> -Greg
> >>
> >> On Mon, Mar 7, 2016 at 1:00 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> >> > Hi Greg,
> >> >
> >> > I'm running Ceph Hammer:
> >> > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> >> >
> >> > The hardware migration was performed by simply setting the CRUSH weight
> >> > to zero for the OSDs we wanted to retire. The system was performing
> >> > poorly with these older OSDs and we had a difficult time keeping it
> >> > stable. The old OSDs are still there, but all of the data has now been
> >> > migrated to new and/or existing hardware.
> >> >
> >> > Thanks,
> >> > Jeff
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Mar 7, 2016 at 2:56 PM, Gregory Farnum <gfar...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Mon, Mar 7, 2016 at 12:07 PM, Jeffrey McDonald <jmcdo...@umn.edu>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > For a while, we've been seeing inconsistent placement groups on our
> >> >> > erasure coded system. The placement groups go from a state of
> >> >> > active+clean to active+clean+inconsistent after a deep scrub:
> >> >> >
> >> >> >
> >> >> > 2016-03-07 13:45:42.044131 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 deep-scrub stat mismatch, got 21446/21428 objects, 0/0 clones, 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
> >> >> > 2016-03-07 13:45:42.044416 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 deep-scrub 18 missing, 0 inconsistent objects
> >> >> > 2016-03-07 13:45:42.044464 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320 deep-scrub 73 errors
> >> >> >
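To make the mismatch in the deep-scrub line above easier to read, here is a minimal sketch that pulls out every "A/B counter" pair and reports the ones that differ. Reading the first number as what the scrub counted and the second as what the PG stats recorded is an assumption on my part; the function name `stat_deltas` is made up for illustration.

```python
import re

# The stat mismatch message from the log above, joined onto one line.
LINE = ("70.320s0 deep-scrub stat mismatch, got 21446/21428 objects, 0/0 clones, "
        "21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, "
        "64682334170/64624353083 bytes,0/0 hit_set_archive bytes.")


def stat_deltas(line):
    """Return {counter: first - second} for every 'A/B counter' pair that differs."""
    deltas = {}
    # A counter name is one word, optionally followed by " bytes"
    # (as in "hit_set_archive bytes").
    for first, second, name in re.findall(r"(\d+)/(\d+) ([a-z_]+(?: bytes)?)", line):
        if first != second:
            deltas[name] = int(first) - int(second)
    return deltas


deltas = stat_deltas(LINE)
for name, diff in deltas.items():
    print(f"{name}: {diff:+d}")
```

The object and dirty counters both differ by 18, the same number that the next log line reports as missing.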
> >> >> > So I tell the placement group to perform a repair:
> >> >> >
> >> >> > 2016-03-07 13:49:26.047177 7f385d118700  0 log_channel(cluster) log [INF] : 70.320 repair starts
> >> >> > 2016-03-07 13:49:57.087291 7f3858b0a700  0 -- 10.31.0.2:6874/13937 >> 10.31.0.6:6824/8127 pipe(0x2e578000 sd=697 :6874
> >> >> >
> >> >> > The repair finds missing shards and repairs them, but then I have 18
> >> >> > 'unfound objects':
> >> >> >
> >> >> >
> >> >> > 2016-03-07 13:51:28.467590 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 repair stat mismatch, got 21446/21428 objects, 0/0 clones, 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
> >> >> > 2016-03-07 13:51:28.468358 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 repair 18 missing, 0 inconsistent objects
> >> >> > 2016-03-07 13:51:28.469431 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320 repair 73 errors, 73 fixed
> >> >> >
> >> >> >
> >> >> > I've traced one of the unfound objects all the way through the system
> >> >> > and found that it is not really lost. I can fail over the OSD and
> >> >> > recover the files. This is happening quite regularly now, after a
> >> >> > large migration of data from old hardware to new (the migration is
> >> >> > now complete).
> >> >> >
> >> >> > The system puts the PG into 'recovery', but we've seen it stay in a
> >> >> > recovering state for many days. Should we just be patient, or do we
> >> >> > need to dig further into the issue?
> >> >>
> >> >> You may need to dig into this more, although I'm not sure what the
> >> >> issue is likely to be. What version of Ceph are you running? How did
> >> >> you do this hardware migration?
> >> >> -Greg
> >> >>
> >> >> >
> >> >> >
> >> >> > pg 70.320 is stuck unclean for 704.803040, current state active+recovering, last acting [277,101,218,49,304,412]
> >> >> > pg 70.320 is active+recovering, acting [277,101,218,49,304,412], 18 unfound
> >> >> >
> >> >> > There is no indication of any problems with down OSDs or network
> >> >> > issues with OSDs.
> >> >> >
> >> >> > Thanks,
> >> >> > Jeff
> >> >> >
> >> >> >
> >> >> > --
> >> >> >
> >> >> > Jeffrey McDonald, PhD
> >> >> > Assistant Director for HPC Operations
> >> >> > Minnesota Supercomputing Institute
> >> >> > University of Minnesota Twin Cities
> >> >> > 599 Walter Library           email: jeffrey.mcdon...@msi.umn.edu
> >> >> > 117 Pleasant St SE           phone: +1 612 625-6905
> >> >> > Minneapolis, MN 55455        fax:   +1 612 624-8861
> >> >> >
> >> >> >
> >> >> >
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >
> >
>



