That worked! FYI (for others in case it helps): I think the key things that fixed the problem were a) setting the tunables to optimal and then b) resetting all the weights back to 1 (per Gregory's suggestion below).
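(In case anyone wants the concrete commands: the two steps above correspond to roughly the following. The id range 0-22 simply reflects this 23-OSD cluster, so substitute your own OSD ids.)

  ceph osd crush tunables optimal                           # step (a): switch CRUSH to the optimal tunables profile
  for i in $(seq 0 22); do ceph osd reweight $i 1.0; done   # step (b): reset every override weight back to 1

The loop is just shorthand for running "ceph osd reweight <id> 1.0" once per OSD.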
On Tue, Feb 25, 2014 at 1:03 PM, Gregory Farnum <g...@inktank.com> wrote:
> With the reweight-by-utilization applied, CRUSH is failing to generate
> mappings of enough OSDs, so the system is falling back to keeping
> around copies that already exist, even though they aren't located on
> the correct CRUSH-mapped OSDs (since there aren't enough OSDs).
> Are your OSDs correctly weighted in CRUSH by their size? If not, you
> want to apply that there and return all of the monitor override
> weights to 1.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
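(For anyone reading along: the "weighted in CRUSH by their size" check that Greg mentions can be done with something like the commands below. The osd id and weight are placeholders; the usual convention is a CRUSH weight of roughly 1.0 per TB of disk.)

  ceph osd tree                         # the WEIGHT column should roughly track each OSD's disk size in TB
  ceph osd crush reweight osd.7 2.73    # placeholder example: give a ~3 TB OSD a CRUSH weight of about 2.73

Note that the CRUSH weight ("ceph osd crush reweight") is a different knob from the 0-1 override weight that "ceph osd reweight" and reweight-by-utilization adjust; Greg's suggestion is to fix the former and set the latter back to 1.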
> On Tue, Feb 25, 2014 at 9:19 AM, Gautam Saxena <gsax...@i-a-inc.com> wrote:
> > So the "backfill_toofull" was an old state; it disappeared after I
> > reweighted. Yesterday I even set the Ceph cluster's tunables to optimal,
> > added one more OSD, let it rebalance, and then, after rebalancing, ran
> > "ceph osd reweight-by-utilization 105". After several hours Ceph
> > stabilized (that is, no more recovery), but the final state is worse than
> > before. So here are my questions (I've also included the output of
> > "ceph -s" right after these questions):
> >
> > 1) Why are 153 PGs in "active+remapped" but not going anywhere? Shouldn't
> > they be more like "active+remapped+wait_backfill" instead?
> > 2) Why are 10 PGs "active+remapped+backfilling" when there is no actual
> > activity occurring in Ceph? Shouldn't they instead say
> > "active+remapped+wait_backfill+backfill_toofull"?
> > 3) Why is there a backfill_toofull at all when all my OSDs are well under
> > 95% full -- in fact, they are all under 81% full (as determined by the
> > "df -h" command)? (One theory I have is that the "too full" percentage is
> > based NOT on the actual physical space on the OSD, but on the *reweighted*
> > physical space. Is this theory accurate?)
> > 4) When I did a "ceph pg dump", I saw that all 153 PGs that are in
> > active+remapped have only 1 OSD in the "up" state but 2 OSDs in the
> > "acting" state. I'm confused about the difference between "up" and
> > "acting" -- does this scenario mean that if I lose the 1 OSD that is in
> > the "up" state, I lose data for that PG? Or does "acting" mean that the
> > PG's data is still on 2 OSDs, so I can afford to lose 1 OSD?
> >
> > --> ceph -s produces:
> >
> > ================
> > [root@ia2 ceph]# ceph -s
> >     cluster 14f78538-6085-43f9-ac80-e886ca4de119
> >      health HEALTH_WARN 10 pgs backfill; 5 pgs backfill_toofull; 10 pgs
> > backfilling; 173 pgs stuck unclean; recovery 44940/5858368 objects
> > degraded (0.767%)
> >      monmap e9: 3 mons at
> > {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0},
> > election epoch 500, quorum 0,1,2 ia1,ia2,ia3
> >      osdmap e9700: 23 osds: 23 up, 23 in
> >      pgmap v2003396: 1500 pgs, 1 pools, 11225 GB data, 2841 kobjects
> >            22452 GB used, 23014 GB / 45467 GB avail
> >            44940/5858368 objects degraded (0.767%)
> >                1327 active+clean
> >                   5 active+remapped+wait_backfill
> >                   5 active+remapped+wait_backfill+backfill_toofull
> >                 153 active+remapped
> >                  10 active+remapped+backfilling
> >   client io 4369 kB/s rd, 64377 B/s wr, 26 op/s
> > ==========
> >
> > On Sun, Feb 23, 2014 at 8:09 PM, Gautam Saxena <gsax...@i-a-inc.com> wrote:
> >>
> >> I have 19 PGs that are stuck unclean (see the result of ceph -s below). This
> >> occurred after I executed "ceph osd reweight-by-utilization 108" to
> >> resolve problems with "backfill_toofull" messages, which I believe occurred
> >> because my OSD sizes vary significantly (from a low of 600 GB to a
> >> high of 3 TB). How can I get Ceph to get these PGs out of stuck unclean?
> >> (And why is this occurring anyway?) My best guess at a fix (though I
> >> don't know why) is that I need to run:
> >>
> >> ceph osd crush tunables optimal
> >>
> >> However, my kernel version (on a fully up-to-date CentOS 6.5) is 2.6.32,
> >> which is well below the minimum required version of 3.6 stated in the
> >> documentation (http://ceph.com/docs/master/rados/operations/crush-map/) --
> >> so if I must run "ceph osd crush tunables optimal" to fix this problem, I
> >> presume I must upgrade my kernel first, right? Any thoughts, or am I
> >> chasing the wrong solution? (I want to avoid a kernel upgrade unless it's
> >> really needed.)
> >>
> >> =====================
> >>
> >> [root@ia2 ceph4]# ceph -s
> >>     cluster 14f78538-6085-43f9-ac80-e886ca4de119
> >>      health HEALTH_WARN 19 pgs backfilling; 19 pgs stuck unclean;
> >> recovery 42959/5511127 objects degraded (0.779%)
> >>      monmap e9: 3 mons at
> >> {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0},
> >> election epoch 496, quorum 0,1,2 ia1,ia2,ia3
> >>      osdmap e7931: 23 osds: 23 up, 23 in
> >>      pgmap v1904820: 1500 pgs, 1 pools, 10531 GB data, 2670 kobjects
> >>            18708 GB used, 26758 GB / 45467 GB avail
> >>            42959/5511127 objects degraded (0.779%)
> >>                1481 active+clean
> >>                  19 active+remapped+backfilling
> >>   client io 1457 B/s wr, 0 op/s
> >>
> >> [root@ia2 ceph4]# ceph -v
> >> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
> >>
> >> [root@ia2 ceph4]# uname -r
> >> 2.6.32-431.3.1.el6.x86_64
> >>
> >> ====
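(One more aside for anyone hitting the same questions about stuck PGs and "up" vs. "acting" sets: besides "ceph pg dump", something like the following gives a more targeted view -- the pgid here is only a placeholder.)

  ceph health detail           # lists the specific PGs behind each warning (stuck unclean, backfill_toofull, ...)
  ceph pg dump_stuck unclean   # just the stuck PGs, with their up and acting OSD sets
  ceph pg 3.5a query           # detailed peering/backfill state for a single PG (replace 3.5a with a real pgid)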
--
Gautam Saxena
President & CEO
Integrated Analysis Inc.

Making Sense of Data.(tm)
Biomarker Discovery Software | Bioinformatics Services | Data Warehouse Consulting | Data Migration Consulting
www.i-a-inc.com
gsax...@i-a-inc.com
(301) 760-3077 office
(240) 479-4272 direct
(301) 560-3463 fax

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com