You shouldn't let the cluster get so full that losing a few OSDs pushes it
over the toofull threshold.  Letting the cluster reach 100% full is bad
enough that you should plan capacity so it never happens.


Ceph is supposed to stop moving data to an OSD once that OSD hits
osd_backfill_full_ratio, which defaults to 0.85.  Once a disk reaches 86%
full, the OSD stops accepting backfills.
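
For example, you can read the current value from a running OSD's admin
socket, or push a different value to all OSDs at runtime (osd.0 is just an
example ID; run the daemon command on the host that OSD lives on):

    ceph daemon osd.0 config show | grep backfill_full
    ceph tell osd.\* injectargs '--osd-backfill-full-ratio 0.85'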

I have verified this works when the disks fill up while the cluster is
healthy, but I haven't tried failing a disk once the cluster is already in
the toofull state.  Even so, mon_osd_full_ratio (default 0.95) or
osd_failsafe_full_ratio (default 0.97) should stop all IO until a human
gets involved.
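
To see how close you are to those thresholds, watch the disk usage on each
OSD host (assuming the default data path) and the cluster health, which
calls out nearfull/full OSDs:

    df -h /var/lib/ceph/osd/ceph-*
    ceph health detail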

The only gotcha I can find is that the values are percentages, and the test
is a "greater than" done with two significant digits.  I.e., if
osd_backfill_full_ratio is 0.85, it will continue backfilling until the
disk is 86% full.  So values of 0.99 or 1.00 will cause problems.
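
In other words, if you do raise any of these in ceph.conf, keep them
comfortably below 1.00.  For reference, the defaults written out explicitly
look like this:

    [mon]
        mon osd full ratio = 0.95

    [osd]
        osd backfill full ratio = 0.85
        osd failsafe full ratio = 0.97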


On Mon, Nov 17, 2014 at 6:50 PM, han vincent <hang...@gmail.com> wrote:

> hi, craig:
>
>     Your solution did work very well.  But if the data is important,
> removing PG directories from the OSDs by hand means a small mistake can
> result in data loss.  And if the cluster is very large, don't you think
> deleting data from the disks to get from 100% down to 95% full is a
> tedious and error-prone job, with so many OSDs, large disks, and so on?
>
>      So my key question is: if there is no space left in the cluster when
> some OSDs crash, why does the cluster still choose to migrate data?  And
> during that migration, the other OSDs crash one by one until the cluster
> can no longer work.
>
> 2014-11-18 5:28 GMT+08:00 Craig Lewis <cle...@centraldesktop.com>:
> > At this point, it's probably best to delete the pool.  I'm assuming the
> > pool only contains benchmark data, and nothing important.
> >
> > Assuming you can delete the pool:
> > First, figure out the ID of the data pool.  You can get that from
> > ceph osd dump | grep '^pool'
> >
> > Once you have the number, delete the data pool: rados rmpool data data
> > --yes-i-really-really-mean-it
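> >
> > Roughly, the whole sequence is just this (the pool name "data" comes from
> > your rados bench run; double-check the ID in the dump output before
> > removing anything):
> >
> >     ceph osd dump | grep '^pool'    # note the ID printed next to 'data'
> >     rados rmpool data data --yes-i-really-really-mean-it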
> >
> > That will only free up space on OSDs that are up.  You'll need to
> > manually delete some PGs on the OSDs that are 100% full.  Go to
> > /var/lib/ceph/osd/ceph-<OSDID>/current, and delete a few directories that
> > start with your data pool ID.  You don't need to delete all of them.
> > Once the disk is below 95% full, you should be able to start that OSD.
> > Once it's up, it will finish deleting the pool.
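> >
> > As a rough sketch, assuming the data pool turned out to have ID 0 (the PG
> > names below are made up; stop as soon as df shows you're under 95%):
> >
> >     cd /var/lib/ceph/osd/ceph-<OSDID>/current
> >     du -sh 0.*_head | sort -h       # PG directories for pool 0, by size
> >     rm -rf 0.1a_head 0.2b_head      # remove a few, then re-check df -h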
> >
> > If you can't delete the pool, cleaning up is still possible, but it's
> > more work, and you still run the risk of losing data if you make a
> > mistake.  You need to disable backfilling, then delete some PGs on each
> > OSD that's full.  Try to only delete one copy of each PG.  If you delete
> > every copy of a PG on all OSDs, then you lose the data that was in that
> > PG.  As before, once you delete enough that the disk is less than 95%
> > full, you can start the OSD.  Once you start it, start deleting your
> > benchmark data out of the data pool.  Once that's done, you can
> > re-enable backfilling (a rough sketch of the flag handling follows
> > below).  You may need to scrub or deep-scrub the OSDs you deleted data
> > from to get everything back to normal.
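> >
> > A minimal sketch of that sequence, assuming your release supports the
> > nobackfill/norecover flags (anything recent should):
> >
> >     ceph osd set nobackfill          # stop Ceph refilling the freed space
> >     ceph osd set norecover
> >     # ... delete one copy of a few PGs on each full OSD, start the OSD,
> >     #     delete the benchmark objects out of the data pool ...
> >     ceph osd unset norecover
> >     ceph osd unset nobackfill
> >     ceph osd deep-scrub <OSDID>      # then re-check the OSDs you touched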
> >
> >
> > So how did you get the disks 100% full anyway?  Ceph normally won't let
> > you do that.  Did you increase mon_osd_full_ratio,
> > osd_backfill_full_ratio, or osd_failsafe_full_ratio?
> >
> >
> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent <hang...@gmail.com> wrote:
> >>
> >> hello, everyone:
> >>
> >>     For the past few days a Ceph problem has been troubling me.
> >>
> >>     I built a cluster with 3 hosts, each with three OSDs.  After that
> >> I used the command "rados bench 360 -p data -b 4194304 -t 300 write
> >> --no-cleanup" to test the write performance of the cluster.
> >>
> >>     When the cluster was nearly full, no more data could be written to
> >> it.  Unfortunately, a host then hung, and a lot of PGs started migrating
> >> to other OSDs.  After a while, a lot of OSDs were marked down and out,
> >> and my cluster couldn't work any more.
> >>
> >>     The following is the output of "ceph -s":
> >>
> >>     cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
> >>     health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
> >> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
> >> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
> >> recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
> >> down, quorum 0,2 2,1
> >>      monmap e1: 3 mons at
> >> {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
> >> epoch 40, quorum 0,2 2,1
> >>      osdmap e173: 9 osds: 2 up, 2 in
> >>             flags full
> >>       pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
> >>             37541 MB used, 3398 MB / 40940 MB avail
> >>             945/29649 objects degraded (3.187%)
> >>                   34 stale+active+degraded+remapped
> >>                  176 stale+incomplete
> >>                  320 stale+down+peering
> >>                   53 active+degraded+remapped
> >>                  408 incomplete
> >>                    1 active+recovering+degraded
> >>                  673 down+peering
> >>                    1 stale+active+degraded
> >>                   15 remapped+peering
> >>                    3 stale+active+recovering+degraded+remapped
> >>                    3 active+degraded
> >>                   33 remapped+incomplete
> >>                    8 active+recovering+degraded+remapped
> >>
> >>     The following is the output of "ceph osd tree":
> >>     # id    weight  type name       up/down reweight
> >>     -1      9       root default
> >>     -3      9               rack unknownrack
> >>     -2      3                       host 10.0.0.97
> >>      0       1                               osd.0   down    0
> >>      1       1                               osd.1   down    0
> >>      2       1                               osd.2   down    0
> >>      -4      3                       host 10.0.0.98
> >>      3       1                               osd.3   down    0
> >>      4       1                               osd.4   down    0
> >>      5       1                               osd.5   down    0
> >>      -5      3                       host 10.0.0.70
> >>      6       1                               osd.6   up      1
> >>      7       1                               osd.7   up      1
> >>      8       1                               osd.8   down    0
> >>
> >> The following is part of the output of osd.0.log:
> >>
> >>     -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
> >> filestore(/data/osd/osd.0)  error (28) No space left on device not
> >> handled on operation 10 (15804.0.13, or op 13, counting from 0)
> >>     -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
> >> filestore(/data/osd/osd.0) ENOSPC handling not implemented
> >>     -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
> >> filestore(/data/osd/osd.0)  transaction dump:
> >>     ...
> >>     ...
> >>     0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
> >> function 'unsigned int
> >> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
> >> ThreadPool::TPHandle*)' thread 7fd9dd1ab700             time
> >> 2014-11-14 17:33:02.251570
> >>       os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
> >>
> >>       ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> >>      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x85) [0x17f8675]
> >>      2: (FileStore::_do_transaction(ObjectStore::Transaction&,
> >> unsigned long, int, ThreadPool::TPHandle*)+0x4855)         [0x1534c21]
> >>      3:
> (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
> >> std::allocator<ObjectStore::Transaction*> >&,      unsigned long,
> >> ThreadPool::TPHandle*)+0x101) [0x152d67d]
> >>      4: (FileStore::_do_op(FileStore::OpSequencer*,
> >> ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
> >>      5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
> >> ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
> >>      6:
> >> (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*,
> >> ThreadPool::TPHandle&)+0x37)      [0x15625e7]
> >>      7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
> >>      8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
> >>      9: (Thread::_entry_func(void*)+0x23) [0x1998117]
> >>     10: (()+0x79d1) [0x7fd9e92bf9d1]
> >>     11: (clone()+0x6d) [0x7fd9e78ca9dd]
> >>     NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> needed to interpret this.
> >>
> >>     It seems the error code was ENOSPC (no space left on device), so why
> >> did the OSD process exit with an assert at this point?  If there was no
> >> space left, why did the cluster still choose to migrate?  Only osd.6 and
> >> osd.7 were alive.  I tried to restart the other OSDs, but after a while
> >> they crashed again.  Now I can't read the data any more.
> >>     Is this a bug?  Can anyone help me?