You shouldn't let the cluster get so full that losing a few OSDs pushes it into the toofull state. Letting the cluster reach 100% full is bad enough that you should actively make sure it never happens.
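
Keeping an eye on utilization is cheap. Something like the following (just the standard status commands, nothing specific to your cluster) will show overall usage and warn about nearfull or full OSDs before it becomes an emergency:

    ceph df              # cluster-wide and per-pool usage
    ceph health detail   # lists any OSDs that are nearfull or full
    ceph osd tree        # which OSDs are up/in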

Ceph is supposed to stop moving data to an OSD once that OSD hits osd_backfill_full_ratio, which defaults to 0.85; any disk at 86% full will stop backfilling. I have verified that this works when the disks fill up while the cluster is healthy, but I haven't tried failing a disk once the cluster is already toofull. Even so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default 0.97) should stop all IO until a human gets involved.

The only gotcha I can find is that the values are treated as percentages, and the test is a "greater than" done with two significant digits. That is, if osd_backfill_full_ratio is 0.85, backfilling continues until the disk is 86% full. So values of 0.99 and 1.00 will cause problems.
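
If you want to see what a daemon is actually running with, you can ask it over its admin socket. A rough sketch, assuming the default socket path and that osd.0 is a local daemon (adjust the names to your cluster):

    # what the OSD thinks the ratios are
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio

    # change the backfill cutoff at runtime; add it to ceph.conf to make it permanent
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.85'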

On Mon, Nov 17, 2014 at 6:50 PM, han vincent <hang...@gmail.com> wrote:
> hi, craig:
>
> Your solution did work very well. But if the data is important, a small
> mistake while removing PG directories from the OSDs will result in loss
> of data. And if the cluster is very large, don't you think deleting data
> on each disk to get from 100% down to 95% is a tedious and error-prone
> thing, with so many OSDs, large disks, and so on?
>
> So my key question is: if there is no space left in the cluster when some
> OSDs crash, why does the cluster choose to migrate data at all? And
> during the migration, the other OSDs crash one by one until the cluster
> can no longer work.
>
> 2014-11-18 5:28 GMT+08:00 Craig Lewis <cle...@centraldesktop.com>:
> > At this point, it's probably best to delete the pool. I'm assuming the
> > pool only contains benchmark data, and nothing important.
> >
> > Assuming you can delete the pool:
> > First, figure out the ID of the data pool. You can get that from
> > ceph osd dump | grep '^pool'
> >
> > Once you have the number, delete the data pool:
> > rados rmpool data data --yes-i-really-really-mean-it
> >
> > That will only free up space on OSDs that are up. You'll need to
> > manually delete some PGs on the OSDs that are 100% full. Go to
> > /var/lib/ceph/osd/ceph-<OSDID>/current, and delete a few directories
> > that start with your data pool ID. You don't need to delete all of
> > them. Once the disk is below 95% full, you should be able to start that
> > OSD. Once it's up, it will finish deleting the pool.
> >
> > If you can't delete the pool, it is possible, but it's more work, and
> > you still run the risk of losing data if you make a mistake. You need
> > to disable backfilling, then delete some PGs on each OSD that's full.
> > Try to only delete one copy of each PG. If you delete every copy of a
> > PG on all OSDs, then you have lost the data that was in that PG. As
> > before, once you delete enough that the disk is less than 95% full, you
> > can start the OSD. Once you start it, start deleting your benchmark
> > data out of the data pool. Once that's done, you can re-enable
> > backfilling. You may need to scrub or deep-scrub the OSDs you deleted
> > data from to get everything back to normal.
> >
> > So how did you get the disks 100% full anyway? Ceph normally won't let
> > you do that. Did you increase mon_osd_full_ratio,
> > osd_backfill_full_ratio, or osd_failsafe_full_ratio?
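
To make the steps in my reply quoted above concrete, the cleanup boils down to roughly the following. This is only a sketch: the pool ID (0), the OSD number (2), and the PG directory names are examples, so verify them against your own cluster before deleting anything.

    # 1. find the ID of the data pool (first number on the matching line)
    ceph osd dump | grep '^pool'

    # 2. delete the benchmark pool; this frees space on the OSDs that are still up
    rados rmpool data data --yes-i-really-really-mean-it

    # 3. on a full OSD that won't start, remove a few PG directories that
    #    belong to that pool ID, just enough to get the disk below 95%
    cd /var/lib/ceph/osd/ceph-2/current
    rm -rf 0.1f_head 0.3a_head        # example directory names only

    # 4. start the OSD again (however you normally start it) and let it
    #    finish deleting the pool itself
    /etc/init.d/ceph start osd.2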

> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent <hang...@gmail.com> wrote:
> >>
> >> hello, everyone:
> >>
> >> A problem with ceph has been troubling me for a while now.
> >>
> >> I built a cluster with 3 hosts, and each host has three OSDs in it.
> >> After that I used the command
> >>     rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup
> >> to test the write performance of the cluster.
> >>
> >> When the cluster was nearly full, no more data could be written to it.
> >> Unfortunately, a host then hung up, and a lot of PGs started to migrate
> >> to other OSDs. After a while, a lot of OSDs were marked down and out,
> >> and my cluster couldn't work any more.
> >>
> >> The following is the output of "ceph -s":
> >>
> >>     cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
> >>      health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
> >>        incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale;
> >>        1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck
> >>        unclean; recovery 945/29649 objects degraded (3.187%); 1 full
> >>        osd(s); 1 mons down, quorum 0,2 2,1
> >>      monmap e1: 3 mons at
> >>        {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0},
> >>        election epoch 40, quorum 0,2 2,1
> >>      osdmap e173: 9 osds: 2 up, 2 in
> >>             flags full
> >>      pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
> >>            37541 MB used, 3398 MB / 40940 MB avail
> >>            945/29649 objects degraded (3.187%)
> >>                  34 stale+active+degraded+remapped
> >>                 176 stale+incomplete
> >>                 320 stale+down+peering
> >>                  53 active+degraded+remapped
> >>                 408 incomplete
> >>                   1 active+recovering+degraded
> >>                 673 down+peering
> >>                   1 stale+active+degraded
> >>                  15 remapped+peering
> >>                   3 stale+active+recovering+degraded+remapped
> >>                   3 active+degraded
> >>                  33 remapped+incomplete
> >>                   8 active+recovering+degraded+remapped
> >>
> >> The following is the output of "ceph osd tree":
> >>
> >>     # id    weight  type name               up/down reweight
> >>     -1      9       root default
> >>     -3      9           rack unknownrack
> >>     -2      3               host 10.0.0.97
> >>     0       1                   osd.0       down    0
> >>     1       1                   osd.1       down    0
> >>     2       1                   osd.2       down    0
> >>     -4      3               host 10.0.0.98
> >>     3       1                   osd.3       down    0
> >>     4       1                   osd.4       down    0
> >>     5       1                   osd.5       down    0
> >>     -5      3               host 10.0.0.70
> >>     6       1                   osd.6       up      1
> >>     7       1                   osd.7       up      1
> >>     8       1                   osd.8       down    0
> >>
> >> The following is part of the output of osd.0.log:
> >>
> >>     -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
> >>        filestore(/data/osd/osd.0) error (28) No space left on device
> >>        not handled on operation 10 (15804.0.13, or op 13, counting
> >>        from 0)
> >>     -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
> >>        filestore(/data/osd/osd.0) ENOSPC handling not implemented
> >>     -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
> >>        filestore(/data/osd/osd.0) transaction dump:
> >>     ...
> >>     ...
> >>      0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
> >>        function 'unsigned int
> >>        FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t,
> >>        int, ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time
> >>        2014-11-14 17:33:02.251570
> >>     os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
> >>
> >>     ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> >>      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>         const*)+0x85) [0x17f8675]
> >>      2: (FileStore::_do_transaction(ObjectStore::Transaction&,
> >>         unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
> >>      3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
> >>         std::allocator<ObjectStore::Transaction*> >&, unsigned long,
> >>         ThreadPool::TPHandle*)+0x101) [0x152d67d]
> >>      4: (FileStore::_do_op(FileStore::OpSequencer*,
> >>         ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
> >>      5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
> >>         ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
> >>      6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*,
> >>         ThreadPool::TPHandle&)+0x37) [0x15625e7]
> >>      7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
> >>      8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
> >>      9: (Thread::_entry_func(void*)+0x23) [0x1998117]
> >>      10: (()+0x79d1) [0x7fd9e92bf9d1]
> >>      11: (clone()+0x6d) [0x7fd9e78ca9dd]
> >>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>      needed to interpret this.
> >>
> >> It seems the error code was ENOSPC (no space left on device), so why
> >> did the osd process exit with an assert at this point? If there was no
> >> space left, why did the cluster choose to migrate at all? Only osd.6
> >> and osd.7 were still alive. I tried to restart the other OSDs, but
> >> after a while those OSDs crashed again, and now I can't read the data
> >> any more. Is it a bug? Can anyone help me?
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
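
One last thought on the "why does it migrate at all" question: as soon as down OSDs are marked out, Ceph starts backfilling their PGs onto the survivors, and on a nearly full cluster that is exactly what pushes the remaining OSDs over the edge. While you clean up, you can tell the cluster to sit still. A sketch using the standard cluster flags (remember to unset them afterwards):

    ceph osd set noout        # don't mark down OSDs out; marking out triggers migration
    ceph osd set nobackfill   # pause backfill
    ceph osd set norecover    # pause recovery

    # ...free up space and restart the failed OSDs...

    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset noout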

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com