Hi, this is just a test cluster, so I'm only testing the relationship between these ratios.
I made the changes as follows --

failsafe_full = 1 (osd failsafe full ratio = 1 in ceph.conf)
backfillfull = 0.99
nearfull = 0.95
full = 0.96

But the ceph health detail output shows a different story (different from what I set) --

OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
    full_ratio (0.96) < backfillfull_ratio (0.99), increased
    osd_failsafe_full_ratio (0.97) < full_ratio (0.99), increased

Also, as per the documentation <http://docs.ceph.com/docs/master/rados/operations/health-checks/#osd-out-of-order-full>, the expected order must be backfillfull < nearfull.

(I've put a rough sketch of the commands for setting and checking these at the bottom of this mail, below the quoted thread.)

Thanks for the response!

On Fri, Sep 15, 2017 at 1:27 AM, David Turner <drakonst...@gmail.com> wrote:

> The warning you are seeing is because those settings are out of order, and it's showing you which ones are greater than the ones they should be. backfillfull_ratio is supposed to be higher than nearfull_ratio, and osd_failsafe_full_ratio is supposed to be higher than full_ratio. nearfull_ratio is a warning that shows up in your ceph status but doesn't prevent anything from happening; backfillfull_ratio prevents backfilling from happening; and full_ratio prevents any IO from happening at all.
>
> That is the answer to your question, but below is addressing the ridiculous values you are trying to set those to.
>
> Why are you using such high ratios? By default 5% of the disk is reserved by root for root and nobody but root. I think that can be adjusted when you create the filesystem, but I am unaware if ceph-deploy does that or not. But if that is the setting, and if you're running your OSDs as user ceph (Jewel or later), then they will cap out at 95% full and the OS will fail to write to the OSD disk.
>
> (assuming you set your ratios in the proper order) You are leaving yourself no room for your cluster to recover from any sort of down or failed OSDs. I don't know what disks you're using, but I don't know of any that are guaranteed not to fail. If your disks can't perform any backfilling, then you can't recover from anything... including just restarting an OSD daemon or a node. Based on 97% nearfull being your setting, you're giving yourself a 2% warning period to add more storage before your cluster is incapable of receiving reads or writes. BUT you also set your cluster to not be able to backfill anything if the OSD is over 98% full. Those settings pretty much guarantee that you will be 100% stuck and unable to even add more storage to your cluster if you wait until your nearfull_ratio is triggered.
>
> I'm just going to say it... DON'T RUN WITH THESE SETTINGS EVER. DON'T EVEN COME CLOSE TO THESE SETTINGS, THEY ARE TERRIBLE!!!
>
> 90% full_ratio is good (95% is the default) because it is a setting you can change, and if you get into a situation where you need to recover your cluster and your cluster is full because of a failed node or anything, then you can change the full_ratio and still have a chance to recover your cluster.
>
> 80% nearfull_ratio is good (85% is the default) because it gives you 10% of usable disk space in which to add more storage to your cluster or clean up cruft you don't need. If it takes you a long time to get new hardware or to find things to delete in your cluster, consider a lower number for this warning.
>
> 85% backfillfull_ratio is good (90% is the default) for the same reason as full_ratio: you can increase it if you need to for a critical recovery. But with these settings a backfilling operation won't bring you so close to your full_ratio that you are in high danger of blocking all IO to your cluster.
>
> Even if you stick with the defaults you're in a good enough situation that you will most likely be able to recover from most failures in your cluster. But don't push them up unless you are in the middle of a catastrophic failure and you're doing it specifically to recover, after you have your game-plan resolution in place.
>
> On Thu, Sep 14, 2017 at 10:03 AM Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote:
>
>> On 14. sep. 2017 11:58, dE . wrote:
>> > Hi,
>> >     I got a ceph cluster where I'm getting an OSD_OUT_OF_ORDER_FULL health error, even though it appears that it is in order --
>> >
>> > full_ratio 0.99
>> > backfillfull_ratio 0.97
>> > nearfull_ratio 0.98
>> >
>> > These don't seem like a mistake to me, but ceph is complaining --
>> > OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
>> >     backfillfull_ratio (0.97) < nearfull_ratio (0.98), increased
>> >     osd_failsafe_full_ratio (0.97) < full_ratio (0.99), increased
>>
>> post output from
>>
>> ceph osd df
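For my own notes, here's roughly how I understand these should be set so they stay in the order the health check expects (nearfull < backfillfull < full < failsafe), using the values David suggested. The set-*-ratio commands are the Luminous monitor commands for the cluster-wide ratios; osd_failsafe_full_ratio is a per-OSD config option (default 0.97), so I'm leaving it at the default instead of forcing it to 1:

  # cluster-wide ratios, lowest to highest
  ceph osd set-nearfull-ratio 0.80
  ceph osd set-backfillfull-ratio 0.85
  ceph osd set-full-ratio 0.90

  # osd_failsafe_full_ratio must stay above full_ratio; if it ever
  # needs changing, it goes in ceph.conf under [osd], e.g.:
  #   osd failsafe full ratio = 0.97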
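And to double-check what the cluster actually has (plus the per-OSD usage Ronny asked about), I plan to run something like the following -- osd.0 is just an example id here, and the daemon command has to be run on the host where that OSD lives:

  # ratios as stored in the OSDMap
  ceph osd dump | grep -i ratio

  # the failsafe ratio is per-OSD; read it from the admin socket
  ceph daemon osd.0 config get osd_failsafe_full_ratio

  # re-check the health message and per-OSD utilization
  ceph health detail
  ceph osd df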