ok, now I understand, thanks for all these helpful answers!

On Sat, Apr 7, 2018, 15:26 David Turner <drakonst...@gmail.com> wrote:
> I'm seconding what Greg is saying. There is no reason to set nobackfill
> and norecover just for restarting OSDs. That will only cause the problems
> you're seeing without giving you any benefit. There are reasons to use
> norecover and nobackfill, but unless you're manually editing the crush map,
> having OSDs consistently segfault, or for some other reason really just
> need to stop the IO from recovery, then they aren't the flags for you. Even
> then, nobackfill is most likely what you need and norecover is still
> probably not helpful.
>
> On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum <gfar...@redhat.com> wrote:
>
>> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski <scoot...@gmail.com> wrote:
>>
>>> Greg, thanks for your reply!
>>>
>>> I think your idea makes sense. I've done some tests and it's quite hard
>>> for me to understand, so I'll try to explain my situation in a few
>>> steps below.
>>> I think Ceph shows progress in recovery, but it can only handle objects
>>> which haven't really changed. It won't try to repair objects which are
>>> really degraded, because of the norecover flag. Am I right?
>>> After a while I see blocked requests (as you can see below).
>>>
>>
>> Yeah, so the implementation of this is a bit funky. Basically, when the
>> OSD gets a map specifying norecovery, it will prevent any new recovery
>> ops from starting once it processes that map. But it doesn't change the
>> state of the PGs out of recovery; they just won't queue up more work.
>>
>> So probably the existing recovery IO was from OSDs that weren't
>> up-to-date yet. Or maybe there's a bug in the norecover implementation;
>> it definitely looks a bit fragile.
>>
>> But really I just wouldn't use that command. It's an expert flag that
>> you shouldn't use except in some extreme wonky cluster situations (and
>> even those may no longer exist in modern Ceph). For the use case you
>> shared in your first email, I'd just stick with noout.
>> -Greg
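
For reference, the noout-only approach Greg recommends comes down to a
handful of commands; a minimal sketch, assuming a systemd-managed node and
osd.12 as an example OSD id (both assumptions, adjust to your environment):

    # prevent the stopped OSD from being marked out and triggering rebalancing
    ceph osd set noout

    # stop the OSD, do the maintenance, then bring it back
    systemctl stop ceph-osd@12
    systemctl start ceph-osd@12

    # wait until all PGs report active+clean again
    ceph -s

    # re-enable normal out-marking
    ceph osd unset noout
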
>>
>>
>>> ----- FEW SEC AFTER OSD START -----
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             140 pgs degraded
>>>             1 pgs recovering
>>>             92 pgs recovery_wait
>>>             140 pgs stuck unclean
>>>             recovery 942/5772119 objects degraded (0.016%)
>>>             noout,nobackfill,norecover flag(s) set
>>>      monmap e10: 3 mons at
>>>             {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>      osdmap e18727: 36 osds: 36 up, 30 in
>>>             flags noout,nobackfill,norecover
>>>       pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>>>             25204 GB used, 17124 GB / 42329 GB avail
>>>             942/5772119 objects degraded (0.016%)
>>>                 1332 active+clean
>>>                   92 active+recovery_wait+degraded
>>>                   47 active+degraded
>>>                    1 active+recovering+degraded
>>>   recovery io 31608 kB/s, 4 objects/s
>>>     client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>>
>>> ----- 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -----
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             140 pgs degraded
>>>             1 pgs recovering
>>>             109 pgs recovery_wait
>>>             140 pgs stuck unclean
>>>             80 requests are blocked > 32 sec
>>>             recovery 847/5775929 objects degraded (0.015%)
>>>             noout,nobackfill,norecover flag(s) set
>>>      monmap e10: 3 mons at
>>>             {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>      osdmap e18727: 36 osds: 36 up, 30 in
>>>             flags noout,nobackfill,norecover
>>>       pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>>>             25234 GB used, 17094 GB / 42329 GB avail
>>>             847/5775929 objects degraded (0.015%)
>>>                 1332 active+clean
>>>                  109 active+recovery_wait+degraded
>>>                   30 active+degraded            <---- degraded objects count got stuck
>>>                    1 active+recovering+degraded
>>>   recovery io 3743 kB/s, 0 objects/s            <---- depending on when the command is run,
>>>                                                       this line shows 0 objects/s or is absent
>>>     client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>>
>>> ----- FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVER, NOBACKFILL -----
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             134 pgs degraded
>>>             134 pgs recovery_wait
>>>             134 pgs stuck degraded
>>>             134 pgs stuck unclean
>>>             recovery 591/5778179 objects degraded (0.010%)
>>>      monmap e10: 3 mons at
>>>             {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>      osdmap e18730: 36 osds: 36 up, 30 in
>>>       pgmap v20851909: 1472 pgs, 7 pools, 8526 GB data, 1881 kobjects
>>>             25252 GB used, 17076 GB / 42329 GB avail
>>>             591/5778179 objects degraded (0.010%)
>>>                 1338 active+clean
>>>                  134 active+recovery_wait+degraded
>>>   recovery io 191 MB/s, 26 objects/s
>>>     client io 100654 kB/s rd, 184 MB/s wr, 6303 op/s
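
The third snapshot corresponds to clearing the flags and letting recovery
drain on its own; roughly, and assuming you want to follow progress from the
same admin node (the 5-second interval is just an example):

    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset noout

    # the degraded / recovery_wait counts should now shrink steadily
    watch -n 5 'ceph -s'
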
>>>
>>>
>>> 2018-03-29 18:22 GMT+02:00 Gregory Farnum <gfar...@redhat.com>:
>>> >
>>> > On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski <scoot...@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> A few days ago I had a very strange situation.
>>> >>
>>> >> I had to turn off a few OSDs for a while. So I set the flags noout,
>>> >> nobackfill, norecover and then turned off the selected OSDs.
>>> >> All was OK, but when I started these OSDs again, all VMs went down due
>>> >> to the recovery process (even though recovery priority was very low).
>>> >
>>> >
>>> > So you forbade the OSDs from doing any recovery work, but then you
>>> > turned on old ones that required recovery work to function properly?
>>> >
>>> > And your cluster stopped functioning?
>>> >
>>> >
>>> >>
>>> >> Here are the more important config values:
>>> >> "osd_recovery_threads": "1",
>>> >> "osd_recovery_thread_timeout": "30",
>>> >> "osd_recovery_thread_suicide_timeout": "300",
>>> >> "osd_recovery_delay_start": "0",
>>> >> "osd_recovery_max_active": "1",
>>> >> "osd_recovery_max_single_start": "5",
>>> >> "osd_recovery_max_chunk": "8388608",
>>> >> "osd_client_op_priority": "63",
>>> >> "osd_recovery_op_priority": "1",
>>> >> "osd_recovery_op_warn_multiple": "16",
>>> >> "osd_backfill_full_ratio": "0.85",
>>> >> "osd_backfill_retry_interval": "10",
>>> >> "osd_backfill_scan_min": "64",
>>> >> "osd_backfill_scan_max": "512",
>>> >> "osd_kill_backfill_at": "0",
>>> >> "osd_max_backfills": "1",
>>> >>
>>> >>
>>> >> I don't know why Ceph started the recovery process when the
>>> >> norecover & nobackfill flags were enabled, but the fact is that it
>>> >> killed all VMs.
>>> >
>>> >
>>> > Did it actually start recovering? Or did you just see client IO pause?
>>> > I confess I don't know what the behavior will be like with that
>>> > combined set of flags, but I rather suspect it did what you told it
>>> > to, and some PGs went down as a result.
>>> > -Greg
>>> >
>>> >
>>> >>
>>> >> Next, I turned off the noout, nobackfill, norecover flags and it
>>> >> started to look better. VMs went back online and the recovery process
>>> >> was still going. I didn't see a performance impact on the SSD disks,
>>> >> but there was a huge impact on the spinners.
>>> >> Normally %util is about 25%, but during recovery it was nearly 100%.
>>> >> CPU load increased on HDD-based VMs by ~400%.
>>> >>
>>> >> iostat fragment (during recovery):
>>> >> Device: rrqm/s wrqm/s    r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> >> sdh       0.30   1.00 150.90 36.00 13665.60 954.60   156.45    10.63 56.88   25.60  188.02  5.34 99.80
>>> >>
>>> >>
>>> >> Now I'm a little lost, and I don't know the answers to a few questions:
>>> >> 1. Why did Ceph start recovery even though the nobackfill & norecover
>>> >> flags were enabled?
>>> >> 2. Why did recovery cause a much bigger performance impact while the
>>> >> norecover & nobackfill flags were enabled?
>>> >> 3. Why, once norecover & nobackfill were turned off, did the cluster
>>> >> start to look better, yet %util on the HDD disks became so high (while
>>> >> recovery_op_priority=1 and client_op_priority=63)? 25% is normal, but
>>> >> it increased to 100% during recovery.
>>> >>
>>> >>
>>> >> Cluster information:
>>> >> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>>> >> 3x nodes (CPU E5-2630, 32GB RAM, 6x 2TB HDD with SSD journal, 3x 1TB SSD
>>> >> with NVMe journal), triple replication
>>> >>
>>> >>
>>> >> I would be very grateful if somebody could help me.
>>> >> Sorry if I've done something the wrong way - this is my first time
>>> >> writing on a mailing list.
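
As a side note on question 3: the throttling options quoted in the config
dump above can be checked and adjusted at runtime. A rough sketch, assuming
you run the daemon-socket query on the OSD's host and that osd.7 and the
values shown are only examples, not a guaranteed fix for the %util spike:

    # confirm what an individual OSD is actually running with
    ceph daemon osd.7 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

    # push a change to all OSDs without restarting them
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
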
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com