It looks like the OSDs in your cluster are not all the same size.
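For a quick check, something like the following should show whether per-OSD size, weight and utilization line up (rough sketch; the exact columns vary a bit between releases):

  # per-OSD size, weight, reweight and utilization, grouped by host
  ceph osd df tree

  # CRUSH weights per host/OSD
  ceph osd tree

OSDs with a noticeably higher %USE / VAR than the rest, or with a reweight below 1.0, would confirm the imbalance.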
Can you post the output of ceph osd df?

At 2017-07-28 17:24:29, "Nikola Ciprich" <nikola.cipr...@linuxbox.cz> wrote:
>I forgot to add that the OSD daemons really seem to be idle, no disk
>activity, no CPU usage.. it just looks to me like some kind of
>deadlock, as if they were waiting for each other..
>
>and so I've been trying to get the last 1.5% of misplaced / degraded PGs
>for almost a week..
>
>
>On Fri, Jul 28, 2017 at 10:56:02AM +0200, Nikola Ciprich wrote:
>> Hi,
>>
>> I'm trying to find the reason for strange recovery issues I'm seeing on
>> our cluster..
>>
>> it's a mostly idle 4-node cluster with 26 OSDs evenly distributed
>> across the nodes, running jewel 10.2.9.
>>
>> the problem is that after some disk replacements and data moves, recovery
>> is progressing extremely slowly.. PGs seem to be stuck in the
>> active+recovering+degraded state:
>>
>> [root@v1d ~]# ceph -s
>>     cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33
>>      health HEALTH_WARN
>>             159 pgs backfill_wait
>>             4 pgs backfilling
>>             259 pgs degraded
>>             12 pgs recovering
>>             113 pgs recovery_wait
>>             215 pgs stuck degraded
>>             266 pgs stuck unclean
>>             140 pgs stuck undersized
>>             151 pgs undersized
>>             recovery 37788/2327775 objects degraded (1.623%)
>>             recovery 23854/2327775 objects misplaced (1.025%)
>>             noout,noin flag(s) set
>>      monmap e21: 3 mons at {v1a=10.0.0.1:6789/0,v1b=10.0.0.2:6789/0,v1c=10.0.0.3:6789/0}
>>             election epoch 6160, quorum 0,1,2 v1a,v1b,v1c
>>       fsmap e817: 1/1/1 up {0=v1a=up:active}, 1 up:standby
>>      osdmap e76002: 26 osds: 26 up, 26 in; 185 remapped pgs
>>             flags noout,noin,sortbitwise,require_jewel_osds
>>       pgmap v80995844: 3200 pgs, 4 pools, 2876 GB data, 757 kobjects
>>             9215 GB used, 35572 GB / 45365 GB avail
>>             37788/2327775 objects degraded (1.623%)
>>             23854/2327775 objects misplaced (1.025%)
>>                 2912 active+clean
>>                  130 active+undersized+degraded+remapped+wait_backfill
>>                   97 active+recovery_wait+degraded
>>                   29 active+remapped+wait_backfill
>>                   12 active+recovery_wait+undersized+degraded+remapped
>>                    6 active+recovering+degraded
>>                    5 active+recovering+undersized+degraded+remapped
>>                    4 active+undersized+degraded+remapped+backfilling
>>                    4 active+recovery_wait+degraded+remapped
>>                    1 active+recovering+degraded+remapped
>>   client io 2026 B/s rd, 146 kB/s wr, 9 op/s rd, 21 op/s wr
>>
>>
>> when I restart the affected OSDs, it bumps the recovery, but then other
>> PGs get stuck.. All OSDs have been restarted multiple times, none are even
>> close to nearfull, and I just can't find what I'm doing wrong..
>>
>> possibly related OSD options:
>>
>> osd max backfills = 4
>> osd recovery max active = 15
>> debug osd = 0/0
>> osd op threads = 4
>> osd backfill scan min = 4
>> osd backfill scan max = 16
>>
>> Any hints would be greatly appreciated
>>
>> thanks
>>
>> nik
>>
>>
>> --
>> -------------------------------------
>> Ing. Nikola CIPRICH
>> LinuxBox.cz, s.r.o.
>> 28.rijna 168, 709 00 Ostrava
>>
>> tel.:   +420 591 166 214
>> fax:    +420 596 621 273
>> mobil:  +420 777 093 799
>> www.linuxbox.cz
>>
>> mobil servis: +420 737 238 656
>> email servis: ser...@linuxbox.cz
>> -------------------------------------
>
>--
>-------------------------------------
>Ing. Nikola CIPRICH
>LinuxBox.cz, s.r.o.
>28.rijna 168, 709 00 Ostrava
>
>tel.:   +420 591 166 214
>fax:    +420 596 621 273
>mobil:  +420 777 093 799
>www.linuxbox.cz
>
>mobil servis: +420 737 238 656
>email servis: ser...@linuxbox.cz
>-------------------------------------
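In the meantime, a few things that might help narrow it down (a rough sketch; osd.0 and <pgid> below are just placeholders, adjust to your cluster):

  # confirm the values one OSD is actually running with (run on the node hosting osd.0)
  ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'

  # adjust recovery/backfill limits on all OSDs at runtime, without restarts
  ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 15'

  # list stuck PGs, then query one of them to see which OSDs it is waiting for
  ceph pg dump_stuck unclean
  ceph pg <pgid> query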
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com