Hi,

Have you tried querying the state of some of the stuck or undersized pgs? Maybe some OSD daemons are not healthy and are blocking the recovery.
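When reading the query output, the `state` and `recovery_state` fields are the useful parts. A minimal sketch of pulling them out of the JSON, using an abbreviated, hypothetical excerpt of what `ceph pg 3.be query --format json` might return (a real cluster returns many more fields):

```python
import json

# Abbreviated, hypothetical excerpt of `ceph pg <pgid> query` JSON output;
# the field names below exist in real output, but the values are made up.
sample = '''
{
  "state": "active+remapped+wait_backfill",
  "up": [0, 4],
  "acting": [0, 4],
  "recovery_state": [
    {"name": "Started/Primary/Active",
     "enter_time": "2017-11-12 10:00:00.000000",
     "comment": "waiting for backfill reservation"}
  ]
}
'''

pg = json.loads(sample)
print("state:", pg["state"])
for step in pg["recovery_state"]:
    # The recovery_state entries often say why the pg is waiting
    print(step["name"], "-", step.get("comment", ""))
```

If the `recovery_state` comments mention waiting for reservations or a down/slow OSD, that points at the daemon blocking the backfill.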
ceph pg 3.be query
ceph pg 4.d4 query
ceph pg 4.8c query

http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/

Cordialement / Best regards,

Sébastien VIGNERON
CRIANN,
Ingénieur / Engineer
Technopôle du Madrillet
745, avenue de l'Université
76800 Saint-Etienne du Rouvray - France
tél. +33 2 32 91 42 91
fax. +33 2 32 91 42 92
http://www.criann.fr
mailto:sebastien.vigne...@criann.fr
support: supp...@criann.fr

> Le 12 nov. 2017 à 10:59, gjprabu <gjpr...@zohocorp.com> a écrit :
>
> Hi Sebastien
>
> Thanks for your reply. Yes, there are undersized pgs and a recovery in
> process, because we added a new OSD after getting the "2 OSDs near full"
> warning. Yes, the newly added OSD is rebalancing the data.
>
> [root@intcfs-osd6 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>  0 3.29749  1.00000 3376G 2875G  501G 85.15 1.26 165
>  1 3.26869  1.00000 3347G 1923G 1423G 57.46 0.85 152
>  2 3.27339  1.00000 3351G 1980G 1371G 59.08 0.88 161
>  3 3.24089  1.00000 3318G 2130G 1187G 64.21 0.95 168
>  4 3.24089  1.00000 3318G 2997G  320G 90.34 1.34 176
>  5 3.32669  1.00000 3406G 2466G  939G 72.42 1.07 165
>  6 3.27800  1.00000 3356G 1463G 1893G 43.60 0.65 166
>
> ceph osd crush rule dump
>
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_ruleset",
>         "ruleset": 0,
>         "type": 1,
>         "min_size": 1,
>         "max_size": 10,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> ceph version 10.2.2 and ceph version 10.2.9
>
> ceph osd pool ls detail
>
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
> pool 3 'downloads_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 39 flags hashpspool crash_replay_interval 45 stripe_width 0
> pool 4 'downloads_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 36 flags hashpspool stripe_width 0
>
> ---- On Sun, 12 Nov 2017 15:04:02 +0530 Sébastien VIGNERON <sebastien.vigne...@criann.fr> wrote ----
>
> Hi,
>
> Can you share:
> - your placement rules: ceph osd crush rule dump
> - your CEPH version: ceph versions
> - your pools definitions: ceph osd pool ls detail
>
> With these we can determine if your pgs are stuck because of a
> misconfiguration or something else.
>
> You seem to have some undersized pgs and a recovery in process. Do your
> OSDs show some rebalancing of your data? Do your OSDs' use percentages
> change over time? (changes in "ceph osd df")
>
> Cordialement / Best regards,
> Sébastien VIGNERON
>
> Le 12 nov. 2017 à 10:04, gjprabu <gjpr...@zohocorp.com> a écrit :
>
> Hi Team,
>
> We have a ceph setup with 6 OSDs and we got an alert that 2 OSDs are
> near full. We faced issues like slow access to ceph from clients, so I
> added a 7th OSD, and 2 OSDs are still showing near full (osd.0 and
> osd.4). I have restarted the ceph service on osd.0 and osd.4. Kindly
> check the ceph osd status below and please provide us with solutions.
> # ceph health detail
> HEALTH_WARN 46 pgs backfill_wait; 1 pgs backfilling; 32 pgs degraded; 50 pgs stuck unclean; 32 pgs undersized; recovery 1098780/40253637 objects degraded (2.730%); recovery 3401433/40253637 objects misplaced (8.450%); 2 near full osd(s); mds0: Client integ-hm3 failing to respond to cache pressure; mds0: Client integ-hm8 failing to respond to cache pressure; mds0: Client integ-hm2 failing to respond to cache pressure; mds0: Client integ-hm9 failing to respond to cache pressure; mds0: Client integ-hm5 failing to respond to cache pressure; mds0: Client integ-hm9-bkp failing to respond to cache pressure; mds0: Client me-build1-bkp failing to respond to cache pressure
>
> pg 3.f6 is stuck unclean for 511223.069161, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.f6 is stuck unclean for 511232.770419, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.ec is stuck unclean for 510902.815668, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.eb is stuck unclean for 511285.576487, current state active+remapped+wait_backfill, last acting [3,0]
> pg 4.17 is stuck unclean for 511235.326709, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.2f is stuck unclean for 511232.356371, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.3d is stuck unclean for 511300.446982, current state active+remapped, last acting [3,0]
> pg 4.93 is stuck unclean for 511295.539229, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 3.47 is stuck unclean for 511288.104965, current state active+remapped+wait_backfill, last acting [3,0]
> pg 4.d5 is stuck unclean for 510916.509825, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.31 is stuck unclean for 511221.542878, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.62 is stuck unclean for 511221.551662, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 4.4d is stuck unclean for 511232.279602, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.48 is stuck unclean for 510911.095367, current state active+remapped+wait_backfill, last acting [5,4]
> pg 3.4f is stuck unclean for 511226.712285, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.78 is stuck unclean for 511221.531199, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.24 is stuck unclean for 510903.483324, current state active+remapped+backfilling, last acting [1,2]
> pg 4.8c is stuck unclean for 511231.668693, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.b4 is stuck unclean for 511222.612012, current state active+undersized+degraded+remapped+wait_backfill, last acting [0]
> pg 4.41 is stuck unclean for 511287.031264, current state active+remapped+wait_backfill, last acting [3,2]
> pg 3.d1 is stuck unclean for 510903.797329, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.7f is stuck unclean for 511222.929722, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.af is stuck unclean for 511262.494659, current state active+undersized+degraded+remapped, last acting [0]
> pg 3.66 is stuck unclean for 510903.296711, current state active+remapped+wait_backfill, last acting [3,0]
> pg 3.76 is stuck unclean for 511224.615144, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.57 is stuck unclean for 511234.514343, current state active+remapped, last acting [0,4]
> pg 3.69 is stuck unclean for 511224.672085, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 3.9a is stuck unclean for 510967.300000, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.50 is stuck unclean for 510903.825565, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.53 is stuck unclean for 510921.975268, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.e7 is stuck unclean for 511221.530592, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.6a is stuck unclean for 510911.284877, current state active+undersized+degraded+remapped+wait_backfill, last acting [0]
> pg 4.16 is stuck unclean for 511232.702762, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.2c is stuck unclean for 511222.443893, current state active+remapped+wait_backfill, last acting [2,3]
> pg 4.89 is stuck unclean for 511228.846614, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 4.39 is stuck unclean for 511239.544231, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.ce is stuck unclean for 511232.294586, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.91 is stuck unclean for 511232.341380, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.96 is stuck unclean for 510904.043900, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.c0 is stuck unclean for 510904.253281, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.9c is stuck unclean for 511237.612850, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.ab is stuck unclean for 510960.756324, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.aa is stuck unclean for 511229.307559, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.ad is stuck unclean for 510903.764157, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.b5 is stuck unclean for 511226.560774, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.58 is stuck unclean for 510919.273667, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.b9 is stuck unclean for 511232.760066, current state active+remapped+wait_backfill, last acting [5,4]
> pg 3.be is stuck unclean for 511224.422931, current state active+remapped+wait_backfill, last acting [0,4]
> pg 4.d4 is stuck unclean for 510962.810416, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.da is stuck unclean for 511259.506962, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.8c is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.7f is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.78 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.76 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.6a is active+undersized+degraded+remapped+wait_backfill, acting [0]
> pg 3.69 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 3.66 is active+remapped+wait_backfill, acting [3,0]
> pg 3.62 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 4.58 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.50 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.53 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.4f is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.48 is active+remapped+wait_backfill, acting [5,4]
> pg 4.4d is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.47 is active+remapped+wait_backfill, acting [3,0]
> pg 4.41 is active+remapped+wait_backfill, acting [3,2]
> pg 3.31 is active+remapped+wait_backfill, acting [0,3]
> pg 4.2f is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.24 is active+remapped+backfilling, acting [1,2]
> pg 4.17 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.16 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.2c is active+remapped+wait_backfill, acting [2,3]
> pg 4.39 is active+remapped+wait_backfill, acting [3,2]
> pg 4.89 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 3.91 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.93 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 3.96 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.9a is active+remapped+wait_backfill, acting [3,2]
> pg 4.9c is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.af is active+undersized+degraded+remapped, acting [0]
> pg 3.ab is active+remapped+wait_backfill, acting [3,2]
> pg 4.aa is active+remapped+wait_backfill, acting [0,3]
> pg 3.ad is active+remapped+wait_backfill, acting [0,3]
> pg 3.b4 is active+undersized+degraded+remapped+wait_backfill, acting [0]
> pg 3.b5 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.b9 is active+remapped+wait_backfill, acting [5,4]
> pg 3.be is active+remapped+wait_backfill, acting [0,4]
> pg 4.c0 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.ce is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.d1 is active+remapped+wait_backfill, acting [0,3]
> pg 4.d5 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.d4 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.da is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.e7 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.eb is active+remapped+wait_backfill, acting [3,0]
> pg 3.ec is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.f6 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.f6 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> recovery 1098780/40253637 objects degraded (2.730%)
> recovery 3401433/40253637 objects misplaced (8.450%)
> osd.0 is near full at 85%
> osd.4 is near full at 90%
> mds0: Client integ-hm3 failing to respond to cache pressure (client_id: 733998)
> mds0: Client integ-hm8 failing to respond to cache pressure (client_id: 843866)
> mds0: Client integ-hm2 failing to respond to cache pressure (client_id: 844939)
> mds0: Client integ-hm9 failing to respond to cache pressure (client_id: 845065)
> mds0: Client integ-hm5 failing to respond to cache pressure (client_id: 845068)
> mds0: Client integ-hm9-bkp failing to respond to cache pressure (client_id: 895898)
> mds0: Client me-build1-bkp failing to respond to cache pressure (client_id: 888666)
>
> hm ~]# ceph osd tree
> ID WEIGHT   TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 22.92604 root default
> -2  3.29749     host intcfs-osd1
>  0  3.29749         osd.0             up 1.00000          1.00000
> -3  3.26869     host intcfs-osd2
>  1  3.26869         osd.1             up 1.00000          1.00000
> -4  3.27339     host intcfs-osd3
>  2  3.27339         osd.2             up 1.00000          1.00000
> -5  3.24089     host intcfs-osd4
>  3  3.24089         osd.3             up 1.00000          1.00000
> -6  3.24089     host intcfs-osd5
>  4  3.24089         osd.4             up 1.00000          1.00000
> -7  3.32669     host intcfs-osd6
>  5  3.32669         osd.5             up 1.00000          1.00000
> -8  3.27800     host intcfs-osd7
>  6  3.27800         osd.6             up 1.00000          1.00000
>
> hm5 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
>  0 3.29749  1.00000  3376G  2874G  502G 85.13 1.26 165
>  1 3.26869  1.00000  3347G  1922G 1424G 57.44 0.85 152
>  2 3.27339  1.00000  3351G  2009G 1342G 59.95 0.89 162
>  3 3.24089  1.00000  3318G  2130G 1188G 64.19 0.95 168
>  4 3.24089  1.00000  3318G  2996G  321G 90.30 1.34 176
>  5 3.32669  1.00000  3406G  2465G  940G 72.39 1.07 165
>  6 3.27800  1.00000  3356G  1435G 1921G 42.76 0.63 166
>             TOTAL   23476G 15834G 7641G 67.45
> MIN/MAX VAR: 0.63/1.34 STDDEV: 15.29
>
> Regards
> Prabu GJ
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
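As a side note, the "near full" warnings and the recovery percentages in the health output can be recomputed directly from the figures above. A minimal sketch, assuming the Jewel default mon_osd_nearfull_ratio of 0.85 (not confirmed for this cluster):

```python
# Assumption: default Jewel nearfull threshold; the cluster may override it.
NEARFULL = 0.85

# (osd id, SIZE in GB, USE in GB) taken from the second `ceph osd df` listing
osds = [(0, 3376, 2874), (1, 3347, 1922), (2, 3351, 2009),
        (3, 3318, 2130), (4, 3318, 2996), (5, 3406, 2465), (6, 3356, 1435)]

# An OSD is flagged near full once USE/SIZE crosses the threshold
nearfull = [i for i, size, use in osds if use / size >= NEARFULL]
print("near full:", nearfull)

# Degraded/misplaced percentages are just object counts over total copies
degraded = 1098780 / 40253637 * 100
misplaced = 3401433 / 40253637 * 100
print(f"degraded {degraded:.3f}%, misplaced {misplaced:.3f}%")
```

This reproduces the HEALTH_WARN numbers: osd.0 (85.13%) and osd.4 (90.30%) are over the threshold, and the percentages match the 2.730% degraded / 8.450% misplaced lines, which is why adding capacity (or reweighting the full OSDs) is what clears the warning rather than restarting the daemons.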