Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread linghucongsong
1. You have a size-3 pool; I do not know why you set min_size to 1. That is too dangerous. 2. You had better use the same disk sizes and the same number of OSDs on each host for CRUSH. For now you can try the ceph osd reweight-by-utilization command, when there are no users on your cluster. And now I will go home. At 2017-07-28 17
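A minimal sketch of the commands being suggested here, assuming the pool name 'vm' from later in the thread; 120 is reweight-by-utilization's default overload threshold (percent of average utilization), shown only for illustration:

    # Require two live replicas for writes instead of one:
    ceph osd pool set vm min_size 2

    # Dry-run first: report which OSDs would be reweighted, changing nothing:
    ceph osd test-reweight-by-utilization 120

    # Apply the reweight while the cluster is quiet:
    ceph osd reweight-by-utilization 120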

Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread Nikola Ciprich
On Fri, Jul 28, 2017 at 05:52:29PM +0800, linghucongsong wrote:
> You have two crush rules? One is ssd and the other is hdd?

yes, exactly..

> Can you show ceph osd dump|grep pool

pool 3 'vm' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last
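A sketch of how the pool-to-ruleset mapping can be inspected on jewel (the pool name 'vm' comes from the dump above; rule names and ids will differ per cluster):

    # Replication settings per pool:
    ceph osd dump | grep pool

    # Which ruleset id a given pool uses (jewel still calls it crush_ruleset):
    ceph osd pool get vm crush_ruleset

    # List and dump the crush rules those ids refer to:
    ceph osd crush rule ls
    ceph osd crush rule dump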

Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread linghucongsong
You have two crush rules? One is ssd and the other is hdd? Can you show:

ceph osd dump|grep pool
ceph osd crush dump

At 2017-07-28 17:47:48, "Nikola Ciprich" wrote:
> On Fri, Jul 28, 2017 at 05:43:14PM +0800, linghucongsong wrote:
>> It looks like the OSDs in your cluster are not all the same size.

Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread Nikola Ciprich
On Fri, Jul 28, 2017 at 05:43:14PM +0800, linghucongsong wrote:
> It looks like the OSDs in your cluster are not all the same size.
> Can you show ceph osd df output?

you're right, they're not.. here's the output:

[root@v1b ~]# ceph osd df tree
ID WEIGHT REWEIGHT SIZE USE AVAIL %U
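On jewel, the 'ceph osd df tree' header continues past the truncation as %USE VAR TYPE NAME; VAR is each OSD's utilization relative to the cluster average, and values far from 1.00 on the smaller disks are exactly what reweight-by-utilization targets. A rough one-liner for spotting the fullest OSDs (this assumes %USE stays in column 7 of plain 'ceph osd df'; adjust if your output differs):

    # Five most-utilized OSDs by %USE:
    ceph osd df | sort -n -k7 | tail -5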

Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread linghucongsong
It looks like the OSDs in your cluster are not all the same size. Can you show ceph osd df output?

At 2017-07-28 17:24:29, "Nikola Ciprich" wrote:
> I forgot to add that the OSD daemons really seem to be idle: no disk
> activity, no CPU usage.. it just looks to me like some kind of
> deadlock, as if they were waiting for each other..

Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread Nikola Ciprich
I forgot to add that the OSD daemons really seem to be idle: no disk activity, no CPU usage.. it just looks to me like some kind of deadlock, as if they were waiting for each other.. and so I've been trying to recover the last 1.5% of misplaced / degraded PGs for almost a week.. On Fri, Jul 28, 2017 at 10:56:02AM +
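A sketch of the usual commands for narrowing down where recovery like this is stuck (the PG id 3.1f is a placeholder; take a real one from dump_stuck, and run the daemon commands on the node hosting that OSD):

    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg 3.1f query

    # Recovery throttles that can make an idle cluster look deadlocked:
    ceph daemon osd.0 config get osd_max_backfills
    ceph daemon osd.0 config get osd_recovery_max_active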

[ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)

2017-07-28 Thread Nikola Ciprich
Hi, I'm trying to find the reason for strange recovery issues I'm seeing on our cluster.. it's a mostly idle, 4-node cluster with 26 OSDs evenly distributed across the nodes, running jewel 10.2.9. The problem is that after some disk replacements and data moves, recovery is progressing extremely slowly.. PGs seem to be
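If a stall like this turns out to be throttling rather than a deadlock, the recovery knobs can be raised at runtime without restarting OSDs; a sketch, with illustrative values rather than recommendations:

    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'

    # then watch whether the misplaced/degraded counts start moving again:
    ceph -s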