Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
Hello, We have checked all the drives, and there is no problem with them. If there were a failing drive, then I think the slow requests would also appear in the normal traffic, as the ceph cluster is using all the OSDs as primaries for some PGs. But these slow requests are appearing

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
Don't discount failing drives. You can have drives in a "ready-to-fail" state that doesn't show up in SMART or anywhere else easy to track. When backfilling, the drive is using sectors it may not normally use. I managed a 1400 osd cluster that would lose 1-3 drives in random nodes when I added new stora

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
Hi David, Well, most probably the larger part of our PGs will have to be reorganized, as we are moving from 9 hosts to 3 chassis. But I was hoping to be able to throttle the backfilling to an extent where it has minimal impact on our user traffic. Unfortunately I wasn't able to do it. I saw th
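Throttling backfill in a Hammer cluster is typically done by injecting the relevant OSD options at runtime. A minimal sketch of the knobs this thread is discussing (the values shown are illustrative, not a recommendation for any particular cluster):

```shell
# Lower backfill/recovery concurrency at runtime
# (illustrative values; tune for your cluster and workload).
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'

# De-prioritize recovery ops relative to client I/O.
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
```

injectargs changes only the running daemons; to make the values survive a restart they also need to be set in ceph.conf.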

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
It is normal to have backfilling because the crush map changed. The host and the chassis each have crush IDs and their own weight, which is the sum of the OSDs under them. By moving the host into the chassis you changed the weight of the chassis, and that affects the PG placement even though you
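The reweighting described above follows from how buckets nest in the crush hierarchy. A hedged sketch of the kind of commands involved in moving hosts under chassis buckets (chassis01 and host01 are placeholder names, not from this thread):

```shell
# Create a chassis bucket and place it under the default root
# (chassis01/host01 are placeholder names).
ceph osd crush add-bucket chassis01 chassis
ceph osd crush move chassis01 root=default

# Move a host under the chassis; the chassis weight becomes
# the sum of the hosts (and thus OSDs) beneath it.
ceph osd crush move host01 chassis=chassis01

# Inspect the resulting hierarchy and weights.
ceph osd tree
```

Each move changes bucket weights, so CRUSH recomputes placement and backfill starts as soon as the map change is committed.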

Re: [ceph-users] Changing the failure domain

2017-08-31 Thread David Turner
How long are you seeing these blocked requests for? Initially or perpetually? Changing the failure domain causes all PGs to peer at the same time. This would be the cause if it happens really quickly. There is no way to avoid all of them peering while making a change like this. After that, it
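One common way to separate the unavoidable peering burst from the data movement that follows is to pause backfill and recovery while the crush rule is changed, then re-enable them once peering has settled. A sketch using the cluster flags (verify flag availability on your Ceph version):

```shell
# Pause data movement so only peering happens when the rule changes.
ceph osd set nobackfill
ceph osd set norecover

# ... apply the crush rule / failure domain change here ...

# Let backfill proceed once 'ceph -s' shows peering has settled.
ceph osd unset nobackfill
ceph osd unset norecover
```

This does not prevent the peering-related slow requests, but it lets the operator choose when the backfill load starts.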

[ceph-users] Changing the failure domain

2017-08-31 Thread Laszlo Budai
Dear all! In our Hammer cluster we are planning to switch our failure domain from host to chassis. We have performed some simulations, and regardless of the settings we have used, some slow requests have appeared all the time. We had the following settings: "osd_max_backfills": "1", "
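For reference, throttling settings like the one quoted above would normally live in the [osd] section of ceph.conf so they persist across daemon restarts. A minimal fragment (values illustrative, matching the discussion in this thread):

```ini
[osd]
; Throttle backfill/recovery to limit impact on client traffic
; (illustrative values only).
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```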