An update on the subject. Fair warning: this is a lengthy post, but it contains
reproducible results and a workaround to get performance back to the expected level.

One of the servers had a broken disk controller that caused performance issues on
that host: FIO showed roughly half the performance on some of its disks compared to
the other hosts. The controller has been replaced, but rados performance did not improve.

- write: 313.86 MB/s / 0.203907 s

I figured it would be wise to test all servers and disks to see if they deliver 
the expected performance.
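
The per-disk FIO test is the 4k sync write test from the Proxmox benchmark paper,
roughly this invocation (the device name is a placeholder and the exact flags are
reconstructed from memory, so treat it as a sketch):

# 4k direct sync writes against the raw device (destroys data on it!)
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio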

Since I have data on the cluster that I wanted to keep online, I went one server at
a time: delete its 3 OSDs, FIO-test the disks, recreate the OSDs, add them back to
the cluster, wait for the cluster to become healthy again, then move on to the next
server. All disks now match the expected performance, so with all servers and OSDs
back in I redid the rados benchmark.

Performance was almost twice as good as before recreating all OSDs and on par 
with what I had expected for bandwidth and latency.

- write: 586.679 MB/s / 0.109085 s
- read: 2092.27 MB/s / 0.0292913 s
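
The rados numbers in this post are plain rados bench runs, roughly these
invocations (pool name and duration are just what I use for testing):

# 4M object writes with the default 16 concurrent ops, keep the data for the read test
rados bench -p bench 60 write --no-cleanup
# sequential reads of the objects written above
rados bench -p bench 60 seq
# remove the benchmark objects afterwards
rados -p bench cleanup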

As the controller was not to blame, I wanted to test whether having different-sized
OSDs with 'correct' weights assigned was causing the issue. I removed one OSD on
each storage node (one node at a time), re-partitioned it and added it back to the
cluster with the correct weight. Performance was still OK, though a little slower
than before.
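
The 'correct' weight here is simply the partition size in TiB, set with crush
reweight; for example (OSD id and size below are made up for illustration):

# a 960 GB partition is roughly 0.87 TiB
ceph osd crush reweight osd.7 0.87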

Figuring this wasn't the cause either, I took the partitioned OSDs out again, wiped
the disks and recreated the OSDs. Performance was now even lower, almost as low as
right after I swapped the controller.

Since I knew the performance could be better, I decided to recreate all OSDs, one
server at a time, and performance was once again good.
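
Recreating an OSD here just means wiping the disk and creating a fresh OSD on it,
roughly like this (device name is a placeholder; whether this is done with
ceph-volume, ceph-disk or pveceph should not matter):

# clear any leftover LVM / partition data from the previous OSD
ceph-volume lvm zap /dev/sdX
# create a new OSD on the now empty disk
ceph-volume lvm create --data /dev/sdX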

Now that I was able to reproduce the issue, I started over once more and documented
all the steps to see if there is any logic to it.

With the cluster performing well, I started removing one OSD at a time: wait for
the cluster to become healthy again, benchmark, add the OSD back, benchmark again,
and move on to the next server.
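
Between every step I simply waited for HEALTH_OK before benchmarking again, with
something along these lines (a sketch, the exact check is not important):

# block until the cluster reports HEALTH_OK again
while ! ceph health | grep -q HEALTH_OK; do
    echo "waiting for the cluster to become healthy.."
    sleep 30
done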

These are the results of each step (bandwidth in MB/s, average latency in seconds):

One OSD out:

write: 528.439 / 0.121021
read: 2022.19 / 0.03031

OSD back in again:

write: 584.14 / 0.10956
read: 2108.06 / 0.0289867

Next server, one OSD out:

write: 482.923 / 0.132512
read: 2008.38 / 0.0305356

OSD back in again:

write: 578.034 / 0.110686
read: 2059.24 / 0.0297554

Next server, one OSD out:

write: 470.384 / 0.136055
read: 2043.68 / 0.0299759

OSD back in again:

write: 424.01 / 0.150886
read: 2086.94 / 0.0293182

Write performance is now significantly lower than when I started. When I first
wrote to the mailing list, performance seemed to go up once Ceph entered the
'near-full' state, so I decided to test that again.

I reached the full state by accident, and the last two write tests showed somewhat
better performance, though not near the level I started with.

write: 468.632 / 0.136559
write: 488.523 / 0.130999

I removed the benchmark pool, recreated it (commands below) and ran a few more
tests; performance now seems even lower again, close to the results I started off
with.

write: 449.034 / 0.142524
write: 399.532 / 0.160168
write: 366.831 / 0.174451
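
Recreating the benchmark pool was nothing more than the usual delete and create
(adjust pool name and PG count as appropriate):

# pool deletion needs mon_allow_pool_delete=true
ceph osd pool delete bench bench --yes-i-really-really-mean-it
ceph osd pool create bench 512 512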

I know how to get the performance back to the expected level (recreate all OSDs and
shuffle data around the cluster), but I don't think this should be necessary in the
first place.

Just to clarify: when removing an OSD I reweight it to 0 and wait until it is safe
to destroy before deleting it; I assume this is the correct way of doing such things.

Am I doing something wrong? Did I run into some sort of bug?

I'm running Proxmox VE 5.2, which includes ceph version 12.2.7
(94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable).

Thanks,
Menno


My script to safely remove an OSD:

#!/bin/bash
# $1 = numeric OSD id

# drain the OSD and wait until all PGs have moved off and it is safe to destroy
ceph osd crush reweight osd.$1 0.0
while ! ceph osd safe-to-destroy $1; do
    echo "not safe to destroy, waiting.."
    sleep 10
done
sleep 5

# stop the daemon and remove the OSD from the CRUSH map, auth database and OSD map
ceph osd out $1
systemctl disable ceph-osd@$1
systemctl stop ceph-osd@$1
ceph osd crush remove osd.$1
ceph auth del osd.$1
ceph osd down $1
ceph osd rm $1
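
It is called with the numeric OSD id, for example (assuming it is saved as
remove-osd.sh):

./remove-osd.sh 12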

-----Original message-----
> From:Menno Zonneveld <me...@1afa.com>
> Sent: Monday 10th September 2018 11:45
> To: Alwin Antreich <a.antre...@proxmox.com>; ceph-users 
> <ceph-users@lists.ceph.com>
> Cc: Marc Roos <m.r...@f1-outsourcing.eu>
> Subject: RE: [ceph-users] Rados performance inconsistencies, lower than 
> expected performance
> 
> 
> -----Original message-----
> > From:Alwin Antreich <a.antre...@proxmox.com>
> > Sent: Thursday 6th September 2018 18:36
> > To: ceph-users <ceph-users@lists.ceph.com>
> > Cc: Menno Zonneveld <me...@1afa.com>; Marc Roos <m.r...@f1-outsourcing.eu>
> > Subject: Re: [ceph-users] Rados performance inconsistencies, lower than
> > expected performance
> > 
> > On Thu, Sep 06, 2018 at 05:15:26PM +0200, Marc Roos wrote:
> > > 
> > > It is idle, testing still, running a backup's at night on it.
> > > How do you fill up the cluster so you can test between empty and full?
> 
> > > Do you have a "ceph df" from empty and full? 
> > > 
> > > I have done another test disabling new scrubs on the rbd.ssd pool (but
> > > still 3 on hdd) with:
> > > ceph tell osd.* injectargs --osd_max_backfills=0
> > > Again getting slower towards the end.
> > > Bandwidth (MB/sec):     395.749
> > > Average Latency(s):     0.161713
> > In the results you both had, the latency is twice as high as in our
> > tests [1]. That can already make quite some difference. Depending on the
> > actual hardware used, there may or may not be the possibility for good
> > optimisation.
> > 
> > As a start, you could test the disks with fio, as shown in our benchmark
> > paper, to get some results for comparison. The forum thread [1] has
> > some benchmarks from other users for comparison.
> > 
> > [1] 
> > https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
> 
> Thanks for the suggestion, I redid the fio test and one server seem to be
> causing trouble.
> 
> When I initially tested our SSDs according to the benchmark paper, our Intel
> SSDs performed more or less on par with the Samsung SSDs used there.
> 
> from fio.log
> 
> fio: (groupid=0, jobs=1): err= 0: pid=3606315: Mon Sep 10 11:12:36 2018
>   write: io=4005.9MB, bw=68366KB/s, iops=17091, runt= 60001msec
>     slat (usec): min=5, max=252, avg= 5.76, stdev= 0.66
>     clat (usec): min=6, max=949, avg=51.72, stdev= 9.54
>      lat (usec): min=54, max=955, avg=57.48, stdev= 9.56
> 
> However, one of the other machines (with identical SSDs) now performs poorly
> compared to the others, with these results:
> 
> fio: (groupid=0, jobs=1): err= 0: pid=3893600: Mon Sep 10 11:15:17 2018
>   write: io=1258.8MB, bw=51801KB/s, iops=12950, runt= 24883msec
>     slat (usec): min=5, max=259, avg= 6.17, stdev= 0.78
>     clat (usec): min=53, max=857, avg=69.77, stdev=13.11
>      lat (usec): min=70, max=863, avg=75.93, stdev=13.17
> 
> I'll first resolve the slower machine before doing more testing as this surely
> won't help overall performance.
> 
> 
> > --
> > Cheers,
> > Alwin
> 
> Thanks!,
> Menno
