On 11/05/2015 11:03 PM, Michal Kozanecki wrote:
> Why did you guys go with partitioning the SSD for the Ceph journals,
> instead of just using the whole SSD for bcache and leaving the journal on
> the filesystem (which itself is on top of bcache)? Was there really a
> benefit to separating the journals from the bcache-fronted HDDs?
>
> I ask because it has been shown in the past that separating the journal
> on SSD-based pools doesn't really do much.
>

Well, the I/O for the journal bypasses bcache completely in this case. The
less code the I/O travels through, the better, we figured. We didn't try
with the journal on bcache. This works for us, so we didn't bother testing
anything different.
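For reference, this is roughly how such a layout is put together. The
device names, partition numbers and mount point below are only an
illustration, not our exact deployment commands:

    # carve the S3700 into per-OSD journal partitions plus one cache partition
    sgdisk -n 1:0:+5G   -c 1:"journal-0" /dev/sdg
    sgdisk -n 2:0:+5G   -c 2:"journal-1" /dev/sdg
    # ... one 5GB partition per OSD ...
    sgdisk -n 7:0:+200G -c 7:"bcache"    /dev/sdg

    # set up bcache: the big SSD partition caches the spinning disk
    make-bcache -C /dev/sdg7    # cache device, note the cache set UUID it prints
    make-bcache -B /dev/sda     # backing device, shows up as /dev/bcache0
    # (udev normally registers both; otherwise echo them into /sys/fs/bcache/register)
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # OSD data lives on the bcache device, the journal on the raw SSD partition
    mkfs.xfs /dev/bcache0
    mount /dev/bcache0 /var/lib/ceph/osd/ceph-60
    ln -s /dev/disk/by-partlabel/journal-0 /var/lib/ceph/osd/ceph-60/journal

So the journal writes go straight to the raw SSD partition and only the
filestore I/O goes through bcache.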
Wido

> Michal Kozanecki | Linux Administrator | mkozane...@evertz.com
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: October-28-15 5:49 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph OSDs with bcache experience
>
> On 21-10-15 15:30, Mark Nelson wrote:
>> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>>>>> Hi,
>>>>>
>>>>> In the "newstore direction" thread on ceph-devel I wrote that I'm
>>>>> using bcache in production and Mark Nelson asked me to share some
>>>>> details.
>>>>>
>>>>> Bcache is now running in two clusters that I manage, but I'll keep
>>>>> this information to one of them (the one at PCextreme behind
>>>>> CloudStack).
>>>>>
>>>>> This cluster has been running for over 2 years now:
>>>>>
>>>>> epoch 284353
>>>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>>>> created 2013-09-23 11:06:11.819520
>>>>> modified 2015-10-20 15:27:48.734213
>>>>>
>>>>> The system consists of 39 hosts:
>>>>>
>>>>> 2U SuperMicro chassis:
>>>>> * 80GB Intel SSD for OS
>>>>> * 240GB Intel S3700 SSD for journaling + bcache
>>>>> * 6x 3TB disks
>>>>>
>>>>> This isn't the newest hardware. The next batch of hardware will have
>>>>> more disks per chassis, but this is it for now.
>>>>>
>>>>> All systems were installed with Ubuntu 12.04, but they are all
>>>>> running 14.04 with bcache now.
>>>>>
>>>>> The Intel S3700 SSD is partitioned with a GPT label:
>>>>> - 5GB journal for each OSD
>>>>> - 200GB partition for bcache
>>>>>
>>>>> root@ceph11:~# df -h|grep osd
>>>>> /dev/bcache0  2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
>>>>> /dev/bcache1  2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
>>>>> /dev/bcache2  2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
>>>>> /dev/bcache3  2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
>>>>> /dev/bcache4  2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
>>>>> /dev/bcache5  2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65
>>>>> root@ceph11:~#
>>>>>
>>>>> root@ceph11:~# lsb_release -a
>>>>> No LSB modules are available.
>>>>> Distributor ID: Ubuntu
>>>>> Description:    Ubuntu 14.04.3 LTS
>>>>> Release:        14.04
>>>>> Codename:       trusty
>>>>> root@ceph11:~# uname -r
>>>>> 3.19.0-30-generic
>>>>> root@ceph11:~#
>>>>>
>>>>> "apply_latency": {
>>>>>     "avgcount": 2985023,
>>>>>     "sum": 226219.891559000
>>>>> }
>>>>>
>>>>> What did we notice?
>>>>> - Fewer spikes on the disks
>>>>> - Lower commit latencies on the OSDs
>>>>> - Almost no 'slow requests' during backfills
>>>>> - A cache-hit ratio of about 60%
>>>>>
>>>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>>>
>>>>> For the next generation of hardware we are looking into 3U chassis
>>>>> with 16x 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we
>>>>> haven't tested those yet, so there is nothing to say about them.
>>>>>
>>>>> The current setup is 200GB of cache for 18TB of disks. The new setup
>>>>> will be 1200GB for 64TB; curious to see what that does.
>>>>>
>>>>> Our main conclusion, however, is that bcache smooths the I/O pattern
>>>>> towards the disks, and that gives an overall better response from
>>>>> the disks.
>>>>
>>>> Hi Wido, thanks for the big writeup! Did you guys happen to do any
>>>> benchmarking? I think Xiaoxi looked at flashcache a while back but
>>>> had mixed results if I remember right. It would be interesting to
>>>> know how bcache is affecting performance in different scenarios.
>>>
>>> No, we didn't do any benchmarking. Initially this cluster was built
>>> for just the RADOS Gateway, so we went with 2Gbit (2x 1Gbit) per
>>> machine. 90% is still Gbit networking and we are in the process of
>>> upgrading it all to 10Gbit.
>>>
>>> Since the 1Gbit network latency is about 4 times higher than 10Gbit,
>>> we aren't really benchmarking the cluster.
>>>
>>> What counts most for us is that we can do recovery operations without
>>> any slow requests.
>>>
>>> Before bcache we saw disks spike to 100% busy while a backfill was
>>> running. Now bcache smooths this out and we see peaks of maybe 70%,
>>> but that's it.
>>
>> In the testing I was doing to figure out our new lab hardware, I was
>> seeing SSDs handle recovery dramatically better than spinning disks as
>> well during ceph_test_rados runs. It might be worth digging in to see
>> what the I/O patterns look like. In the meantime though, it's very
>> interesting that bcache helps so much in this case. Good to know!
>
> To add to this: we still had to enable the hashpspool flag on a few
> pools, so we did. The degradation went to 39% on the cluster and it has
> been recovering for over 48 hours now.
>
> Not a single slow request while we had the OSD complaint time set to 5
> seconds. After setting this to 0.5 seconds we saw some slow requests,
> but nothing dramatic.
>
> For us bcache works really great with spinning disks.
>
> Wido
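For completeness, the knobs mentioned above map roughly to the settings
below; the pool name ('rbd') is only an example:

    # ceph.conf on the OSD hosts: keep recovery gentle and complain early
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    # default is 30 seconds
    osd op complaint time = 0.5

    # the same throttles can be changed at runtime:
    ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

    # enabling hashpspool on an existing pool is what triggers the data
    # movement / degradation mentioned above:
    ceph osd pool set rbd hashpspool true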
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com