On Thu, 17 Apr 2014 12:58:55 +1000 Blair Bethwaite wrote:

> Hi Kyle,
> 
> Thanks for the response. Further comments/queries...
> 
> > Message: 42
> > Date: Wed, 16 Apr 2014 06:53:41 -0700
> > From: Kyle Bader <kyle.ba...@gmail.com>
> > Cc: ceph-users <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local
> >         block cache
> > Message-ID:
> >         <CAFMfnwpr73UFYzGWxJ7AScnhq4BCa5gZRYgRx-DLar4uS=i...@mail.gmail.com>
> > Content-Type: text/plain; charset=UTF-8
> >
> > >> Obviously the ssds could be used as journal devices, but I'm not
> > >> really convinced whether this is worthwhile when all nodes have 1GB
> > >> of hardware writeback cache (writes to journal and data areas on the
> > >> same spindle have time to coalesce in the cache and minimise seek
> > >> time hurt). Any advice on this?
> >
> > All writes need to be written to the journal before being written to
> > the data volume so it's going to impact your overall throughput and
> > cause seeking, a hardware cache will only help with the latter (unless
> > you use btrfs).
> 

Indeed. Also, a 1GB cache having to serve 12 spindles isn't as impressive
anymore once you break it down to a per-disk share (assuming more or less
uniform activity).
That hardware cache will also be used for reads (I've seen controllers
that let you influence the read/write cache usage ratio, but none where
you could disable read caching outright).
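
To put rough numbers on it, a quick Python sketch (assuming the cache is
split evenly across the drives, which it of course won't be in practice):

  # per-disk share of a 1GB controller cache across 12 drives
  cache_mb = 1024
  drives = 12
  print(cache_mb / drives)  # ~85MB per drive, before reads eat into it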

Which leads me to another point: your journal SSDs will be hanging off
the same controller as the OSD HDDs, meaning they will compete for
hardware cache space that would be much better used for the HDDs (again,
I'm not aware of any controller that lets you disable caching for
individual disks).

That's why, for my current first production cluster as well as any future
ones, I am planning to separate the SSDs from the OSD HDDs whenever
possible.

> Right, good point. So back of envelope calculations for throughput
> scenarios based on our hardware, just saying 150MB/s r/w for the spindles
> and 450/350MB/s r/w for the ssds, and pretending no controller
> bottlenecks etc:
> 
> 1 OSD node (without ssd journals, hence divide by 2):
> 9 * 150 / 2 = 675MB/s write throughput
> 
Which, even though extremely optimistic, is still well below your network
bandwidth.

> 1 OSD node (with ssd journals):
> min(9 * 150, 3 * 350) = 1050MB/s write throughput
> 
> > Aggregates for 12 OSD nodes: ~8GB/s versus 12.5GB/s
> 
You get to divide those aggregate numbers by your replication factor, and
if you value your data that is 3.
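
To make that concrete, here is the same back-of-envelope math in a few
lines of Python (same assumed numbers as above, still pretending there are
no controller or network limits):

  # per-node write throughput
  spindles, spindle_mb, ssds, ssd_write_mb = 9, 150, 3, 350
  colocated = spindles * spindle_mb / 2  # journal on the data disks: 675 MB/s
  ssd_journal = min(spindles * spindle_mb, ssds * ssd_write_mb)  # 1050 MB/s

  # client-visible cluster throughput: multiply by nodes, divide by replication
  nodes, replication = 12, 3
  print(colocated * nodes / replication)    # ~2700 MB/s
  print(ssd_journal * nodes / replication)  # ~4200 MB/s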

That replication will also eat into your network bandwidth, making a
dedicated cluster network for replication potentially quite attractive.
But since in your case the disk bandwidth per node is pretty close to the
network bandwidth of 10GE, using the dual ports for a resilient public
network might be a better approach.
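
For reference, a dedicated cluster network is just two lines in ceph.conf;
the subnets below are of course made up:

  [global]
      public network  = 192.168.0.0/24
      cluster network = 192.168.1.0/24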

> So the general naive case seems like a no-brainer, we should use SSD
> journals. But then we don't require even 8GB/s most of the time...
> 
Well, first and foremost, people here seem to be obsessed with throughput;
everybody clamors about it, and the rbd bench doesn't help either.

Unless you have a very special use case of basically writing or reading a
few large sequential files, you will run out of IOPS long before you run
out of raw bandwidth. 

And that's where caching/coalescing all along the way comes in, from the
RBD cache for the VMs and the SSD journals to the hardware cache of your
controller.
These will all allow you to achieve peak performance far above the
sustainable IOPS of your backing HDDs, for some time at least.
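
For the RBD cache part of that, the relevant client-side settings look
roughly like this (the size values shown are the defaults as far as I
remember, check the documentation for your version):

  [client]
      rbd cache = true
      rbd cache size = 33554432                   # 32MB per client
      rbd cache max dirty = 25165824              # 24MB
      rbd cache writethrough until flush = true   # stay in writethrough until the guest flushes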

In your case, the sustained write rate for the cluster you outlined would
be something like this (assuming 100 IOPS for those NL drives):

 100 (IOPS) x 9 (disks) x 12 (hosts) / 3 (replication factor) = 3600 IOPS

However that's ignoring all the other caches, in particular the controller
HW cache, which can raise the sustainable level quite a bit.
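
The same sketch in Python, with a purely made-up fraction of writes
absorbed/coalesced by those caches just to illustrate the effect:

  # sustained write IOPS of the backing disks
  iops_per_disk, disks, hosts, replication = 100, 9, 12, 3
  sustained = iops_per_disk * disks * hosts / replication  # 3600

  # hypothetical: if the caches absorb half the writes, the client-visible
  # rate doubles -- until the caches fill up
  absorbed = 0.5
  print(sustained, sustained / (1 - absorbed))  # -> 3600 and 7200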

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
