> Message: 20
> Date: Thu, 17 Apr 2014 17:45:39 +0900
> From: Christian Balzer <ch...@gol.com>
> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local
>         block cache
> Message-ID: <20140417174539.6c713...@batzmaru.gol.ad.jp>
> Content-Type: text/plain; charset=US-ASCII
>
> On Thu, 17 Apr 2014 12:58:55 +1000 Blair Bethwaite wrote:
>
> > Hi Kyle,
> >
> > Thanks for the response. Further comments/queries...
> >
> > > Message: 42
> > > Date: Wed, 16 Apr 2014 06:53:41 -0700
> > > From: Kyle Bader <kyle.ba...@gmail.com>
> > > Cc: ceph-users <ceph-users@lists.ceph.com>
> > > Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local
> > >         block cache
> > > Message-ID:
> > >         <CAFMfnwpr73UFYzGWxJ7AScnhq4BCa5gZRYgRx-DLar4uS=i...@mail.gmail.com>
> > > Content-Type: text/plain; charset=UTF-8
> > >
> > > >> Obviously the ssds could be used as journal devices, but I'm not
> > > >> really convinced whether this is worthwhile when all nodes have
> > > >> 1GB of hardware writeback cache (writes to journal and data areas
> > > >> on the same spindle have time to coalesce in the cache and
> > > >> minimise seek time hurt). Any advice on this?
> > >
> > > All writes need to be written to the journal before being written to
> > > the data volume, so it's going to impact your overall throughput and
> > > cause seeking; a hardware cache will only help with the latter
> > > (unless you use btrfs).
> >
>
> Indeed. Also, a 1GB cache having to serve 12 spindles isn't so impressive
> any more once you work out the per-disk share (assuming more or less
> uniform activity).
> That hardware cache will also be used for reads (I've seen controllers
> that allow you to influence the read/write cache usage ratio, but none
> where you could disable read caching outright).
>
> Which leads me to another point: your journal SSDs will be hanging off
> the same controller as the OSD HDDs, meaning they will compete for
> hardware cache space that would be much better used for the HDDs
> (again, I'm unaware of any controller that allows you to disable
> caching for individual disks).
>
> That's why for my current first production cluster, as well as any
> future ones, I am planning to separate the SSDs from the OSDs whenever
> possible.

So the PERC 710p, whilst not having the native JBOD mode of the underlying
LSI 2208 chipset, does allow per-virtual-disk cache and read-ahead mode
settings. It also supports "Cut-Through IO" (CTIO), apparently enabled when
a virtual disk is set to no read-ahead and write-through caching. So my
draft plan for our hardware is 12x single-drive RAID0 virtual disks, with
the 3 SSDs set for CTIO.
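
For reference, something like the following is how I'd expect to flip those
per-VD policies from Linux via MegaCli. This is just a sketch under my
assumptions (that the usual LDSetProp switches behave the same on this
firmware, and that VDs 9-11 turn out to be the SSDs); both would need
checking on the actual box:

    import subprocess

    MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"  # install path varies

    def ld_set_prop(prop, vd, adapter=0):
        """Set one virtual-disk property via MegaCli (unverified on this HW)."""
        subprocess.check_call(
            [MEGACLI, "-LDSetProp", prop, "-L%d" % vd, "-a%d" % adapter])

    # Hypothetical layout: VDs 0-8 are the single-drive RAID0 spinners,
    # VDs 9-11 are the SSDs we want in no-read-ahead/write-through (CTIO) mode.
    for vd in (9, 10, 11):
        ld_set_prop("WT", vd)      # write-through
        ld_set_prop("NORA", vd)    # no read-ahead
        ld_set_prop("Direct", vd)  # direct (uncached) I/O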

> > Right, good point. So back of envelope calculations for throughput
> > scenarios based on our hardware, just saying 150MB/s r/w for the
> > spindles and 450/350MB/s r/w for the ssds, and pretending no
> > controller bottlenecks etc:
> >
> > 1 OSD node (without ssd journals, hence divide by 2):
> > 9 * 150 / 2 = 675MB/s write throughput
> >
> Which is, even though extremely optimistic, quite below your network
> bandwidth.

Indeed (I'd say wildly optimistic, but for the sake of argument one has to
have some sort of numbers).

> > 1 OSD node (with ssd journals):
> > min(9 * 150, 3 * 350) = 1050MB/s write throughput
> >
> > Aggregates for 12 OSD nodes: ~8GB/s versus ~12.5GB/s
> >
> You get to divide those aggregate numbers by your replication factor and
> if you value your data that is 3.
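
For anyone who wants to play with the assumptions, here is the same
back-of-envelope maths as a small Python sketch (the per-device MB/s
figures, node count and replication factor are just the assumed numbers
from above, not measurements):

    SPINDLE_MBS = 150        # assumed write MB/s per NL spindle
    SSD_WRITE_MBS = 350      # assumed write MB/s per journal SSD
    SPINDLES_PER_NODE = 9
    SSDS_PER_NODE = 3
    NODES = 12
    REPLICATION = 3

    def node_write_mbs(ssd_journals):
        """Peak write MB/s of one OSD node, ignoring controller/network limits."""
        if ssd_journals:
            # Journals on SSD: limited by the slower of data spindles or SSDs.
            return min(SPINDLES_PER_NODE * SPINDLE_MBS,
                       SSDS_PER_NODE * SSD_WRITE_MBS)
        # Journal and data on the same spindle: every byte is written twice.
        return SPINDLES_PER_NODE * SPINDLE_MBS / 2.0

    for ssd in (False, True):
        node = node_write_mbs(ssd)
        cluster = node * NODES
        client = cluster / REPLICATION  # each client byte is written REPLICATION times
        print("ssd_journals=%s: %d MB/s/node, ~%.1f GB/s raw, ~%.1f GB/s client"
              % (ssd, node, cluster / 1000.0, client / 1000.0))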

Can anyone point to the reasoning/background behind the shift to favouring
a 3x replication factor? When we started out it seemed that 2x was the
recommendation, and that's what we're running with at present. Current use
case is RBD volumes for working data and we're looking at integrating a
cold-storage option for long-term durability of those, so our replication
is mainly about availability. I assume 3x replication is more relevant for
radosgw? There was an interesting discussion a while back about calculating
data-loss probabilities under certain conditions but it didn't seem to have
a definitive end...

> That replication will also eat into your network bandwidth, making a
> dedicated cluster network for replication potentially quite attractive.
> But since in your case the disk bandwidth per node is pretty close to the
> network bandwidth of 10GE, using the dual ports for a resilient public
> network might be a better approach.

The plan is to use L2 MSTP. So we have multiple VLANs, e.g. client-access
and storage-private, bonded in an active/passive configuration with each
VLAN active on a different port and the VLANs having independent root
bridges. In a port/cable/switch failure all VLANs get squished over the
same port.

> > So the general naive case seems like a no-brainer, we should use SSD
> > journals. But then we don't require even 8GB/s most of the time...
> >
> Well, first and foremost, people here seem to be obsessed with throughput;
> everybody clamors about that and the rbd bench doesn't help either.
>
> Unless you have a very special use case of basically writing or reading a
> few large sequential files, you will run out of IOPS long before you run
> out of raw bandwidth.

Absolutely agree, which is why I'm fishing for the best configuration and
use of the SSDs we have. I'm not married to the max write-throughput
configuration if there are better options for sustained IOPS.

> And that's where caching/coalescing all along the way, from the RBD cache
> for the VMs and the SSD journals to the hardware cache of your controller,
> comes in.
> These will all allow you to have peak performance far over the sustainable
> IOPS of your backing HDDs, for some time at least.
>
> In your case that sustained rate for the cluster you outlined would be
> something (assuming 100 IOPS for those NL drives) like this:
>
>  100(IOPS) x 9(disk) x 12(hosts) / 3(replication ratio) = 3600 IOPS
>
> However that's ignoring all the other caches, in particular the controller
> HW cache, which can raise the sustainable level quite a bit.

Thanks for your useful comments!
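
Just to make sure I've got that estimate straight (and so anyone can plug
in their own drive/replication numbers), here it is as a quick sketch; the
100 IOPS per NL drive is the assumption from above, and every cache in the
path is deliberately ignored:

    IOPS_PER_DISK = 100   # assumed for 7.2k NL drives
    DISKS_PER_HOST = 9
    HOSTS = 12
    REPLICATION = 3

    sustained_iops = IOPS_PER_DISK * DISKS_PER_HOST * HOSTS // REPLICATION
    print(sustained_iops)  # 3600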

So I think we'll probably end up trying it both ways: a 2x-replica
writeback cache-tier versus Bcache (I guess there are at least two Bcache
configs to test as well: 1) Bcache in writeback mode fronting journal and
data on the spindles, 2) Bcache in writearound mode with separate SSD
journal partitions). I'll share benchmarking results (real RBD clients,
not RADOSBench) when we have them in a couple of months.
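
In case it helps anyone following along, here is roughly how I'd expect to
stand the two Bcache variants up (a sketch only; device paths are
placeholders and the cache_mode sysfs knob is the stock bcache one):

    import subprocess

    CACHE_SSD = "/dev/sdj1"    # placeholder: SSD partition used as bcache cache
    BACKING_HDD = "/dev/sdb"   # placeholder: one OSD spinner

    def make_bcache(cache, backing):
        """Format cache + backing device together so they attach automatically."""
        subprocess.check_call(["make-bcache", "-C", cache, "-B", backing])

    def set_cache_mode(bcache_dev, mode):
        """mode: writethrough, writeback, writearound or none."""
        with open("/sys/block/%s/bcache/cache_mode" % bcache_dev, "w") as f:
            f.write(mode + "\n")

    # Config 1: journal and data both live on the bcache device, writeback mode.
    # Config 2: data on bcache in writearound mode, journal on its own SSD partition.
    make_bcache(CACHE_SSD, BACKING_HDD)
    set_cache_mode("bcache0", "writeback")   # switch to "writearound" for config 2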

--
Cheers,
~Blairo
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
