On Fri, 18 Apr 2014 11:34:15 +1000 Blair Bethwaite wrote:

> On Thu, 17 Apr 2014 17:45:39 +0900, Christian Balzer <ch...@gol.com>
> wrote in "Re: [ceph-users] SSDs: cache pool/tier versus node-local
> block cache":
>
> > On Thu, 17 Apr 2014 12:58:55 +1000 Blair Bethwaite wrote:
> >
> > > Hi Kyle,
> > >
> > > Thanks for the response. Further comments/queries...
> > >
> > > On Wed, 16 Apr 2014 06:53:41 -0700, Kyle Bader
> > > <kyle.ba...@gmail.com> wrote:
> > >
> > > > >> Obviously the ssds could be used as journal devices, but I'm
> > > > >> not really convinced whether this is worthwhile when all
> > > > >> nodes have 1GB of hardware writeback cache (writes to journal
> > > > >> and data areas on the same spindle have time to coalesce in
> > > > >> the cache and minimise seek time hurt). Any advice on this?
> > > >
> > > > All writes need to be written to the journal before being
> > > > written to the data volume, so it's going to impact your overall
> > > > throughput and cause seeking; a hardware cache will only help
> > > > with the latter (unless you use btrfs).
> > > >
> > Indeed. Also a 1GB cache having to serve 12 spindles isn't as
> > impressive anymore when it comes down to per-disk cache (assuming
> > more or less uniform activity).
> > That hardware cache will also be used for reads (I've seen
> > controllers that allow you to influence the read/write cache usage
> > ratio, but none where you could disable caching reads outright).
> >
> > Which leads me to another point: your journal SSDs will be hanging
> > off the same controller as the OSD HDDs.
> > That means they will compete for hardware cache space that would be
> > much better used for the HDDs (again, I'm unaware of any controller
> > that allows disabling the cache for individual disks).
> >
> > That's why for my current first production cluster, as well as any
> > future ones, I am planning to separate the SSDs from the OSDs
> > whenever possible.
>
> So the PERC 710p, whilst not having the native JBOD mode of the
> underlying LSI 2208 chipset, does allow per-virtual-disk cache and
> read-ahead mode settings. It also supports "Cut-Through IO" (CTIO),
> apparently enabled when the virtual-disk is set to no read-ahead and
> write-through caching. So my draft plan is that for our hardware
> we'll have 12x single-RAID0 virtual-disks, and the 3 SSDs will be set
> for CTIO.
>
Ah, I've seen similar stuff with LSI 2108s, but not the CTIO bit.

What tends to be annoying about these single-drive RAID0 virtual disks
is that the real drive is shielded from the OS. And with a cluster of
your size SMART data can and will be immensely helpful.
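For what it's worth, smartmontools can usually still reach the physical
drives behind these LSI-based controllers via its megaraid passthrough,
so not all is lost. Something like the following (the device name and
target number are just examples, the megaraid number has to match the
drive's device ID as reported by the controller):

  # Query SMART data for the physical disk hiding behind a
  # single-drive RAID0 virtual disk:
  smartctl -a -d megaraid,0 /dev/sda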
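And on the CTIO side, if the PERC speaks standard MegaCli like its LSI
siblings, creating the single-drive RAID0 virtual disks with the
matching cache policies would look roughly like this. A sketch only;
the enclosure:slot numbers are made up and the exact policy names
should be checked against Dell's documentation:

  # SSD VDs: WT = write-through, NORA = no read-ahead, Direct = no
  # controller caching, i.e. the conditions under which CTIO is
  # supposedly enabled:
  MegaCli -CfgLdAdd -r0 [32:0] WT NORA Direct -a0
  # HDD VDs: presumably you want the writeback cache and read-ahead
  # here instead:
  MegaCli -CfgLdAdd -r0 [32:3] WB RA Cached -a0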
> > > Right, good point. So back-of-envelope calculations for
> > > throughput scenarios based on our hardware, just saying 150MB/s
> > > r/w for the spindles and 450/350MB/s r/w for the SSDs, and
> > > pretending no controller bottlenecks etc.:
> > >
> > > 1 OSD node (without SSD journals, hence divide by 2):
> > > 9 * 150 / 2 = 675MB/s write throughput
> > >
> > Which is, even though extremely optimistic, quite below your
> > network bandwidth.
>
> Indeed (I'd say wildly optimistic, but for the sake of argument one
> has to have some sort of numbers).
>
> > > 1 OSD node (with SSD journals):
> > > min(9 * 150, 3 * 350) = 1050MB/s write throughput
> > >
> > > Aggregates for 12 OSD nodes: ~8GB/s versus 12.5GB/s
> > >
> > You get to divide those aggregate numbers by your replication
> > factor, and if you value your data that is 3.
>
> Can anyone point to the reasoning/background behind the shift to
> favouring a 3x replication factor? When we started out it seemed that
> 2x was the recommendation, and that's what we're running with at
> present.
>
When I first looked at Ceph the default was 2, and everybody here and
at Inktank recommended 3. I think the default was/is going to be
changed to 3 as well.

> Current use case is RBD volumes for working data and we're looking at
> integrating a cold-storage option for long-term durability of those,
> so our replication is mainly about availability. I assume 3x
> replication is more relevant for radosgw? There was an interesting
> discussion a while back about calculating data-loss probabilities
> under certain conditions but it didn't seem to have a definitive
> end...
>
You're probably thinking about the thread called "Failure probability
with largish deployments" that I started last year.
You might want to revisit that thread; the reliability modeling
software by Inktank was coming up with decent enough numbers for both
RAID6 and a replication factor of 3. And as Kyle said in the last post
to that thread, it could do with some improvements in that modeling,
as it doesn't consider the number of disks and assumes full-speed
recovery with Ceph.

Either way, a replication factor of 2 is more akin to RAID5, and once
your cluster becomes half full, 2TB would have to be replicated after
a disk failure before the data is safe again. And my experience tells
me that another disk failure in that recovery window is just a
question of time. Heck, the CERN folks went for 4x replication for
really valuable data.

For cold or lukewarm storage, consider RAID6-backed OSDs, no SSD
journals, 2x replication. Slow to write to (IOPS-wise), but much
denser and cheaper than 3x replicated OSDs. And if you have a few of
those, still impressive reads. ^o^

> > That replication will also eat into your network bandwidth, making
> > a dedicated cluster network for replication potentially quite
> > attractive.
> > But since in your case the disk bandwidth per node is pretty close
> > to the network bandwidth of 10GbE, using the dual ports for a
> > resilient public network might be a better approach.
>
> Plan is to use L2 MSTP. So we have multiple VLANs, e.g. client-access
> and storage-private. They're bonded in active/passive configuration,
> with each active on a different port and the VLANs having independent
> root bridges. In a port/cable/switch failure mode all VLANs get
> squished over the same port.
>
That sounds like a good plan indeed.

> > > So the general naive case seems like a no-brainer, we should use
> > > SSD journals. But then we don't require even 8GB/s most of the
> > > time...
> > >
> > Well, first and foremost people here seem to be obsessed with
> > throughput; everybody clamors about that, and the rbd bench doesn't
> > help either.
> >
> > Unless you have a very special use case of basically writing or
> > reading a few large sequential files, you will run out of IOPS long
> > before you run out of raw bandwidth.
>
> Absolutely agree, and hence why I'm fishing for the best
> configuration and use of the SSDs we have. I'm absolutely not married
> to the max write-throughput configuration if there are better options
> for sustained IOPS.
>
> > And that's where caching/coalescing all along the way, from the RBD
> > cache for the VMs and the SSD journals to the hardware cache of
> > your controller, comes in. These will all allow you to have peak
> > performance far over the sustainable IOPS of your backing HDDs, for
> > some time at least.
> >
> > In your case the sustained rate for the cluster you outlined would
> > be something (assuming 100 IOPS for those NL drives) like this:
> >
> > 100 (IOPS) x 9 (disks) x 12 (hosts) / 3 (replication factor) = 3600 IOPS
> >
> > However that's ignoring all the other caches, in particular the
> > controller HW cache, which can raise the sustainable level quite a
> > bit.
>
> Thanks for your useful comments!
>
> So I think we'll probably end up trying it both ways: a 2x replica
> writeback cache-tier versus bcache (I guess there are at least two
> bcache configs to test as well:
> 1) bcache in writeback mode fronting journal and data on spindles,
> 2) bcache in writearound mode with separate SSD journal partitions).
> Will share benchmarking results (real RBD clients, not RADOS bench)
> when we have them in a couple of months.
>
I didn't comment on that part of your mail because I have zero
experience yet with these caches, but I have been thinking about using
them in similar ways to those you intend to try. And I am intending to
try things out with the next cluster as well.
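For what it's worth, from my reading of the bcache documentation (so a
sketch only, untested on my side, and the device names are made up),
your two test configs would be set up along these lines:

  # Config 1: bcache in writeback mode, journal and data both living
  # behind the cache (/dev/sdb = HDD, /dev/sdc = SSD):
  make-bcache -B /dev/sdb      # format the backing device
  make-bcache -C /dev/sdc      # format the caching device
  # Attach the cache set to the backing device; the UUID comes from
  # bcache-super-show or /sys/fs/bcache/:
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode

  # Config 2: the same, but with the cache in writearound mode and the
  # journals on separate raw SSD partitions instead:
  echo writearound > /sys/block/bcache0/bcache/cache_mode

I'd be curious how writeback mode holds up once the cache has to evict
under sustained writes, so those benchmarks will be very welcome.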
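And for reference, the knobs touched on above would live in ceph.conf
roughly like this (the subnets are placeholders and the defaults may
differ per release, so do check the documentation):

  [global]
      # replication factor for newly created pools:
      osd pool default size = 3
      # client-facing traffic:
      public network = 10.0.0.0/24
      # OSD replication/recovery traffic, if you split it out:
      cluster network = 10.0.1.0/24

  [client]
      # RBD writeback cache for the VMs:
      rbd cache = true

Existing pools can be bumped at runtime with something like
"ceph osd pool set <pool> size 3".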
Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com