Hi Kyle,

Thanks for the response. Further comments/queries...
> Date: Wed, 16 Apr 2014 06:53:41 -0700
> From: Kyle Bader <kyle.ba...@gmail.com>
> Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
>
> >> Obviously the ssds could be used as journal devices, but I'm not really
> >> convinced whether this is worthwhile when all nodes have 1GB of hardware
> >> writeback cache (writes to journal and data areas on the same spindle have
> >> time to coalesce in the cache and minimise seek time hurt). Any advice on
> >> this?
>
> All writes need to be written to the journal before being written to
> the data volume, so it's going to impact your overall throughput and
> cause seeking; a hardware cache will only help with the latter (unless
> you use btrfs).

Right, good point. So, back-of-envelope calculations for throughput
scenarios based on our hardware, assuming 150MB/s r/w for the spindles and
450/350MB/s r/w for the ssds, and pretending there are no controller
bottlenecks etc. (see the P.S. for this arithmetic in reusable form):

1 OSD node without ssd journals (journal and data share each spindle,
hence divide by 2):
    9 * 150 / 2 = 675MB/s write throughput

1 OSD node with ssd journals:
    min(9 * 150, 3 * 350) = 1050MB/s write throughput

Aggregates for 12 OSD nodes: ~8GB/s versus ~12.5GB/s.

So in the general naive case it seems like a no-brainer: we should use SSD
journals. But then we don't require even 8GB/s most of the time...

> >> I think the timing should work that we'll be deploying with Firefly and so
> >> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
> >> versus Tier to act as node-local block cache device. Does anybody have real
> >> or anecdotal evidence about which approach has better performance?
> >
> > New idea that is dependent on failure behaviour of the cache tier...
>
> The problem with this type of configuration is it ties a VM to a
> specific hypervisor; in theory it should be faster because you don't
> have network latency from round trips to the cache tier, resulting in
> higher iops. Large sequential workloads may achieve higher throughput
> by parallelizing across many OSDs in a cache tier, whereas local flash
> would be limited to single device throughput.

Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm
really looking at:

    2-copy write-back object ssd cache-pool
        versus
    OSD write-back ssd block-cache (e.g. Bcache)
        versus
    1-copy write-around object cache-pool & ssd journal

(I've put a rough sketch of how that third option might be set up a bit
further down.)

> > Carve the ssds 4-ways: each with 3 partitions for journals servicing the
> > backing data pool and a fourth larger partition serving a write-around cache
> > tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
> > capacity is not halved by replication for availability.
> >
> > ...The crux is how the current implementation behaves in the face of cache
> > tier OSD failures?
>
> Cache tiers are durable by way of replication or erasure coding, OSDs
> will remap degraded placement groups and backfill as appropriate. With
> single replica cache pools loss of OSDs becomes a real concern, in the
> case of RBD this means losing arbitrary chunk(s) of your block devices
> - bad news. If you want host independence, durability and speed, your
> best bet is a replicated cache pool (2-3x).

This is undoubtedly true for a write-back cache-tier.
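To make that third option concrete, I imagine the cache-pool half of it
would be created roughly like this (a sketch only - the pool name, PG
counts and crush ruleset number are made up, and I'm assuming Firefly's
'readonly' cache mode is the closest match to "write-around"; part of what
I'm asking is whether that assumption holds):

    # 1-copy cache pool on the ssd partitions, in front of an existing
    # backing pool (here called 'rbd'); assumes crush ruleset 4 selects
    # only the ssd OSDs
    ceph osd pool create ssd-cache 1024 1024
    ceph osd pool set ssd-cache crush_ruleset 4
    ceph osd pool set ssd-cache size 1

    # attach it to the backing pool as a read cache
    ceph osd tier add rbd ssd-cache
    ceph osd tier cache-mode ssd-cache readonly

The three journal partitions per ssd would then just be pointed at by each
OSD's "osd journal" setting in the usual way.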
But in the scenario I'm suggesting, a write-around cache, that needn't be
bad news - if a cache-tier OSD is lost then the cache just gets smaller and
some cached objects are unceremoniously flushed out of it. The next read on
those objects should simply miss and bring them back into the now smaller
cache.

The thing I'm trying to avoid with the above is double read-caching of
objects (so as to get more aggregate read cache). I assume the standard
wisdom with write-back cache-tiering is that the backing data pool
shouldn't bother with ssd journals?

--
Cheers,
~Blairo
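P.S. In case anyone wants to plug their own device counts and speeds into
the back-of-envelope arithmetic above, here it is in reusable form - just a
rough sketch, and the figures are the assumptions from this thread rather
than measured numbers:

    # assumed: 9 spindles @ 150MB/s, 3 ssds @ 350MB/s write, 12 OSD nodes
    SPINDLES=9; SPINDLE_MBS=150; SSDS=3; SSD_WR_MBS=350; NODES=12

    # journals co-located on the spindles: every client write is written twice
    COLOCATED=$(( SPINDLES * SPINDLE_MBS / 2 ))

    # journals on ssd: limited by whichever side saturates first
    DISK=$(( SPINDLES * SPINDLE_MBS ))
    SSD=$(( SSDS * SSD_WR_MBS ))
    JOURNALED=$(( DISK < SSD ? DISK : SSD ))

    echo "per node:  ${COLOCATED}MB/s vs ${JOURNALED}MB/s"
    echo "aggregate: $(( COLOCATED * NODES ))MB/s vs $(( JOURNALED * NODES ))MB/s"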