Hi Christian,

On 18 April 2014 12:28, Christian Balzer <ch...@gol.com> wrote:
>
> On Fri, 18 Apr 2014 11:34:15 +1000 Blair Bethwaite wrote:
> > So the PERC 710p, whilst not having the native JBOD mode of the
> > underlying LSI 2208 chipset, does allow per-virtual-disk cache and
> > read-ahead mode settings. It also supports "Cut-Through IO" (CTIO),
> > apparently enabled when the virtual disk is set to no read-ahead and
> > write-through caching. So my draft plan is that for our hardware we'll
> > have 12x single-drive RAID0 virtual disks; the 3 SSDs will be set for CTIO.
> >
> Ah, I've seen similar stuff with LSI 2108s, but not the CTIO bit.
> What tends to be annoying about these single-drive RAID0 virtual disks
> is that the real drive is shielded from the OS. And with a cluster of
> your size, SMART data can and will be immensely helpful.

Yes, I think earlier PERCs were bad for that, but at least the 700 and
800 lines advertise SMART support and it works. The smartctl man page
says:
=====
       megaraid,N - [Linux only] the device consists of one or more
       SCSI/SAS disks connected to a MegaRAID controller. The
       non-negative integer N (in the range of 0 to 127 inclusive)
       denotes which disk on the controller is monitored. Use syntax
       such as:
           smartctl -a -d megaraid,2 /dev/sda
           smartctl -a -d megaraid,0 /dev/sdb
       This interface will also work for Dell PERC controllers. The
       following /dev/XXX entry must exist:
           For PERC2/3/4 controllers: /dev/megadev0
           For PERC5/6 controllers: /dev/megaraid_sas_ioctl_node
=====

On our current set of R720XDs (no SSDs in these) I find "smartctl -d
megaraid,0 -a /dev/sda" through "smartctl -d megaraid,11 -a /dev/sda"
give me details of the 12x front-bay data drives (NB: the actual
device node doesn't seem to matter so long as it exists), whilst
"smartctl -d megaraid,12 -a /dev/sda" and "smartctl -d megaraid,13 -a
/dev/sda" are the internal drives (which we have in a RAID1 for the OS
etc.).
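
To sweep a whole node you can do something like the following (a rough
sketch only; the 0-13 slot range just matches the R720XD layout
described above, and /dev/sda merely has to exist):

  for i in $(seq 0 13); do
      echo "=== megaraid,$i ==="
      smartctl -d megaraid,$i -H -i /dev/sda
  done

Swap -H -i for -a if you want the full attribute dump rather than just
the identity and overall health check.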

> > Current use case is RBD volumes for working data and we're looking at
> > integrating a cold-storage option for long-term durability of those, so
> > our replication is mainly about availability. I assume 3x replication is
> > more relevant for radosgw? There was an interesting discussion a while
> > back about calculating data-loss probabilities under certain conditions
> > but it didn't seem to have a definitive end...
> >
> You're probably thinking about the thread called
> "Failure probability with largish deployments" that I started last year.
>
> You might want to revisit that thread; the reliability modeling software
> by Inktank was coming up with decent enough numbers for both RAID6 and a
> replication factor of 3.
> And as Kyle said in the last post to the thread, it could do with some
> improvements in that modeling, as it doesn't consider the number of disks
> and assumes full-speed recovery with Ceph.
>
> Either way, a replication factor of 2 is more akin to RAID5, and once your
> cluster becomes half full, 2TB would have to be re-replicated after a disk
> failure before the data is safe again. And my experience tells me that
> another disk failure in that recovery window is just a question of time.

Murphy's Law.

I guess this is why the Gluster/RHS folks suggest host-level RAID as
well. Erasure coding really seems to be the way to go to avoid all
this, but then the question becomes "how big does my cache tier need
to be?".

> Heck, the CERN folks went for 4x replication for really valuable data.
>
> For cold or lukewarm storage, consider RAID6-backed OSDs with no SSD
> journals and 2x replication.
> Slow to write to (IOPS wise), but much denser, cheaper than a 3x replicated
> OSD. And if you have a few of those, still impressive reads. ^o^

We are considering something that dumps RBDs (tagged for durability)
out to tape. We'd likely do this asynchronously to start with, perhaps
triggered on "detach". The icing on the cake would be a way to then
transparently zero the RBD but recall it if needed.
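
In case it helps picture the mechanics, it could be as simple as
something like this (purely an illustrative sketch; the pool/image
names, snapshot name and tape device are made up, and it ignores
cataloguing and recall entirely):

  # take a snapshot for a consistent point-in-time image,
  # then stream it straight to the tape drive
  rbd snap create volumes/important-vol@archive
  rbd export volumes/important-vol@archive - | dd of=/dev/nst0 bs=1M

The real work, of course, would be in the cataloguing and in making
the recall transparent.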

--
Cheers,
~Blairo