> Message: 20
> Date: Thu, 17 Apr 2014 17:45:39 +0900
> From: Christian Balzer <ch...@gol.com>
> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
> Message-ID: <20140417174539.6c713...@batzmaru.gol.ad.jp>
> Content-Type: text/plain; charset=US-ASCII
>
> On Thu, 17 Apr 2014 12:58:55 +1000 Blair Bethwaite wrote:
>
> > Hi Kyle,
> >
> > Thanks for the response. Further comments/queries...
> >
> > > Message: 42
> > > Date: Wed, 16 Apr 2014 06:53:41 -0700
> > > From: Kyle Bader <kyle.ba...@gmail.com>
> > > Cc: ceph-users <ceph-users@lists.ceph.com>
> > > Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
> > > Message-ID: <CAFMfnwpr73UFYzGWxJ7AScnhq4BCa5gZRYgRx-DLar4uS=i...@mail.gmail.com>
> > > Content-Type: text/plain; charset=UTF-8
> > >
> > > >> Obviously the SSDs could be used as journal devices, but I'm not really
> > > >> convinced whether this is worthwhile when all nodes have 1GB of hardware
> > > >> writeback cache (writes to journal and data areas on the same spindle have
> > > >> time to coalesce in the cache and minimise seek time hurt). Any advice on
> > > >> this?
> > >
> > > All writes need to be written to the journal before being written to the
> > > data volume, so it's going to impact your overall throughput and cause
> > > seeking; a hardware cache will only help with the latter (unless you use
> > > btrfs).
>
> Indeed. Also, a 1GB cache having to serve 12 spindles isn't as impressive
> anymore when it comes down to per-disk cache (assuming more or less uniform
> activity).
> That hardware cache will also be used for reads (I've seen controllers that
> allow you to influence the read/write cache usage ratio, but none where you
> could disable caching of reads outright).
>
> Which leads me to another point: your journal SSDs will be hanging off that
> same controller as the OSD HDDs, meaning that they will compete for hardware
> cache space that would be much better used for the HDDs (again, I'm unaware
> of any controller that allows you to disable caching for individual disks).
>
> That's why, for my current first production cluster as well as any future
> ones, I am planning to separate the SSDs from the OSDs whenever possible.
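As a quick sanity check on the per-disk cache and journal double-write points, here's the arithmetic as a rough Python sketch (the figures are just the assumptions floating around this thread, not measurements):

    # Rough sketch: how far a 1GB controller cache stretches across 12 disks,
    # and what co-locating journal and data on a spindle does to write bandwidth.
    CACHE_MB = 1024      # hardware writeback cache per node (assumed)
    DISKS = 12           # disks per node sharing that cache
    SPINDLE_MBS = 150    # assumed sequential write rate per NL spindle

    # Cache share per disk if activity is roughly uniform.
    print("cache per disk: ~%d MB" % (CACHE_MB // DISKS))                 # ~85 MB

    # Journal + data on the same spindle means every write lands twice,
    # so effective client write bandwidth is roughly halved.
    print("effective write per spindle: ~%d MB/s" % (SPINDLE_MBS // 2))   # ~75 MB/s
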
So the PERC 710p, whilst not having the native JBOD mode of the underlying
LSI 2208 chipset, does allow per-virtual-disk cache and read-ahead mode
settings. It also supports "Cut-Through IO" (CTIO), apparently enabled when
the virtual-disk is set to no read-ahead and write-through caching. So my
draft plan for our hardware is 12x single-disk RAID0 virtual-disks, with the
3 SSDs set for CTIO.

> > Right, good point. So back-of-envelope calculations for throughput
> > scenarios based on our hardware, just saying 150MB/s r/w for the spindles
> > and 450/350MB/s r/w for the SSDs, and pretending no controller
> > bottlenecks etc:
> >
> > 1 OSD node (without SSD journals, hence divide by 2):
> > 9 * 150 / 2 = 675MB/s write throughput
> >
> Which is, even though extremely optimistic, quite below your network
> bandwidth.

Indeed (I'd say wildly optimistic, but for the sake of argument one has to
have some sort of number(s)).

> > 1 OSD node (with SSD journals):
> > min(9 * 150, 3 * 350) = 1050MB/s write throughput
> >
> > Aggregates for 12 OSD nodes: ~8GB/s versus ~12.6GB/s
> >
> You get to divide those aggregate numbers by your replication factor, and
> if you value your data that is 3.

Can anyone point to the reasoning/background behind the shift to favouring a
3x replication factor? When we started out it seemed that 2x was the
recommendation, and that's what we're running with at present. Our current
use case is RBD volumes for working data, and we're looking at integrating a
cold-storage option for long-term durability of those, so our replication is
mainly about availability. I assume 3x replication is more relevant for
radosgw? There was an interesting discussion a while back about calculating
data-loss probabilities under certain conditions, but it didn't seem to have
a definitive end...

> That replication will also eat into your network bandwidth, making a
> dedicated cluster network for replication potentially quite attractive.
> But since in your case the disk bandwidth per node is pretty close to the
> network bandwidth of 10GE, using the dual ports for a resilient public
> network might be a better approach.

The plan is to use L2 MSTP. We have multiple VLANs, e.g. client-access and
storage-private. They're bonded in an active/passive configuration with each
active on a different port, and the VLANs have independent root bridges. In a
port/cable/switch failure mode, all VLANs get squished over the same port.

> > So the general naive case seems like a no-brainer, we should use SSD
> > journals. But then we don't require even 8GB/s most of the time...
> >
> Well, first and foremost, people here seem to be obsessed with throughput;
> everybody clamors about that, and the rbd bench doesn't help either.
>
> Unless you have a very special use case of basically writing or reading a
> few large sequential files, you will run out of IOPS long before you run
> out of raw bandwidth.

Absolutely agree, and hence I'm fishing for the best configuration and use of
the SSDs we have. I'm absolutely not married to the max write-throughput
configuration if there are better options for sustained IOPS.

> And that's where caching/coalescing all along the way, from RBD cache for
> the VMs and SSD journals to the hardware cache of your controller, comes
> in. These will all allow you to have peak performance far over the
> sustainable IOPS of your backing HDDs, for some time at least.
> In your case the sustained rate for the cluster you outlined would be
> something (assuming 100 IOPS for those NL drives) like this:
>
> 100 (IOPS) x 9 (disks) x 12 (hosts) / 3 (replication ratio) = 3600 IOPS
>
> However that's ignoring all the other caches, in particular the controller
> HW cache, which can raise the sustainable level quite a bit.

Thanks for your useful comments! So I think we'll probably end up trying it
both ways - 2x replica writeback cache-tier and Bcache (I guess there are at
least two Bcache configs to test as well: 1) Bcache in writeback mode
fronting journal and data on spindles, 2) Bcache in writearound mode with
separate SSD journal partitions). Will share benchmarking results (real RBD
clients, not RADOSBench) when we have them in a couple of months. In the
meantime I've gathered the thread's back-of-envelope numbers into a rough
Python sketch below, for anyone who wants to tweak the assumptions.

--
Cheers,
~Blairo
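Rough Python sketch of the combined back-of-envelope numbers from this thread
(all the per-device figures are assumptions used above, not benchmark results;
adjust to taste):

    # Back-of-envelope cluster estimates using the rough figures from this thread.
    SPINDLE_MBS  = 150   # assumed sequential write per NL spindle
    SSD_MBS      = 350   # assumed sequential write per journal SSD
    SPINDLES     = 9     # OSD spindles per node
    SSDS         = 3     # journal SSDs per node
    NODES        = 12
    REPLICAS     = 3     # Christian's "if you value your data" factor
    SPINDLE_IOPS = 100   # assumed sustained IOPS per NL spindle

    # Per-node write throughput: co-located journals halve spindle bandwidth;
    # with SSD journals, the slower of total spindle vs total journal bandwidth
    # is the ceiling.
    node_no_ssd = SPINDLES * SPINDLE_MBS / 2                    # 675 MB/s
    node_ssd    = min(SPINDLES * SPINDLE_MBS, SSDS * SSD_MBS)   # 1050 MB/s

    # Aggregate client-visible write bandwidth after replication.
    print("no SSD journals: ~%d MB/s" % (node_no_ssd * NODES / REPLICAS))  # ~2700
    print("SSD journals:    ~%d MB/s" % (node_ssd * NODES / REPLICAS))     # ~4200

    # Sustained write IOPS once the various caches are exhausted.
    print("sustained IOPS:  ~%d" % (SPINDLE_IOPS * SPINDLES * NODES / REPLICAS))  # 3600
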
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com