On 7/25/19 9:27 PM, Anthony D'Atri wrote:
We run a few hundred HDD OSDs for our backup cluster; we set one RAID 0 per HDD 
in order to be able to use the (battery-protected) write cache from the RAID 
controller. It really improves performance, for both bluestore and filestore OSDs.
Having run something like 6000 HDD-based FileStore OSDs with colo journals on 
RAID HBAs I’d like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.

Details:

* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone; 
try as I might, I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the 
OS.  When the architecture was conceived, the HBAs in question didn’t have 
JBOD/passthrough, though a firmware update shortly thereafter did bring that 
ability.  That caching was a function of VDs wasn’t known at the time.
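
For reference, the per-drive VD setup described above (and in the quoted config) 
is usually scripted along these lines.  This is only a sketch: the storcli 
syntax, controller/enclosure IDs, and slot range are assumptions and vary by HBA 
generation and vendor tooling.

#!/usr/bin/env python3
# Sketch: create one single-drive RAID0 VD per physical disk so the HBA's
# write-back (FBWC) cache is engaged.  storcli syntax and IDs are assumptions.
import subprocess

CONTROLLER = "/c0"          # first MegaRAID-style controller (assumed)
ENCLOSURE = 252             # enclosure ID (assumed)
SLOTS = range(0, 12)        # drive slots to wrap in single-drive VDs (assumed)

for slot in SLOTS:
    cmd = [
        "storcli64", CONTROLLER, "add", "vd", "type=raid0",
        f"drives={ENCLOSURE}:{slot}",
        "wb",       # write-back: this is what actually uses the FBWC
        "nora",     # no read-ahead; assumed preference for OSD workloads
        "direct",   # direct I/O policy
    ]
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)   # uncomment to actually create the VDs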

* My sense was that the FBWC did offer some throughput benefit for at least 
some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the 
presence and status of the BBU/supercap.

* The utility needed for that monitoring, when invoked with ostensibly 
innocuous parameters, would lock up the HBA for several seconds.
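
If you do end up stuck with one of these, the monitoring boils down to something 
like the sketch below.  The tool name, subcommand, and the "Optimal" keyword are 
assumptions (storcli-style); the hard timeout is there precisely because the 
real utility could wedge the HBA for several seconds.

#!/usr/bin/env python3
# Sketch of the BBU/supercap health check described above.  Command syntax and
# output keywords are assumptions; adjust for your vendor's utility.
import subprocess

def supercap_ok(controller="/c0", timeout=30):
    try:
        out = subprocess.run(
            ["storcli64", controller + "/cv", "show"],  # cachevault status (assumed syntax)
            capture_output=True, text=True, timeout=timeout, check=True,
        ).stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        return False   # a hung, failing, or missing utility is itself an alert condition
    return "Optimal" in out    # keyword assumed; match whatever your HBA reports

if __name__ == "__main__":
    print("supercap OK" if supercap_ok() else "supercap DEGRADED or query failed")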

* Traditional BBUs are rated for a lifespan of *only* one year.  FBWCs maybe for 
… three?  Significant cost to RMA or replace them:  time and karma wasted 
fighting with the system vendor CSO, engineer and remote hands time to take the 
system down and swap.  And then the connectors for the supercap were touchy; 
15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost a couple to three hundred 
more than an IT / JBOD equivalent.

* There was a little-known flaw in secondary firmware that caused FBWC / 
supercap modules to be falsely reported bad.  The system vendor acted like I 
was making this up and washed their hands of it, even when I provided them the 
HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when 
a system rebooted or lost power.  There was a field notice for this, which 
required harvesting serial numbers and checking each.  The affected range of 
serials was quite a bit larger than what the validation tool admitted.  I had 
to manage the replacement of 302+ of these in production use, each needing 
engineer time to manage Ceph, to do the hands-on work, and to hassle with RMA 
paperwork.
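
The serial triage itself was the easy part once the serials were harvested; 
something like the toy script below, except that the range the validation tool 
accepted was narrower than what was actually affected.  The bounds and serial 
format here are placeholders, not the real field notice.

#!/usr/bin/env python3
# Toy version of the field-notice serial check.  The range bounds and serial
# format are placeholders (hypothetical), not the actual affected range.
AFFECTED_LOW  = "6B0100AA"     # hypothetical lower bound
AFFECTED_HIGH = "6B0199ZZ"     # hypothetical upper bound

def affected(serial: str) -> bool:
    s = serial.strip().upper()
    # simple lexicographic compare; assumed valid for this serial format
    return AFFECTED_LOW <= s <= AFFECTED_HIGH

if __name__ == "__main__":
    harvested = {"node01": "6B0100AB", "node02": "6B0200CD"}   # example data
    for node, serial in sorted(harvested.items()):
        print(f"{node}: {serial} -> {'REPLACE' if affected(serial) else 'ok'}")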

* There was a firmware / utility design flaw that caused the HDD’s onboard 
volatile write cache to be silently turned on, despite an HBA config dump 
showing a setting that should have left it off.  Again, data was lost when a 
node crashed hard or lost power.
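
The lesson there is to verify the drives' volatile write cache independently 
instead of trusting the HBA config dump.  A rough sketch for ATA drives follows; 
whether `hdparm -W` actually reaches the physical disk through the VD layer 
depends on the controller, so treat it as a starting point.

#!/usr/bin/env python3
# Sketch: query each HDD's volatile write cache state directly with
# `hdparm -W <dev>` (query form) rather than trusting the HBA's config dump.
# Whether this reaches the physical disk behind a VD depends on the controller.
import glob
import re
import subprocess

for dev in sorted(glob.glob("/dev/sd?")):
    out = subprocess.run(["hdparm", "-W", dev],
                         capture_output=True, text=True).stdout
    m = re.search(r"write-caching\s*=\s*(\d+)", out)
    if m is None:
        print(f"{dev}: write cache state unknown")
    elif m.group(1) == "1":
        print(f"{dev}: volatile write cache is ON -- data at risk on power loss")
    else:
        print(f"{dev}: volatile write cache is off")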

* There was another firmware flaw that prevented booting if there was pinned / 
preserved cache data after a reboot / power loss if a drive failed or was 
yanked.  The HBA’s option ROM utility would block booting and wait for input on 
the console.  One could get in and tell it to discard that cache, but it would 
not actually do so, instead looping back to the same screen.  The only way to 
get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD 
deployment / removal / replacement.  A smartctl hack to access SMART attributes 
below the VD layer would work on some systems but not others.
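
For completeness, the smartctl hack in question is the MegaRAID passthrough 
(`-d megaraid,N`); something like the sketch below worked on some of our systems 
and not others.  The device index range and the handle device are assumptions.

#!/usr/bin/env python3
# Sketch of reading SMART data for physical disks hidden behind single-drive
# VDs, via smartctl's MegaRAID passthrough.  Index range and handle device are
# assumptions; smartctl's exit status is a bitmask, so nonzero isn't fatal.
import subprocess

HOST_DEV = "/dev/sda"          # any block device on the controller serves as a handle
for idx in range(0, 12):       # physical device indexes (assumed 0..11)
    r = subprocess.run(
        ["smartctl", "-d", f"megaraid,{idx}", "-H", "-A", HOST_DEV],
        capture_output=True, text=True,
    )
    if "SMART" in r.stdout:    # crude filter for indexes that actually answered
        print(f"--- megaraid,{idx} ---")
        print(r.stdout)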

* The HBA model in question would work normally with a certain CPU generation, 
but not with slightly newer servers with the next CPU generation.  They would 
randomly, on roughly one boot out of five, negotiate PCIe gen3, which they 
weren’t capable of handling properly, and would silently run at about 20% of 
normal speed.  Granted this isn’t necessarily specific to an IR HBA.
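
A quick way to see what link rate was actually negotiated, rather than assuming 
it, is the sysfs link-speed attributes.  A minimal sketch, with filtering to the 
HBA's PCI address left as an assumption:

#!/usr/bin/env python3
# Sketch: compare each PCI device's negotiated link speed against its maximum
# via sysfs, to catch a bad negotiation early instead of discovering it as a
# mysterious ~20% throughput.  Filter to the HBA's address in real use.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    try:
        cur = (dev / "current_link_speed").read_text().strip()
        mx = (dev / "max_link_speed").read_text().strip()
    except OSError:
        continue    # device exposes no PCIe link attributes
    flag = "   <-- mismatch, check this one" if cur != mx else ""
    print(f"{dev.name}: negotiated {cur}, max {mx}{flag}")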



Add it all up, and my assertion is that the money, time, karma, and user impact 
you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for 
OSDs instead.


This is worse than I feared, but very much in the realm of concerns I had with using single-disk RAID0 setups.  Thank you very much for posting your experience!  My money would still be on using *high write endurance* NVMes for DB/WAL and whatever I could afford for block.  I still have vague hopes that in the long run we move away from the idea of distinct block/db/wal devices and toward pools of resources that the OSD makes its own decisions about.  I'd like to be able to hand the OSD a pile of hardware and say "go".  That might mean something like an internal caching scheme, but with slow eviction and initial placement hints (i.e. L0 SST files should nearly always end up on fast storage).
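
(For anyone wanting the concrete version of "NVMe for DB/WAL, whatever you can 
afford for block": it's roughly one ceph-volume invocation per OSD, along these 
lines.  Device names are placeholders; size the DB partitions deliberately.)

#!/usr/bin/env python3
# Sketch of the layout described above: slow/cheap device for block, high
# write-endurance NVMe for DB/WAL.  Device names are placeholders; ceph-volume
# handles the LVM plumbing.
import subprocess

PAIRS = [
    ("/dev/sdb", "/dev/nvme0n1p1"),   # (block device, DB/WAL device) -- placeholders
    ("/dev/sdc", "/dev/nvme0n1p2"),
]

for data_dev, db_dev in PAIRS:
    cmd = ["ceph-volume", "lvm", "create", "--bluestore",
           "--data", data_dev, "--block.db", db_dev]
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)   # uncomment to actually create the OSDs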


If it were structured like the PriorityCacheManager, we'd have SSTs for different column family prefixes (OMAP, onodes, etc.) competing with bluestore for fast BlueFS device storage at different priority levels (so, for example, onode L0 would be very high priority), with independent LRUs for each.  I'm hoping some of Igor's work on SST placement might help make this possible down the road.  On the other hand, maybe crimson, pmem, and cheap high-capacity flash will make all of that less necessary.  I guess we'll find out. :)
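
To make that a bit more concrete, here's a toy sketch (not Ceph code, just a 
thought experiment) of column families competing for a fast-device budget at 
different priority levels, each with its own LRU so low-priority data gets 
pushed to the slow device first.  Names, priorities, and sizes are hypothetical.

#!/usr/bin/env python3
# Toy model: per-column-family LRUs competing for a fast-device budget at
# different priority levels.  Insertion order stands in for recency here.
from collections import OrderedDict

FAST_BUDGET = 8  # units of fast-device capacity, arbitrary
# priority per column family; lower number = higher priority (hypothetical)
PRIORITY = {"onode_L0": 0, "omap": 1, "other": 2}

lrus = {cf: OrderedDict() for cf in PRIORITY}   # per-family LRU of SST -> size
fast_used = 0

def place(cf, sst, size):
    """Place an SST on the fast device, evicting from the lowest-priority LRUs first."""
    global fast_used
    while fast_used + size > FAST_BUDGET:
        # walk families from lowest to highest priority looking for a victim
        for victim_cf in sorted(PRIORITY, key=PRIORITY.get, reverse=True):
            if lrus[victim_cf]:
                old_sst, old_size = lrus[victim_cf].popitem(last=False)  # LRU end
                fast_used -= old_size
                print(f"  evict {victim_cf}/{old_sst} to slow device")
                break
        else:
            print(f"  no room: {cf}/{sst} goes straight to slow device")
            return
    lrus[cf][sst] = size
    fast_used += size
    print(f"  place {cf}/{sst} on fast device ({fast_used}/{FAST_BUDGET} used)")

if __name__ == "__main__":
    for i in range(4):
        place("other", f"sst{i}", 2)
    place("onode_L0", "sst_hot", 4)   # high-priority SST displaces low-priority ones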


Thanks,

Mark






_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com