On 7/25/19 9:27 PM, Anthony D'Atri wrote:
We run a few hundred HDD OSDs for our backup cluster; we set up one RAID 0
per HDD in order to be able to use the battery-protected write cache from
the RAID controller. It really improves performance, for both BlueStore
and FileStore OSDs.
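(For reference, creating such per-drive VDs on a MegaRAID-family controller
looks roughly like the sketch below. The enclosure:slot, adapter, and
controller IDs are placeholders rather than details from the original post,
and exact syntax varies by tool and firmware version.)

  # One single-drive RAID0 VD, write-back + read-ahead, through the
  # controller cache; 252:0 is a placeholder enclosure:slot ID.
  MegaCli64 -CfgLdAdd -r0[252:0] WB RA Direct NoCachedBadBBU -a0

  # Roughly equivalent with the newer storcli tool:
  storcli64 /c0 add vd type=raid0 drives=252:0 wb ra direct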
Having run something like 6000 HDD-based FileStore OSDs with colocated
journals on RAID HBAs, I’d like to offer some contrasting thoughts.
TL;DR: Never again! False economy. ymmv.
Details:
* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone;
try as I might, I could not get it fixed.
* Single-drive RAID0 VDs were created to expose the underlying drives to the
OS. When the architecture was conceived, the HBAs in question didn’t have
JBOD/passthrough, though a firmware update shortly thereafter did bring that
ability. It wasn’t known at the time that the write cache was only usable
with VDs.
* My sense was that the FBWC did offer some throughput benefit for at least
some workloads, but at the cost of latency.
* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the
presence and status of the BBU/supercap
* The utility needed for that monitoring, when invoked with ostensibly
innocuous parameters, would lock up the HBA for several seconds (the query in
question is sketched after this list).
* Traditional BBUs are rated for a lifespan of *only* one year; FBWCs maybe
for … three? Replacing or RMAing them carried significant cost: time and
karma wasted fighting with the system vendor’s CSO, plus engineer and
remote-hands time to take the system down and do the swap. And then the
connectors for the supercap were touchy; 15% of the time the system would
come up and not see it at all.
* The RAID-capable HBA itself + FBWC + supercap cost a couple or three
hundred dollars more than an IT-mode / JBOD equivalent.
* There was a little-known flaw in secondary firmware that caused FBWC /
supercap modules to be falsely reported bad. The system vendor acted like I
was making this up and washed their hands of it, even when I provided them
the HBA vendor’s own artifacts and documents.
* There were two design flaws that could and did result in cache data loss
when a system rebooted or lost power. There was a field notice for this,
which required harvesting serial numbers and checking each; the affected
range of serials was quite a bit larger than what the validation tool
admitted. I had to manage the replacement of 302+ of these in production
use, each needing engineer time to manage Ceph, hands time to do the swap,
and hassle with RMA paperwork.
* There was a firmware / utility design flaw that caused the HDD’s onboard
volatile write cache to be silently turned on, despite an HBA config dump
showing a setting that should have left it off (a way to verify the actual
drive cache state is sketched after this list). Again, data was lost when a
node crashed hard or lost power.
* There was another firmware flaw that prevented booting if there was pinned /
preserved cache data after a reboot / power loss when a drive failed or was
yanked. The HBA’s option ROM utility would block booting and wait for input
on the console. One could get in and tell it to discard that cache (the CLI
equivalent is sketched after this list), but it would not actually do so,
instead looping back to the same screen. The only way to get the system to
boot again was to replace and RMA the HBA.
* The VD layer lessened the usefulness of iostat data. It also complicated
OSD deployment / removal / replacement. A smartctl hack to access SMART
attributes below the VD layer (also shown after this list) would work on some
systems but not others.
* The HBA model in question would work normally with a certain CPU
generation, but not with slightly newer servers using the next CPU
generation. On roughly one boot out of five, they would negotiate PCIe gen3,
which they weren’t capable of handling properly, and would silently run at
about 20% of normal speed (lspci can catch this; see below). Granted, this
isn’t necessarily specific to an IR HBA.
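For anyone operating a similar setup, the checks referenced above look
roughly like the sketch below. This assumes a MegaRAID-family controller
with MegaCli64 and smartctl available; adapter numbers, device paths, PCI
addresses, and megaraid device IDs are placeholders, and exact syntax varies
by firmware and tool version.

  # BBU / supercap presence and health -- the ostensibly innocuous
  # query that could stall our HBAs for several seconds:
  MegaCli64 -AdpBbuCmd -GetBbuStatus -a0

  # Verify the drive's own volatile write cache is really off beneath
  # the VD layer, rather than trusting the HBA config dump:
  smartctl -g wcache -d megaraid,0 /dev/sda

  # Read SMART attributes from the physical drive behind the VD:
  smartctl -a -d megaraid,0 /dev/sda

  # Discard pinned / preserved cache after a crash or yanked drive
  # (on the flawed firmware this claimed success but looped forever):
  MegaCli64 -DiscardPreservedCache -Lall -a0

  # Confirm the HBA negotiated a PCIe link it can actually handle:
  # compare LnkSta (negotiated) against LnkCap (capability).
  lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'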
Add it all up, and my assertion is that the money, time, karma, and user
impact you save by NOT dealing with a RAID HBA *more than pays for* using
SSDs for OSDs instead.
This is worse than I feared, but very much in the realm of concerns I
had with using single-disk RAID0 setups. Thank you very much for
posting your experience! My money would still be on using *high write
endurance* NVMes for DB/WAL and whatever I could afford for block (a
deployment sketch follows at the end of this message). I still have
vague hopes that in the long run we move away from the idea of distinct
block/db/wal devices and toward pools of resources that the OSD makes
its own decisions about. I'd like to be able to hand the OSD a pile of
hardware and say "go". That might mean something like an internal
caching scheme, but with slow eviction and initial placement hints
(i.e., L0 SST files should nearly always end up on fast storage).
If it were structured like the PriorityCacheManager, we'd have SSTs for
different column family prefixes (OMAP, onodes, etc.) competing for fast
BlueFS device storage with bluestore at different priority levels (so,
for example, onode L0 would be very high priority) and independent LRUs
for each. I'm hoping some of Igor's work on SST placement might help
make this possible down the road. On the other hand, maybe crimson,
pmem, and cheap high-capacity flash are going to make all of that less
necessary. I guess we'll find out. :)
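For completeness, a minimal sketch of the ceph-volume invocation for that
kind of NVMe DB/WAL layout; the device paths are placeholders and partition
sizing is omitted. When --block.wal isn't given, the WAL lives with the DB.

  # BlueStore OSD: data on an HDD, RocksDB DB (and therefore WAL) on
  # a partition of a high-write-endurance NVMe device:
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db /dev/nvme0n1p1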
Thanks,
Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com