> 
>> Hi Michael,
>> 
>> So challenges with Veeam backups are more around metadata.

It’s conventional wisdom that Veeam uses tiny blocks (S3 objects) by default 
and that one should enable a “large block” setting.

>> I do hope when you said enterprise SSD’s you meant NVME’s?

With careful procurement, SAS and SATA SSDs cost about the same as NVMe SSDs, 
so NVMe is usually the better buy, but SAS/SATA SSDs aren’t necessarily fatal.  
Remember that NVMe devices *are* SSDs.  No CRTs or gamma ferric oxide in play.


>> By default Veeam will write everything into one bucket. The problem with
>> this depending on the size of your environment is you will get into bucket
>> sharding issues. You can also get issues with large OMAP objects.

In a metadata-heavy deployment with a relatively small number of large SSDs, it 
would seem advantageous to still split each drive into 2-3 OSDs, just to get 
more RocksDB instances across which to spread omaps.  Also: avoid versioning 
when at all possible, and use a recent Ceph release (Quincy or later, IIRC) 
that shards RocksDB across column families, with legacy OSDs retrofitted via 
manual resharding.  The recently implemented RocksDB compression should help 
too.
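A rough sketch of what that could look like operationally.  Device paths and 
OSD ids are illustrative, and the sharding string below is the default 
documented for recent Ceph releases; check the docs for your release before 
running anything, and reshard only with the OSD stopped:

```shell
# Carve each large NVMe into two OSDs to get more RocksDB instances
# (adjust device paths to your hosts):
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

# Retrofit a legacy (stopped!) OSD to the sharded-column-family RocksDB
# layout; the sharding string is the default from the Ceph documentation:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 \
  --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
  reshard
```

Resharding rewrites the OSD’s RocksDB in place, so do it one OSD at a time.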

>> Newer versions of Veeam do allow for using multiple buckets at the backend
>> which I would strongly suggest you use.

Agreed.  Oversized buckets can be difficult.


>> I would strongly suggest you put in some additional NVME’s to allow you to
>> move the S3 metadata onto these devices to allow for better performance.

The OP described that already:

>>> - Enterprise SSDs for bluestore_block.db and bluestore_block.wal
>>> (metadata and journal)

Remember that omaps are (currently) stored in RocksDB.  This approach can 
actually be advantageous for RGW metadata, since it spreads the bucket indexes 
across a much larger number of RocksDB instances.


>>> I'm currently planning a Ceph-based S3 storage backend, with a strong
>>> focus on balancing performance and cost (EUR per TB).

Plug your numbers into the SNIA TCO calculator.

https://www.snia.org/forums/cmsi/programs/TCOcalc

>>> Fully SSD-based setups seem too expensive for my use case

The TCO might surprise you.  Consider having to procure, manage, and house a 
greater number of chassis.  RUs for most of us aren’t free, nor are ToRs.

A chassis built to handle LFF HDDs will usually be taller than an SFF/EDSFF 
SSD-only server, and you’ll likely be fussing with brittle adapters to stick a 
small SSD into that big, space-wasting slot.

Toploaders can suffer from HBA and backplane saturation, not to mention a 
thundering herd when recovery happens.  Plan for peak usage.

Don’t just look at the unit economics of the drive, look at the holistic 
economics of the *drive bay*, i.e. everything else.
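To make “economics of the drive bay” concrete, here’s a toy per-bay 
amortization in Python.  Every number below is a made-up placeholder, not a 
quote; the point is only the shape of the calculation, which the SNIA 
calculator above does properly:

```python
# Back-of-the-envelope cost per usable TB: the drive itself plus the fully
# loaded cost of the bay it occupies (chassis share + rack-unit share).
# All prices are hypothetical placeholders -- plug in your own quotes.

def cost_per_tb(drive_price: float, drive_tb: float,
                chassis_price: float, bays: int,
                ru_per_chassis: float, ru_cost_lifetime: float) -> float:
    """Amortize chassis and RU cost across each bay, then divide by TB."""
    per_bay = chassis_price / bays + ru_per_chassis * ru_cost_lifetime / bays
    return (drive_price + per_bay) / drive_tb

# Hypothetical numbers only:
hdd = cost_per_tb(400, 24, 6000, 12, 2, 3000)    # 24T HDD, 2U/12-bay LFF box
ssd = cost_per_tb(2500, 61, 9000, 24, 1, 3000)   # 61T QLC SSD, 1U/24-bay EDSFF
print(round(hdd, 2), round(ssd, 2))              # 58.33 49.18
```

With these (again, invented) inputs the denser SSD chassis already wins per 
usable TB before you count power, small-object performance, or recovery time.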

Today’s 30TB spinners (be sure to avoid SMR!) use the same tired interface as 
3TB drives a dozen years ago.  Bottleneck city.

If you’d spend more on SAS spinners, that money is better spent on modern 
media, especially since SAS requires an add-in-card HBA, which most likely 
will be RAID-enabled, with FBWC and a BBU.  These can easily have a list price 
of USD 2000.

Universal drive bays cost more too.  SAS is distinctly sunsetting in the 
enterprise SSD market, and SATA isn’t far behind.  Consider that over the 
(let’s be realistic) 10-year lifetime of servers purchased today.

Be careful of the trap of deploying a small number of ultradense toploaders.  
And don’t make the mistake of an overly wide EC profile: 8+3 means that every 
write ties up 11 drives, and you want at least 12 failure domains.  This 
affects recovery and scrubs.
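The 8+3 arithmetic, spelled out (plain EC math, nothing Ceph-specific):

```python
# Raw-capacity overhead and minimum failure-domain count for a k+m
# erasure-coded pool.  Generic arithmetic, not output of any Ceph tool.

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m EC profile."""
    return (k + m) / k

def min_failure_domains(k: int, m: int, spare: int = 1) -> int:
    """k+m shards need k+m failure domains; keep at least one spare
    so recovery has somewhere to rebuild after a loss."""
    return k + m + spare

print(ec_overhead(8, 3))          # 1.375x raw per usable TB
print(min_failure_domains(8, 3))  # 12
```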

A deployment that does not meet your needs is a false economy.

A deployment with 30-122TB QLC SSDs will be rather more RU-dense, saving on DC 
spend, and much better able to handle small object hotspots and the increased 
demands of recovery.  If a failed drive takes you two weeks to recover from, 
that’s two weeks of increased risk of data unavailability or loss.


>>> , while HDD-only configurations might be too slow.

Generally, yes.  Especially if you have lots of small S3 objects.  And 
performance when you test with 1% full can be very different from down the road 
when you’re 60% full and fragmented.

>>> 
>>> My current idea is to use a hybrid approach:
>>> - HDDs for bluestore_block (bulk data)
>>> - Enterprise SSDs for bluestore_block.db and bluestore_block.wal
>>> (metadata and journal)

Not exactly a journal, which is why it isn’t called one, but that’s a nit.

>>> 
>>> The main client of this S3 backend will be Veeam, which, as far as I
>>> can tell, uses larger block sizes.

My understanding is that it needs to be configured to.

>>> Since this results in fewer random writes and mostly sequential I/O,

Ceph OSDs usually experience random IO regardless: the IO blender effect.  
Unless, maybe, you only have a single client at a time.

>>> I assume this kind of workload can work well with HDDs — as long as
>>> metadata is fast (hence SSDs for DB/WAL).

Remember that hybrid OSDs may improve metadata performance and small-write 
latency, but not necessarily throughput.

>>> 
>>> - What SSD-to-HDD ratio would you recommend? (I’ve seen 1 SSD per 3–5 HDDs)

Conventional wisdom is at most 5 per SAS/SATA SSD, 10 per NVMe SSD.
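With cephadm, that ratio falls out of an OSD service spec along these lines.  
A sketch only, assuming rotational HDDs for data and non-rotational devices 
for DB/WAL; cephadm divides the db_devices among the data_devices 
automatically:

```yaml
service_type: osd
service_id: hybrid-hdd-nvme   # illustrative name
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1   # HDDs get the bluestore block (bulk) device
  db_devices:
    rotational: 0   # SSDs/NVMe get DB+WAL
    limit: 2        # e.g. 2 NVMe per host, each serving up to ~10 HDDs
```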


>> https://www.hyperscalers.com/image/catalog/00-Products/Storage%20Servers/White%20Paper%20-%20Performance%20Testing%20of%20Ceph%20with%20SD1Q-1ULH.pdf
>>> => which mentions Ceph version 0.94.5 and the performance boost does

Copyright 2015.  I think I saw QCT present this at a Ceph Day back in the mists 
of time.

Hammer does not in any way reflect the modern experience.  Filestore OSDs, no 
Manager factored out of the mons yet, etc.

>>> not seem too big here
>>> 
>>> And here:
>>> https://www.ambedded.com.tw/en/use-case/use-case-08.html
>>> they dont mention such a Hybrid Setup

It’s …novel.

>>> 
>>> Thanks a lot in advance!
>>> 
>>> Best regards,
>>> Michael
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> -- 
> The “UTF-8 problems” self-help group will meet this time, as an exception,
> in the groüen hall.
