Hi, I asked a similar question about increasing scrub throughput some time ago 
and couldn't get a fully satisfying answer: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NHOHZLVQ3CKM7P7XJWGVXZUXY24ZE7RK

My observation is that far fewer (deep) scrubs are scheduled than could be 
executed. Some people wrote scripts to do scrub scheduling in a more efficient 
way (by last-scrub time stamp), but I don't want to go down this route (yet). 
Unfortunately, the thread above does not contain the full conversation; I think 
it forked into a second one with the same or a similar title.

About performance calculations, along the lines of

> they were never setup to have enough IOPS to support the maintenance load,
> never mind the maintenance load plus the user load

> initially setup, it is nearly empty, so it appears
> to perform well even if it was setup with inexpensive but slow/large
> HDDs, then it becomes fuller and therefore heavily congested

There is a bit more to that. HDDs have the unfortunate property that sector 
read/write speed depends on where on the platter the sector sits. An empty 
drive serves IO from the beginning of the disk, where everything is fast. As 
drives fill up, they start using slower and slower regions. This performance 
degradation comes on top of the effects of longer seek paths and 
fragmentation.

Here I'm talking only about enterprise data centre drives with proper 
sustained performance profiles, not cheap stuff that falls apart once you get 
serious.

Unfortunately, Ceph adds the lack of tail-merging support on top of that, 
which makes small objects extra expensive.

Still, Ceph was written for HDDs and actually performs well if IO calculations 
are done properly. Take, for example, 8TB vs. 18TB drives. 8TB drives start 
with about 150MB/s bandwidth in the fast region and slow down to 80-100MB/s 
when you reach the end. 18TB drives are not just 8TB drives with denser 
packing, they actually have more platters. That means they start out at 
250MB/s and reach something like 100-130MB/s towards the end. That's more than 
double the capacity, but not more than double the throughput. IOP/s are 
roughly the same, so IOP/s per TB go down a lot with capacity.
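
As a back-of-the-envelope illustration (the ~150 random IOP/s per spindle 
below is an assumption of mine for illustration, not a datasheet value):

# Rough per-TB comparison of the drive sizes above. Sequential numbers are
# from the text; the random IOP/s per spindle is an assumed ballpark.
drives = {
    "8TB":  {"capacity_tb": 8,  "seq_mb_s": (150, 90),  "iops": 150},
    "18TB": {"capacity_tb": 18, "seq_mb_s": (250, 115), "iops": 150},
}

for name, d in drives.items():
    cap = d["capacity_tb"]
    fast, slow = d["seq_mb_s"]
    print(f"{name}: {fast / cap:.1f} down to {slow / cap:.1f} MB/s per TB, "
          f"{d['iops'] / cap:.1f} IOP/s per TB")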

When is this fine and when is it problematic? It's fine if you have large 
objects that are never modified. Then Ceph will usually reach sequential 
read/write performance and scrubbing will be done within a week (at less than 
10% disk utilisation, which is good). The other extreme is many small objects, 
in which case your observed performance/throughput can be terrible and 
scrubbing might never end.
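
A quick illustration in numbers; the effective throughputs and the 10% share 
of disk time given to scrubbing are assumptions for illustration:

# Rough deep-scrub duration estimate for one OSD. Scrub only reads stored
# data, and effective throughput depends heavily on object size:
# near-sequential for large objects, seek-bound for many small ones.
def scrub_days(stored_tb, effective_mb_s, scrub_fraction=0.10):
    """Days to deep-scrub stored_tb once, using scrub_fraction of disk time."""
    return stored_tb * 1e6 / (effective_mb_s * scrub_fraction) / 86400

# 18TB drive, 30% full. All numbers are illustrative, not measurements.
print(f"large objects (~200 MB/s): {scrub_days(5.4, 200):.1f} days")  # ~3 days
print(f"small objects (~5 MB/s):   {scrub_days(5.4, 5):.0f} days")    # ~125 days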

To be able to make reasonable estimates, you need to know real-life object 
size distributions and whether full-object writes are effectively sequential 
(meaning you have large bluestore allocations in general; look at the 
bluestore performance counters, they indicate how many large and how many 
small writes you have).
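
For reference, a small sketch of how one might pull those counters from an 
OSD's admin socket. The counter names bluestore_write_big and 
bluestore_write_small are what I would expect in the "bluestore" section of 
"perf dump"; check what your release actually exposes:

#!/usr/bin/env python3
# Sketch: ratio of large vs. small bluestore writes on one OSD.
# Run on the OSD host; needs access to the admin socket.
import json
import subprocess

osd = "osd.0"  # pick one of your OSDs
perf = json.loads(subprocess.check_output(
    ["ceph", "daemon", osd, "perf", "dump"]))
bs = perf["bluestore"]

big = bs["bluestore_write_big"]      # writes that took the big (aligned) path
small = bs["bluestore_write_small"]  # small writes, expensive on HDD
total = max(big + small, 1)
print(f"{osd}: {big} big, {small} small "
      f"({100.0 * small / total:.1f}% small writes)")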

We have a fairly mixed size distribution with, unfortunately, quite a high 
percentage of small objects on our CephFS. We do have 18T drives, which are 
about 30% utilised. Scrubbing still finishes in less than 2 weeks, even with 
the outliers caused by the "not ideal" scrub scheduling (thread above). I'm 
willing to accept up to 4 weeks of tail time, which will probably get me to 
50-60% utilisation before things drop below acceptable.

In essence, the 18T average-performance drives are something like 10T 
pretty-good-performance drives compared with the usual 8T drives. You just 
have to let go of 100% capacity utilisation. The limit is whatever comes 
first, capacity or IOP/s saturation. Once the admin workload can no longer 
complete in time, that's it: the disks are full and one needs to expand.

We have about 900 HDDs in our cluster and I maintain this large number mostly 
for performance reasons. I don't think I will ever see more than 50% 
utilisation before we change deployment or add drives.

Looking at our data in more detail, most of it is ice cold. Therefore, in the 
long run we plan to go for tiered OSDs (bcache/dm-cache) with enough total SSD 
capacity to hold about twice the hot data. Then, maybe, we can fill the big 
drives a bit more.

I was looking into large-capacity SSDs and, I'm afraid, once you get to the 
>=18TB SSD segment they either have poor performance, often worse than 
spinners, or are massively expensive. By performance I mean bandwidth here. 
Large SSDs can have a sustained bandwidth of just 30MB/s. They will still do 
about 500-1000 IOP/s per TB, but large file transfers or backfill will become 
a pain.
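
To illustrate why, a quick number for refilling one such drive at that 
bandwidth:

# How long does backfilling an 18TB drive take at 30 MB/s sustained?
capacity_tb = 18
bandwidth_mb_s = 30      # sustained write bandwidth of the SSD in question
days = capacity_tb * 1e6 / bandwidth_mb_s / 86400
print(f"{days:.1f} days")  # roughly a week for a single drive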

I looked at models with reasonable bandwidth and asked if I could get a 
price. The answer was that one such disk costs more than one of our standard 
storage servers in its entirety. Clearly not our league. A better solution is 
to combine the best of both worlds and have more intelligent software that can 
differentiate between hot and cold data and may be able to adapt to workloads.

> the best that can be said about those HDDs is that they should
> be considered "tapes" with some random access ability

Which is good if that is all you need. But it's true, a lot of people forget 
that using an 8+3 EC profile on a pool will divide the aggregated IOP/s budget 
by 11. After that, divide by 2 and you have a number to tell your users/boss. 
They are either happy or give you more money.
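
With our own numbers, that back-of-the-envelope calculation goes roughly like 
this (the per-spindle IOP/s is an assumption for illustration):

# Client-visible IOP/s budget for an 8+3 EC pool, back of the envelope.
osds = 900            # spindles in the cluster
iops_per_osd = 150    # assumed per-HDD random IOP/s, not a measurement
k, m = 8, 3           # EC profile 8+3

aggregate = osds * iops_per_osd    # 135,000 raw IOP/s
per_pool = aggregate / (k + m)     # divide by 11: ~12,300
tell_the_boss = per_pool / 2       # divide by 2:  ~6,100
print(f"{tell_the_boss:,.0f} IOP/s for the users")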

Our users also think in terms of price/TB only. I simply incorporate 
performance into the calculation and come up with a price per *usable* TB. Raw 
capacity includes admin overhead (which includes IOP/s headroom), which can 
easily be 50% in total, plus the replication overhead. Just let go of 100% 
capacity utilisation and you will have a well-working cluster. I let go of 50% 
utilisation; that's when I start requesting material, and it works really 
well. Still much cheaper than an all-flash installation at higher utilisation.
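
As an illustration of that accounting (the raw price is a made-up example 
value, the overheads follow the reasoning above):

# Price per usable TB instead of price per raw TB. Inputs are illustrative.
price_per_raw_tb = 20.0   # $/TB raw, made-up example number
k, m = 8, 3               # EC profile -> capacity efficiency k/(k+m)
utilisation_cap = 0.50    # how full I allow the cluster to get

usable_fraction = utilisation_cap * k / (k + m)                     # ~0.36
print(f"${price_per_raw_tb / usable_fraction:.0f} per usable TB")   # ~$55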

To the all-flash enthusiasts: yes, we have all-flash pools and I do enjoy 
their performance. Still, the price. There are people who say platters are 
outdated and SSDs are competitive. Well, maybe my google-fu is not good 
enough, so here we go: if you show me where I can get SSDs with the specs 
below, I will go all-flash. Until then, sorry, cost economy is still a thing.

Specs A:

- capacity: 18TB+
- sustained 1M block-size sequential read/write (iodepth=1): 15MB/s per TB
- sustained 4K random 50/50 read-write (iodepth=1): 100 IOP/s
- data written per day for 5 years: 1TB (yes, this *is* very low yet sufficient)
- interface: SATA/SAS, 2.5" or 3.5"
- price: <=350$ (for 18TB)

Specs B:

- capacity: 18TB+
- sustained 1M block-size sequential read/write (iodepth=1): 25MB/s per TB
- sustained 4K random 50/50 read-write (iodepth=1): 1000 IOP/s
- data written per day for 5 years: 1TB (yes, this *is* very low yet sufficient)
- interface: SATA/SAS, 2.5" or 3.5"
- price: <=700$ (for 18TB)
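
For completeness, here is a sketch of how I would verify those two performance 
lines with fio (wrapped in Python to keep it in one place). The device path is 
a placeholder and the test is destructive to the target device:

#!/usr/bin/env python3
# Sketch: measure the sequential and random spec lines above with fio.
# DESTRUCTIVE on the target device; /dev/sdX is a placeholder.
import subprocess

DEV = "/dev/sdX"  # replace with the drive under test

def run_fio(name, rw, bs):
    subprocess.run([
        "fio", f"--name={name}", f"--filename={DEV}",
        "--direct=1", "--ioengine=libaio", "--iodepth=1",
        f"--rw={rw}", "--rwmixread=50", f"--bs={bs}",
        "--runtime=600", "--time_based", "--group_reporting",
    ], check=True)

run_fio("seq-1M", "rw", "1M")       # sustained 1M sequential mixed read/write
run_fio("rand-4k", "randrw", "4k")  # sustained 4K random 50/50 read/write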

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi <p...@ceph.list.sabi.co.uk>
Sent: Thursday, April 27, 2023 11:55 AM
To: list fs Ceph
Subject: [ceph-users] Re: Deep-scrub much slower than HDD speed

 > On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
 > numbers already divided by replication factor), you need 55 days
 > to scrub it once.
 > That's 8x larger than the default scrub factor [...] Also, even
 > if I set the default scrub interval to 8x larger, my disks
 > will still be thrashing seeks 100% of the time, affecting the
 > cluster's  throughput and latency performance.

Indeed! Every Ceph instance I have seen (not many) and almost every HPC
storage system I have seen have this problem, and that's because they
were never setup to have enough IOPS to support the maintenance load,
never mind the maintenance load plus the user load (and as a rule not
even the user load).

There is a simple reason why this happens: when a large Ceph (etc.)
storage instance is initially setup, it is nearly empty, so it appears
to perform well even if it was setup with inexpensive but slow/large
HDDs, then it becomes fuller and therefore heavily congested but whoever
set it up has already changed jobs or been promoted because of their
initial success (or they invent excuses).

A figure-of-merit that matters is IOPS-per-used-TB, and making it large
enough to support concurrent maintenance (scrubbing, backfilling,
rebalancing, backup) and user workloads. That is *expensive*, so in my
experience very few storage instance buyers aim for that.

The CERN IT people discovered long ago that quotes for storage workers
always used very slow/large HDDs that performed very poorly if the specs
were given as mere capacity, so they switched to requiring a different
metric, 18MB/s transfer rate of *interleaved* read and write per TB of
capacity, that is at least two parallel access streams per TB.

https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write
per TB is high enough to support simultaneous maintenance and user loads
for most Ceph instances, especially in HPC.

I have seen HPC storage systems "designed" around 10TB and even 18TB
HDDs, and the best that can be said about those HDDs is that they should
be considered "tapes" with some random access ability.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io