> > Not surprising for HDDs.  Double your deep-scrub interval.
> 
> Done!

If your PG-per-OSD ratio is low, say under 200, bumping pg_num may help as 
well.  Looking back at the gist from your earlier message, you average around 
70 PG replicas per OSD.  Aim for 200.

Your index pool has waaaaay too few PGs.  Set its pg_num to 1024.  I’d jack up 
your buckets.data pool to at least 8192 as well.  If you do any multipart 
uploads (MPU) at all, I’d raise non-ec to 512 or 1024.
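
Something along these lines, assuming the default zone pool names from your 
`ceph df` output, with the targets adjusted to taste.  On recent releases you 
only set pg_num; the mgr walks pgp_num up behind it gradually:

    # illustrative targets, not gospel
    ceph osd pool set default.rgw.buckets.index pg_num 1024
    ceph osd pool set default.rgw.buckets.data pg_num 8192
    ceph osd pool set default.rgw.buckets.non-ec pg_num 512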

> 
> > So you’re relying on the SSD DB device for the index pool?  Have you looked 
> > at your logs / metrics for those OSDs to see if there is any spillover?
> > What type of SSD are you using here?  And how many HDD OSDs do you have 
> > using each? 
> 
> I will try to describe the system as best I can. We are talking about 18 
> different hosts. Each host has a large number of HDDs and a small number of 
> SSDs (4).
> Out of these SSDs, 2 are used as the backend of a high-speed volume-ssd 
> pool that certain VMs write into, and the other 2 are split into very large 
> LVM partitions, which act as the journal for the HDDs.

As I suspected.

> I have amended the gist to add that extra information from lsblk. I have not 
> added any information regarding disk models etc. But off the top of my head, 
> each HDD should be about 16T in size, and the NVMe is also extremely large 
> and built for high-I/O systems.

There are NVMe devices available that are decidedly not suited for this 
purpose.  The usual rule of thumb I’ve seen for TLC-class NVMe WAL+DB devices 
is a ratio of at most 10:1 to spinners.  You seem to be at 21:1.

> Each of the db_devices, as you can see in the lsblk output, is extremely 
> large, so I think there is no spillover.

675GB is the largest WAL+DB partition I've ever seen.
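
Still worth a quick check rather than assuming, though.  A sketch — the health 
warning and the counter names are from memory and can vary a bit by release:

    # cluster-wide: spillover surfaces as a BLUEFS_SPILLOVER health warning
    ceph health detail | grep -i spillover
    # per OSD, on the OSD host (<id> is a placeholder); nonzero slow_used_bytes
    # under bluefs means the DB has spilled onto the HDD
    ceph daemon osd.<id> perf dump | grep -E 'db_used_bytes|slow_used_bytes'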

> 
> > Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, 
> > then yeah any metadata ops are going to be dog slow.  Check that your OSDs 
> > actually do have external SSD DBs — it’s easy over the OSD lifecycle to 
> > deploy that way initially but to inadvertently rebuild OSDs without the 
> > external device.
> 
> I will investigate

`ceph osd metadata` piped through a suitable grep may show whether you have 
OSDs that aren’t actually using the offboard WAL+DB partition.
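
For example — the metadata field names here are from memory, so treat this as 
a sketch and adjust to whatever your release actually emits:

    # count OSDs that report a dedicated DB device
    ceph osd metadata | grep -c '"bluefs_dedicated_db": "1"'
    # or, if jq is handy, list the OSD ids that lack one
    ceph osd metadata | jq -r '.[] | select(.bluefs_dedicated_db != "1") | .id'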

> and I will start by planning a new pg bump for the volumes pool, which takes 
> forever due to the size of the cluster

It takes forever because you have spinners ;). And because with recent Ceph 
releases the cluster throttles the (expensive) PG splitting to prevent DoS.  
Splitting all the PGs at once can be … impactful.
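
If you want to watch it chew through the split (pool name taken from your 
output; the mgr option name is from memory):

    # pgp_num creeps up toward pg_num as the misplaced data drains
    ceph osd pool get volumes pg_num
    ceph osd pool get volumes pgp_num
    # the pace is governed by the mgr's target_max_misplaced_ratio (default 0.05)
    ceph config get mgr target_max_misplaced_ratio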

>  AND somehow move the index pool to an SSD device before bumping.

Is it only on dedicated NVMes right now?  Which would be what, 36 OSDs?  

With your WAL+DB SSDs at a 21:1 ratio, using them for the index pool instead 
(or in addition) may or may not improve your performance, but you could always 
move back.
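
If you do try it, the move is just a CRUSH rule switch and is reversible.  A 
minimal sketch, assuming your flash OSDs report the `ssd` device class (check 
`ceph osd tree`); the rule name is arbitrary:

    # replicated rule constrained to ssd-class OSDs, host failure domain
    ceph osd crush rule create-replicated rgw-index-ssd default host ssd
    # point the index pool at it; you can point it back later the same way
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd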

> All this is excellent advice which I thank you for.
> 
> I would like now to ask your opinion on the original query, 
> 
> Do you think that there is some palpable difference between 1 bucket with 10 
> million objects, and 10 buckets with 1 million objects each?

Depends on what you’re measuring.  I suspect the second case would list bucket 
contents faster.
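
Back-of-the-envelope, with the default of 100,000 objects per shard for 
dynamic resharding: one bucket holding 10 million objects ends up with on the 
order of 100 index shards, while ten buckets of 1 million objects land around 
10 shards each, so any single listing or index update touches smaller omaps.  
You can see where a bucket stands with something like (flags from memory):

    # per-bucket shard count and object totals
    radosgw-admin bucket stats --bucket=<bucket-name>
    # fill level of each bucket's index shards
    radosgw-admin bucket limit check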

> Intuitively, I feel that the first case would mean interacting with far 
> fewer pgs than the second (10 times fewer?), which spreads the load across 
> more devices, but my knowledge of ceph internals is nearly zero.
> 
> 
> Regards,
> Harry
> 
> 
> 
> On Tue, Oct 15, 2024 at 4:26 PM Anthony D'Atri <anthony.da...@gmail.com> 
> wrote:
>> 
>> 
>> > On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkomi...@gmail.com> wrote:
>> > 
>> > Hello Anthony and thank you for your response!
>> > 
>> > I have placed the requested info in a separate gist here:
>> > https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885
>> 
>> > 3826 pgs not deep-scrubbed in time
>> > 1501 pgs not scrubbed in time
>> 
>> Not surprising for HDDs.  Double your deep-scrub interval.
>> 
>> > Every OSD is an HDD, with its corresponding index on a partition of an
>> > SSD device.
>> 
>> 
>> So you’re relying on the SSD DB device for the index pool?  Have you looked 
>> at your logs / metrics for those OSDs to see if there is any spillover?
>> 
>> What type of SSD are you using here?  And how many HDD OSDs do you have 
>> using each?
>> 
>> 
>> > And we are talking about 18 separate devices, with separate
>> > cluster_network for the rebalancing etc.
>> 
>> 
>> 18 separate devices?  Do you mean 18 OSDs per server?  18 servers?  Or the 
>> fact that you’re using 18TB HDDs?
>> 
>> > The index for the RGW is also on an HDD (for now).
>> 
>> Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, then 
>> yeah any metadata ops are going to be dog slow.  Check that your OSDs 
>> actually do have external SSD DBs — it’s easy over the OSD lifecycle to 
>> deploy that way initially but to inadvertently rebuild OSDs without the 
>> external device.  
>> 
>> > Now as far as the number of pgs is concerned, I reached that number,
>> > through one of the calculators that are found online.
>> 
>> You’re using the autoscaler, I see.  
>> 
>> In your `ceph osd df` output, look at the PGS column at right.  Your 
>> balancer seems to be working fairly well.  Your average number of PG 
>> replicas per OSD is around 71, which is in alignment with upstream guidance. 
>>  
>> 
>> But I would suggest going twice as high.  See the very recent thread about 
>> PGs.  So I would adjust pg_num on pools in accordance with their usage and 
>> needs so that the PGS column there ends up in the 150 - 200 range.
>> 
>> > Since the cluster is doing Object store, Filesystem and Block storage, 
>> > each pool has a different
>> > number for pg_num.
>> > In the RGW Data case, the pool has about 300TB in it , so perhaps that
>> > explains that the pg_num is lower than what you expected ?
>> 
>> Ah, mixed cluster.  You shoulda led with that ;)
>> 
>> default.rgw.buckets.data 356.7T 3.0 16440T 0.0651 1.0 4096 off False
>> default.rgw.buckets.index 5693M 3.0 16440T 0.0000 1.0 32 on False
>> default.rgw.buckets.non-ec 62769k 3.0 418.7T 0.0000 1.0 32 
>> volumes 8 16384 2.4 PiB 650.08M 7.2 PiB 53.80 2.1 PiB
>> 
>> You have three pools with appreciable data — the two RBD pools and your 
>> bucket pool.  Your pg_nums are more or less reflective of that, which is 
>> general guidance.
>> 
>> But the index pool is not about data or objects stored.  The index pool is 
>> mainly omaps, not RADOS objects, and needs to be resourced differently.
>> Assuming that all 978 OSDs are identical media?  Your `ceph df` output 
>> though implies that you have OSDs on SSDs, so I’ll again request info on the 
>> media and how your OSDs are built.
>> 
>> 
>> Your index pool has only 32 PGs.  I suggest setting pg_num for that pool to, 
>> say, 1024.  It’ll take a while to split those PGs and you’ll see pgp_num 
>> slowly increasing, but when it’s done I strongly suspect that you’ll have 
>> better results.
>> 
>> The non-ec pool is, AIUI, mainly used for multipart uploads.  If your S3 
>> objects are 4MB in size it probably doesn’t matter.  If you do start using 
>> MPU you’ll want to increase pg_num there too.
>> 
>> 
>> > 
>> > Regards,
>> > Harry
>> > 
>> > 
>> > 
>> > On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri <anthony.da...@gmail.com> 
>> > wrote:
>> > 
>> >> 
>> >> 
>> >>> Hello Ceph Community!
>> >>> 
>> >>> I have the following very interesting problem, for which I found no clear
>> >>> guidelines upstream so I am hoping to get some input from the mailing
>> >> list.
>> >>> I have a 6PB cluster in operation which is currently half full. The
>> >> cluster
>> >>> has around 1K OSD, and the RGW data pool  has 4096 pgs (and pgp_num).
>> >> 
>> >> Even without specifics I can tell you that pg_num is waaaaaaaaaaaaaay too
>> >> low.
>> >> 
>> >> Please send
>> >> 
>> >> `ceph -s`
>> >> `ceph osd tree | head -30`
>> >> `ceph osd df | head -10`
>> >> `ceph -v`
>> >> 
>> >> Also, tell us what media your index and bucket OSDs are on.
>> >> 
>> >>> The issue is as follows:
>> >>> Let's say that we have 10 million small objects (4MB) each.
>> >> 
>> >> In RGW terms, those are large objects.  Small objects would be 4KB.
>> >> 
>> >>> 1)Is there a performance difference *when fetching* between storing all
>> >> 10
>> >>> million objects in one bucket and storing 1 million in 10 buckets?
>> >> 
>> >> Larger buckets will generally be slower for some things, but if you’re on
>> >> Reef, and your bucket wasn’t created on an older release, 10 million
>> >> shouldn’t be too bad.  Listing larger buckets will always be increasingly
>> >> slower.
>> >> 
>> >>> There
>> >>> should be "some" because of the different number of pgs in use, in the 2
>> >>> scenarios but it is very hard to quantify.
>> >>> 
>> >>> 2) What if I have 100 million objects? Is there some theoretical limit /
>> >>> guideline on the number of objects that I should have in a bucket before
>> >> I
>> >>> see performance drops?
>> >> 
>> >> At that point, you might consider indexless buckets, if your
>> >> client/application can keep track of objects in its own DB.
>> >> 
>> >> With dynamic sharding (assuming you have it enabled), RGW defaults to
>> >> 100,000 objects per shard and 1999 max shards, so I *think* that after 
>> >> 199M
>> >> objects in a bucket it won’t auto-reshard.
>> >> 
>> >>> I should mention here that the contents of the bucket *never need to be
>> >>> listed*. The user always knows how to do a curl, to get the contents.
>> >> 
>> >> We can most likely improve your config, but you may also be a candidate
>> >> for an indexless bucket.  They don’t get a lot of press, and I won’t claim
>> >> to be expert in them, but it’s something to look into.
>> >> 
>> >> 
>> >>> 
>> >>> Thank you for your help,
>> >>> Harry
>> >>> 
>> >>> P.S.
>> >>> The following URLs have been very informative, but they do not answer my
>> >>> question unfortunately.
>> >>> 
>> >>> 
>> >> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
>> >>> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
>> >> 
>> >> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
