> > Not surprising for HDDs. Double your deep-scrub interval.
>
> Done!
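For the record, that’s a one-liner; a sketch, assuming the stock 7-day
(604800 s) default for osd_deep_scrub_interval, which is also what the
"not deep-scrubbed in time" warning keys off:

    # double the deep-scrub interval cluster-wide, 7 -> 14 days
    ceph config set osd osd_deep_scrub_interval 1209600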
If your PG ratio is low, say <200, bumping pg_num may help as well. Oh yeah,
looking up your gist from a prior message, you average around 70 PG replicas
per OSD. Aim for 200. Your index pool has waaaaay too few PGs: set its pg_num
to 1024. I’d jack up your buckets.data pool to at least 8192 as well, and if
you do any MPU at all, I’d raise non-ec to 512 or 1024. There’s a command
sketch at the bottom of my reply.

> > So you’re relying on the SSD DB device for the index pool? Have you looked
> > at your logs / metrics for those OSDs to see if there is any spillover?
> >
> > What type of SSD are you using here? And how many HDD OSDs do you have
> > using each?
>
> I will try to describe the system as best I can. We are talking about 18
> different hosts. Each host has a large number of HDDs, and a small number
> of SSDs (4).
> Out of these SSDs, 2 are used as the backend for a high-speed volume-ssd
> pool that certain VMs write into, and the other 2 are split into very large
> LVM partitions, which act as the journal for the HDDs.

As I suspected.

> I have amended the gist to add that extra information from lsblk. I have not
> added any information regarding disk models etc. But from the top of my head,
> each HDD should be about 16T in size, and the NVMe is also extremely large
> and built for high-I/O systems.

There are NVMe devices available that decidedly are not suited for this
purpose. The usual rule of thumb that I’ve seen when using TLC-class NVMe
WAL+DB devices is a max ratio of 10:1 to spinners. You seem to have 21:1.

> Each db_devices, if you see in the lsblk, is extremely large, so I think
> there is no spillover.

675GB is the largest WAL+DB partition I’ve ever seen.

> > Uggh. If the index pool is entirely on HDDs, with no SSD DB partition,
> > then yeah any metadata ops are going to be dog slow. Check that your OSDs
> > actually do have external SSD DBs — it’s easy over the OSD lifecycle to
> > deploy that way initially but to inadvertently rebuild OSDs without the
> > external device.
>
> I will investigate

`ceph osd metadata` piped through a suitable grep may show whether you have
OSDs that aren’t actually using the offboard WAL+DB partition; it’s included
in the sketch below.

> and I will start by planning a new pg bump for the volumes pool, which
> takes forever due to the size of the cluster

It takes forever because you have spinners ;). And because with recent Ceph
releases the cluster throttles the (expensive) PG splitting to prevent DoS.
Splitting all the PGs at once can be … impactful.

> AND somehow move the index pool to an osd device before bumping.

Is it only on dedicated NVMes right now? Which would be what, 36 OSDs? With
your WAL+DB SSDs having a 21:1 ratio, using them for the index pool instead /
in addition may or may not improve your performance, but you could always
move back.

> All this is excellent advice, which I thank you for.
>
> I would like now to ask your opinion on the original query:
>
> Do you think that there is some palpable difference between 1 bucket with 10
> million objects, and 10 buckets with 1 million objects each?

Depends on what you’re measuring. I suspect the second case would list bucket
contents faster.

> Intuitively, I feel that the first case would mean interacting with far fewer
> pgs than the second (10 times less?), which spreads the load on more devices,
> but my knowledge of ceph internals is nearly 0.
>
> Regards,
> Harry
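To pull the commands together, here is a rough sketch. Pool names are taken
from the `ceph df` / autoscale output quoted below; the metadata field names
are from memory, so double-check them on your release before relying on the
output:

    # which OSDs actually have an offboard WAL+DB? (field names from memory)
    ceph osd metadata | grep -E '"id"|"bluefs_dedicated_db"|"bluefs_db_rotational"'

    # keep the autoscaler from fighting a manual bump on the index pool
    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.index pg_num 1024

    # data and non-ec pools, per the numbers above
    ceph osd pool set default.rgw.buckets.data pg_num 8192
    ceph osd pool set default.rgw.buckets.non-ec pg_num 512

On recent releases pgp_num will trail along on its own and the splits are
throttled, as noted above, so expect this to take a while.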
>
> On Tue, Oct 15, 2024 at 4:26 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>>
>> > On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkomi...@gmail.com> wrote:
>> >
>> > Hello Anthony and thank you for your response!
>> >
>> > I have placed the requested info in a separate gist here:
>> > https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885
>>
>> > 3826 pgs not deep-scrubbed in time
>> > 1501 pgs not scrubbed in time
>>
>> Not surprising for HDDs. Double your deep-scrub interval.
>>
>> > Every OSD is an HDD, with their corresponding index, on a partition in an
>> > SSD device.
>>
>> So you’re relying on the SSD DB device for the index pool? Have you looked
>> at your logs / metrics for those OSDs to see if there is any spillover?
>>
>> What type of SSD are you using here? And how many HDD OSDs do you have
>> using each?
>>
>> > And we are talking about 18 separate devices, with separate
>> > cluster_network for the rebalancing etc.
>>
>> 18 separate devices? Do you mean 18 OSDs per server? 18 servers? Or the
>> fact that you’re using 18TB HDDs?
>>
>> > The index for the RGW is also on an HDD (for now).
>>
>> Uggh. If the index pool is entirely on HDDs, with no SSD DB partition, then
>> yeah any metadata ops are going to be dog slow. Check that your OSDs
>> actually do have external SSD DBs — it’s easy over the OSD lifecycle to
>> deploy that way initially but to inadvertently rebuild OSDs without the
>> external device.
>>
>> > Now as far as the number of pgs is concerned, I reached that number
>> > through one of the calculators that are found online.
>>
>> You’re using the autoscaler, I see.
>>
>> In your `ceph osd df` output, look at the PGS column at right. Your
>> balancer seems to be working fairly well. Your average number of PG
>> replicas per OSD is around 71, which is in alignment with upstream guidance.
>>
>> But I would suggest going twice as high. See the very recent thread about
>> PGs. So I would adjust pg_num on pools in accordance with their usage and
>> needs so that the PGS column there ends up in the 150 - 200 range.
>>
>> > Since the cluster is doing Object store, Filesystem and Block storage,
>> > each pool has a different number for pg_num.
>> > In the RGW Data case, the pool has about 300TB in it, so perhaps that
>> > explains that the pg_num is lower than what you expected?
>>
>> Ah, mixed cluster. You shoulda led with that ;)
>>
>> default.rgw.buckets.data    356.7T  3.0  16440T  0.0651  1.0  4096  off  False
>> default.rgw.buckets.index   5693M   3.0  16440T  0.0000  1.0    32  on   False
>> default.rgw.buckets.non-ec  62769k  3.0  418.7T  0.0000  1.0    32
>> volumes  8  16384  2.4 PiB  650.08M  7.2 PiB  53.80  2.1 PiB
>>
>> You have three pools with appreciable data — the two RBD pools and your
>> bucket pool. Your pg_nums are more or less reflective of that, which is
>> general guidance.
>>
>> But the index pool is not about data or objects stored. The index pool is
>> mainly omaps, not RADOS objects, and needs to be resourced differently.
>> Assuming that all 978 OSDs are identical media? Your `ceph df` output
>> though implies that you have OSDs on SSDs, so I’ll again request info on the
>> media and how your OSDs are built.
>>
>> Your index pool has only 32 PGs. I suggest setting pg_num for that pool to,
>> say, 1024.
>> It’ll take a while to split those PGs and you’ll see pgp_num slowly
>> increasing, but when it’s done I strongly suspect that you’ll have
>> better results.
>>
>> The non-ec pool is mainly AIUI used for multipart uploads. If your S3
>> objects are 4MB in size it probably doesn’t matter. If you do start using
>> MPU you’ll want to increase pg_num there too.
>>
>> >
>> > Regards,
>> > Harry
>> >
>> > On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> >>
>> >>> Hello Ceph Community!
>> >>>
>> >>> I have the following very interesting problem, for which I found no clear
>> >>> guidelines upstream, so I am hoping to get some input from the mailing list.
>> >>> I have a 6PB cluster in operation which is currently half full. The cluster
>> >>> has around 1K OSDs, and the RGW data pool has 4096 pgs (and pgp_num).
>> >>
>> >> Even without specifics I can tell you that pg_num is waaaaaaaaaaaaaay too
>> >> low.
>> >>
>> >> Please send
>> >>
>> >> `ceph -s`
>> >> `ceph osd tree | head -30`
>> >> `ceph osd df | head -10`
>> >> `ceph -v`
>> >>
>> >> Also, tell us what media your index and bucket OSDs are on.
>> >>
>> >>> The issue is as follows:
>> >>> Let's say that we have 10 million small objects (4MB each).
>> >>
>> >> In RGW terms, those are large objects. Small objects would be 4KB.
>> >>
>> >>> 1) Is there a performance difference *when fetching* between storing all 10
>> >>> million objects in one bucket and storing 1 million in 10 buckets?
>> >>
>> >> Larger buckets will generally be slower for some things, but if you’re on
>> >> Reef, and your bucket wasn’t created on an older release, 10 million
>> >> shouldn’t be too bad. Listing larger buckets will always be increasingly
>> >> slower.
>> >>
>> >>> There should be "some" because of the different number of pgs in use in
>> >>> the 2 scenarios, but it is very hard to quantify.
>> >>>
>> >>> 2) What if I have 100 million objects? Is there some theoretical limit /
>> >>> guideline on the number of objects that I should have in a bucket before I
>> >>> see performance drops?
>> >>
>> >> At that point, you might consider indexless buckets, if your
>> >> client/application can keep track of objects in its own DB.
>> >>
>> >> With dynamic sharding (assuming you have it enabled), RGW defaults to
>> >> 100,000 objects per shard and 1999 max shards, so I *think* that after 199M
>> >> objects in a bucket it won’t auto-reshard.
>> >>
>> >>> I should mention here that the contents of the bucket *never need to be
>> >>> listed*. The user always knows how to do a curl to get the contents.
>> >>
>> >> We can most likely improve your config, but you may also be a candidate
>> >> for an indexless bucket. They don’t get a lot of press, and I won’t claim
>> >> to be expert in them, but it’s something to look into.
>> >>
>> >>> Thank you for your help,
>> >>> Harry
>> >>>
>> >>> P.S.
>> >>> The following URLs have been very informative, but they do not answer my
>> >>> question unfortunately.
>> >>>
>> >>> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
>> >>> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io