Hi Greg,

Thanks for chiming in here.

> ... presumably because the current sizing guidelines are generally good 
> enough to be getting on with ...

That's exactly why I'm bringing this up with such insistence. The guidelines 
are *not* good enough for EC pools on large HDDs that store a high percentage 
of small objects, in our case files. In fact, they are really bad in that case, 
and there have been a number of recent ceph-users threads where a significant 
increase in PG count would probably have helped a lot, including the problems 
caused by a very high object count per PG that I'm dancing around on our cluster.

As for guessing where the recommendation comes from, I'm actually leaning 
towards the "PGs should be limited in size" explanation. The recommendation of 
100 PGs per OSD was good enough for a very long time, together with the bugs 
and observations you mention, where it was never really assessed what the 
actual cause was or what resources are actually needed per PG compared with 
the total object count per OSD.

PGs were originally invented to chunk up large disks for distributed RAID, to 
keep all-to-all rebuild time constant independent of the scale of the cluster. 
That's how you get scale-out capability. A fixed PG count per OSD counteracts 
that, given the enormous increase in capacity per disk we have seen lately. 
That's why I actually lean towards the recommendation having been intended to 
keep PGs below 5-10G each (and/or below N objects) and simply never being 
updated as hardware developed.
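
Just to make the arithmetic explicit, here is a back-of-the-envelope sketch in 
Python (the capacities and the flat 100 PGs/OSD are illustrative assumptions, 
not measured Ceph behaviour):

# Data per PG for a fixed PG count per OSD; illustrative numbers only.
PGS_PER_OSD = 100  # the classic recommendation

for osd_capacity_tb in (1, 4, 10, 20):
    pg_size_gb = osd_capacity_tb * 1000 / PGS_PER_OSD
    print(f"{osd_capacity_tb:>3} TB OSD -> ~{pg_size_gb:.0f} GB per PG")

# 1 TB OSD  -> ~10 GB per PG  (the world the guideline grew up in)
# 20 TB OSD -> ~200 GB per PG (same guideline, 20x the data per PG)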

I have serious problems seeing how the PG count could be the single number 
screwing a cluster up. Peering, recovery, rocksdb size: everything is tied to 
the object count of an OSD. PGs just split this up into smaller units that are 
easier to manage. As a principle, for *any* problem with super-linear 
complexity (greater than linear), solving M problems of size N/M is easier 
than solving 1 problem of size N. So, increasing the PG count should *improve* 
things on this principle alone. Unless there is a serious implementation 
problem, I really don't understand why anyone would claim the opposite. If 
there is such an implementation problem, please, anyone, come forward.
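
To illustrate that principle with a toy model (the cost function, exponent and 
object count below are made up purely for illustration and don't model any 
specific piece of Ceph code):

# If some per-PG operation costs ~ n**alpha in the number of objects n it
# touches, with alpha > 1, then splitting the same objects over more PGs
# reduces the total cost: M * (N/M)**alpha = N**alpha / M**(alpha - 1).
def total_cost(objects_per_osd, pgs_per_osd, alpha=2.0):
    n = objects_per_osd / pgs_per_osd   # objects per PG
    return pgs_per_osd * n ** alpha     # summed over all PGs of the OSD

N = 10_000_000  # objects on one OSD, illustrative
base = total_cost(N, 100)
for pgs in (50, 100, 200, 400):
    print(f"{pgs:>4} PGs/OSD -> relative cost {total_cost(N, pgs) / base:.2f}")

# With alpha = 2: 50 PGs -> 2.00, 100 -> 1.00, 200 -> 0.50, 400 -> 0.25.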

So I question the anecdotal reports that blame the PG count alone. There have 
been a number of bugs discovered that were triggered by PG splitting. That 
these bugs are more likely to be hit when using high PG counts is kind of 
obvious. So it's not the PG count per se that's the problem.

Testing and experiments could be useful to update the guidelines. However, a 
good look at the code by a PG code maintainer would probably be faster, and if 
there is something problematic it would be better to refer to the code than to 
experiments that might have missed the critical section. So the question really 
is: is there a piece of code that is more than quadratic in the PG count in any 
resource? Worse yet, is there something exponential? If there is something like 
that, there is no point in running experiments.

If there is nothing like that in the code, it's worth conducting experiments 
and providing a table of resource usage depending on the PG count. That would 
be very much appreciated.
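
If someone does pick this up, even a trivial starting point like the Python 
sketch below might help. It only relies on "ceph osd df --format json"; the 
field names are what I'd expect from recent releases and may need adjusting, 
and per-OSD memory (process RSS or mempool stats) would still have to be 
collected on the OSD hosts and added as extra columns:

# Sketch only: tabulate PG count vs. used capacity per OSD as a starting
# point for a "resource usage vs. PG count" table.
import json
import subprocess

raw = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
nodes = json.loads(raw)["nodes"]

print(f"{'osd':>5} {'pgs':>6} {'used_gib':>10} {'gib_per_pg':>11}")
for n in nodes:
    used_gib = n["kb_used"] / (1024 * 1024)
    per_pg = used_gib / n["pgs"] if n["pgs"] else 0.0
    print(f"{n['id']:>5} {n['pgs']:>6} {used_gib:>10.1f} {per_pg:>11.2f}")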

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfar...@redhat.com>
Sent: Thursday, October 10, 2024 10:19 AM
To: Frank Schilder
Cc: Janne Johansson; Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: What is the problem with many PGs per OSD

Yes, this was an old lesson and AFAIK nobody has intentionally pushed the 
bounds in a long time because it was a very painful lesson for anybody who ran 
into it.

The main problem was the increase in RAM use scaling with PGs, which in normal 
operation is often fine but as we all know balloons in failure conditions.

There are many developments that may have made things behave better, but early 
on some clusters just couldn’t be recovered until they received double their 
starting RAM and were babysat through careful manually-orchestrated startup. 
(Or maybe worse — I forget.)

Nobody’s run experiments, presumably because the current sizing guidelines are 
generally good enough to be getting on with, for anybody who has the resources 
to try and engage in the measurement work it would take to re-validate them. I 
will be surprised if anybody has information of the sort you seem to be 
searching for.
-Greg

On Thu, Oct 10, 2024 at 12:13 AM Frank Schilder 
<fr...@dtu.dk> wrote:
Hi Janne.

> To be fair, this number could just be something vaguely related to
> "spin drives have 100-200 iops" ...

It could be, but is it? Or is it just another rumor? I simply don't see how the 
PG count could possibly impact IO load on a disk.

How about this guess: it could have been carried over from a time when HDDs 
were <=1T, and it simply meant keeping PGs no larger than 10G. Sounds 
reasonable, but is it?

I think we should really stop second-guessing here. This discussion was not 
meant to be a long thread where we all just guess but never know. I would 
appreciate it if someone who actually knows why this recommendation is really 
there would chime in here. As far as I can tell, it could be anything or 
nothing. I actually tend towards it being nothing: it was just never updated 
along with new developments, and nowadays nobody knows any more.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Janne Johansson <icepic...@gmail.com>
Sent: Thursday, October 10, 2024 8:51 AM
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: What is the problem with many PGs per OSD

On Wed, 9 Oct 2024 at 20:48, Frank Schilder <fr...@dtu.dk> wrote:

> The PG count per OSD is a striking exception. It's just a number (well, a range 
> with 100 recommended and 200 as a max: 
> https://docs.ceph.com/en/latest/rados/operations/pgcalc/#keyDL). It just is. 
> And this doesn't make any sense unless there is something really evil lurking 
> in the dark.
> For comparison, a guidance that does make sense is something like 100 PGs per 
> TB. That I would vaguely understand: to keep the average PG size constant at 
> a max of about 10G.

To be fair, this number could just be something vaguely related to
"spin drives have 100-200 iops", and while CentOS/RHEL Linux kernels 10
years ago did have some issues getting IO done in parallel as much
as possible towards a single device, running multiple OSDs on flash
devices would have been both a way to get around this limitation in
the IO middle layer, and a way to "tell" Ceph it can send more IO to
the device since it has multiple OSDs on it.

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
