[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Frank Schilder
> But not, I suspect, nearly as many tentacles. No, that's the really annoying part. It just works. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Anthony D'Atri Sent: Thursday, October 10, 2024 2:13 PM To: Frank Schilder Cc:

[ceph-users] Re: About 100g network card for ceph

2024-10-10 Thread Anthony D'Atri
> I would treat having a separate cluster network > at all as a serious cluster design bug. I wouldn’t go quite that far, there are still situations where it can be the right thing to do. Like if one is stuck with only 1GE or 10GE networking, but NICs and switch ports abound. Then having sepa
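For reference, a minimal sketch of such a split, using hypothetical subnets; cluster_network only affects OSD replication/recovery traffic and is best set before the OSDs are deployed:

    # public (client/MON) traffic vs. dedicated replication/recovery traffic
    ceph config set global public_network 192.168.10.0/24
    ceph config set global cluster_network 192.168.20.0/24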

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri
> I'm afraid nobody will build a 100PB cluster with 1T drives. That's just > absurd Check the archives for the panoply of absurdity that I’ve encountered ;) > So, the sharp increase of per-device capacity has to be taken into account. > Specifically as the same development is happening with S

[ceph-users] Re: rgw connection resets

2024-10-10 Thread laimis . juzeliunas
Hi Nathan, Any luck on your case? We're observing the same errors with 18.4.2 in the RGW daemons with Haproxy in front and are yet to figure out the root cause. Additionally our logs get hit from time to time with the following: debug 2024-10-10T09:31:42.213+0000 7f51dd83d640 0 req 8849896936637

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Wesley Dillingham
I don't think your plan will work as expected. In step 3 you will introduce additional data movement with the manner in which you have tried to accomplish this. I suggest you instead set the CRUSH weight to 0 for each OSD you intend to replace; do this for all OSDs you wish to replace whilst th
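A minimal sketch of that drain step, with osd.17 standing in for one of the OSDs slated for replacement:

    # zero the CRUSH weight so the OSD's data drains off in a single rebalance
    ceph osd crush reweight osd.17 0
    # repeat for every OSD being replaced, then wait for active+clean before pulling drives
    ceph -s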

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Frank Schilder
Hi Peter, thanks for your comment. So it is mainly related to PG size. Unfortunately, we need to have a reality check here: > It was good for the intended use case, lots of small (by today's > standards, around 1TB) OSDs on many servers working in parallel. > > Note: HDDs larger than 1TB are not

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Anthony D'Atri
> > We need to replace about 40 disks distributed over all 12 hosts backing a > large pool with EC 8+3. We can't do it host by host as it would take way too > long (replace disks per host and let recovery rebuild the data) This is one of the false economies of HDDs ;) > Therefore, we would l

[ceph-users] Re: Erasure coding scheme 2+4 = good idea?

2024-10-10 Thread Frank Schilder
I would recommend looking at stretch mode. There have been discussions on this list about the reliability of 2-DC set-ups, and just using crush rules to distribute shards doesn't cut it. There are corner cases that are only handled correctly by stretch mode if the system needs to be up wi
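For reference, a rough sketch of enabling stretch mode on a 2-DC cluster; the monitor names, CRUSH rule name and site names below are placeholders:

    ceph mon set election_strategy connectivity
    ceph mon set_location a datacenter=site1
    ceph mon set_location b datacenter=site2
    ceph mon set_location e datacenter=site3    # tiebreaker monitor at a third location
    ceph mon enable_stretch_mode e stretch_rule datacenter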

[ceph-users] Re: About 100g network card for ceph

2024-10-10 Thread Alexander Patrakov
Hi Phong, You should bond them. I would treat having a separate cluster network at all as a serious cluster design bug. Reason: a single faulty NIC or cable or switch port on the backend network can bring down the whole cluster. This is even documented: https://docs.ceph.com/en/latest/rados/troub
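A minimal sketch of such a bond with iproute2, assuming hypothetical interface names eth0/eth1; in practice this would be made persistent via netplan or ifupdown rather than typed by hand:

    ip link add bond0 type bond mode 802.3ad miimon 100 lacp_rate fast xmit_hash_policy layer3+4
    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    ip link set bond0 up    # both public and cluster traffic ride this single bond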

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Peter Grandi
>>> On Thu, 10 Oct 2024 08:53:08 +0000, Frank Schilder said: > The guidelines are *not* good enough for EC pools on large > HDDs that store a high percentage of small objects, in our > case, files. Arguably *nothing* is good enough for that, because it is the worst possible case scenario (A Ceph

[ceph-users] Re: Erasure coding scheme 2+4 = good idea?

2024-10-10 Thread Andre Tann
Hi Bill, On 10.10.24 at 10:57, Bill Scales wrote: 2. Replica pools can be configured to support local reads where clients send read I/O requests to an OSD at the same site. For erasure coded pools all read I/O must be sent via the primary OSD, which half the time will be on the remote site.
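As an illustration of point 2 for RBD clients on a replicated pool (a sketch under the assumption that clients can be told their location): reads can be steered to a nearby replica instead of the primary:

    # serve reads from a replica close to the client (Octopus or later)
    ceph config set client rbd_read_from_replica_policy localize
    # each client also needs its location, e.g. in its ceph.conf:
    #   crush_location = datacenter=site1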

[ceph-users] Re: Erasure coding scheme 2+4 = good idea?

2024-10-10 Thread Andre Tann
Hi Frank, On 10.10.24 at 11:13, Frank Schilder wrote: I would recommend looking at stretch mode. There have been discussions on this list about the reliability of 2-DC set-ups, and just using crush rules to distribute shards doesn't cut it. There are corner cases that are only handled c

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri
> The main problem was the increase in ram use scaling with PGs, which in > normal operation is often fine but as we all know balloons in failure > conditions. Less so with BlueStore in my experience. I think in part this surfaces a bit of Filestore legacy that we might re-examine with Filestor

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Frank Schilder
Thanks Anthony and Wesley for your input. Let me explain in more detail why I'm interested in the somewhat obscure-looking procedure in step 1. What's the difference between "ceph osd reweight" and "ceph osd crush reweight"? The difference is that command 1 only remaps shards within the same fai
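The two commands being compared, with osd.17 as a placeholder ID:

    ceph osd reweight 17 0            # 0..1 override weight: remaps PGs, but the OSD keeps
                                      # its full weight inside the CRUSH tree
    ceph osd crush reweight osd.17 0  # CRUSH weight: removes the OSD's share from its
                                      # failure domain, so data may leave the host entirely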

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Frank Schilder
Hi Janne. > To be fair, this number could just be something vaguely related to > "spin drives have 100-200 iops" ... It could be, but is it? Or is it just another rumor? I simply don't see how the PG count could possibly impact IO load on a disk. How about this guess: It could be dragged along

[ceph-users] Re: Forced upgrade OSD from Luminous to Pacific

2024-10-10 Thread Frédéric Nass
Cool. Glad you made it through. ;-) Regards, Frédéric. - On 9 Oct 24, at 16:46, Alex Rydzewski rydzewski...@gmail.com wrote: > Great thanks, Frédéric! > > It seems that --yes-i-really-mean-it helped. The cluster is rebuilding now > and I can access my data on it! > > On 09.10.24 15:48, Fréd

[ceph-users] Re: Erasure coding scheme 2+4 = good idea?

2024-10-10 Thread Bill Scales
Hi, > Is it possible to recover two data chunks out of 2 coding chunks? Yes. > What do you think about 2+4, is it a good idea or a bad one? Some differences between a replica pool and an erasure code pool to consider: 1. If an OSD fails there will be a lot more network traffic between the site
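A minimal sketch of such a profile and pool; the names, failure domain and pg_num are placeholders, and a real 2-DC layout would also need a CRUSH rule placing three shards per site:

    ceph osd erasure-code-profile set ec-2-4 k=2 m=4 crush-failure-domain=host
    ceph osd pool create testpool 128 128 erasure ec-2-4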

[ceph-users] Packets Drops in bond interface

2024-10-10 Thread Alfredo Rezinovsky
I'm getting lots of these warnings on all nodes: CephNodeNetworkPacketDrops Node ceph-06 experiences packet drop > 0.5% or > 10 packets/s on interface bond0. The bonds are 802.3ad with Mikrotik switches. The physical interfaces are OK, without drops. They are Intel e1000 dual PCIe NICs. Running with kern
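A few checks that would narrow this down (the NIC name below is a placeholder); if the physical ports are clean but the bond drops packets, the LACP partner state and a hash-policy mismatch with the Mikrotik side are the usual suspects:

    cat /proc/net/bonding/bond0            # per-slave link state and LACP partner details
    ip -s link show bond0                  # RX/TX drop counters as the kernel sees them on the bond
    ethtool -S eno1 | grep -iE 'drop|err'  # per-NIC hardware counters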

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri
> but simply on the physical parameter of IOPS-per-TB (a "figure of merit" that > is widely underestimated or ignored) hear hear! > of HDDs, and having enough IOPS-per-TB to sustain both user and admin > workload. Even with SATA SSDs I twice had to expand a cluster to meet SLO long before it
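A back-of-envelope illustration of that figure of merit, assuming roughly 200 random IOPS per HDD regardless of capacity:

    for tb in 1 4 10 20; do echo "${tb}TB HDD: $((200 / tb)) IOPS per TB"; done
    # 1TB: 200, 4TB: 50, 10TB: 20, 20TB: 10 - capacity grows, IOPS per TB collapses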

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Wesley Dillingham
If you are replacing the OSDs with the same size/weight device, I agree with your reweight approach. I've been doing some similar work myself that does require crush reweighting to 0 and have been in that headspace. I did a bit of testing around this: - Even with the lowest possible reweight an O

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri
> Hi Anthony. > >> ... Bump up pg_num on pools and see how the average / P90 ceph-osd process >> size changes? >> Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say >> it used to be much higher. > > That's not an option. What would help is a-priori information based on
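A sketch of the experiment being suggested, with a placeholder pool name and target pg_num: bump pg_num in steps and watch per-OSD memory:

    ceph config get osd osd_map_cache_size   # current map cache setting
    ceph osd pool set testpool pg_num 256
    ceph tell osd.0 dump_mempools            # per-OSD memory pool breakdown after rebalancing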

[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Marc
For 1s I thought you were in Florida!

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Gregory Farnum
Yes, this was an old lesson and AFAIK nobody has intentionally pushed the bounds in a long time because it was a very painful lesson for anybody who ran into it. The main problem was the increase in ram use scaling with PGs, which in normal operation is often fine but as we all know balloons in fa
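One knob that bounds this today, for reference; osd_memory_target is a target that BlueStore trims its caches toward, not a hard cap, and the 4 GiB value is only an example:

    ceph config set osd osd_memory_target 4294967296   # roughly 4 GiB per OSD
    ceph config get osd.0 osd_memory_target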

[ceph-users] Procedure for temporary evacuation and replacement

2024-10-10 Thread Frank Schilder
Hi all, a hopefully simple question this time. I would like a second opinion on a procedure for replacing a larger number of disks. We need to replace about 40 disks distributed over all 12 hosts backing a large pool with EC 8+3. We can't do it host by host as it would take way too long (repla

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Frank Schilder
Hi Greg, thanks for chiming in here. > ... presumably because the current sizing guidelines are generally good > enough to be getting on with ... That's exactly why I'm bringing this up with such insistence. The guidelines are *not* good enough for EC pools on large HDDs that store a high perc

[ceph-users] Re: The ceph monitor crashes every few days

2024-10-10 Thread Gregory Farnum
On Wed, Oct 9, 2024 at 7:28 AM 李明 wrote: > Hello, > > ceph version is 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) > nautilus (stable) > > and the rbd info command is also slow, sometimes it needs 6 seconds. rbd > snap create command takes 17 seconds. There is another cluster with the >

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Peter Grandi
> [... number of PGs per OSD ...] > So it is mainly related to PG size. Indeed and secondarily number of objects: many objects per PG mean lower metadata overhead, but bigger PGs mean higher admin workload latency. >> Note: HDDs larger than 1TB are not really suitable for >> significant parallel

[ceph-users] a potential rgw data loss issue for awareness

2024-10-10 Thread Jane Zhu (BLOOMBERG/ 120 PARK)
We identified a potential rgw data loss situation on versioned bucket in multisite settings. Please see the tracker for details: https://tracker.ceph.com/issues/68466. This is affecting Reef, Squid and main. Earlier versions have not been tested though.

[ceph-users] Ubuntu 24.04 LTS Ceph status warning

2024-10-10 Thread Dominique Ramaekers
I manage a 4-host cluster on Ubuntu 22.04 LTS with Ceph installed through cephadm and containers on Docker. Last month, I migrated to the latest Ceph 19.2. All went great. Last week I upgraded one of my hosts to Ubuntu 24.04.1 LTS. Now I get the following warning in cephadm shell -- ceph sta