> But not, I suspect, nearly as many tentacles.
No, that's the really annoying part. It just works.
=
Frank Schilder
AIT Risø Campus
Building 109, room S14
From: Anthony D'Atri
Sent: Thursday, October 10, 2024 2:13 PM
To: Frank Schilder
Cc:
> I would treat having a separate cluster network
> at all as a serious cluster design bug.
I wouldn't go quite that far; there are still situations where it can be the
right thing to do, like if one is stuck with only 1GE or 10GE networking but
NICs and switch ports abound. Then having sepa
> I'm afraid nobody will build a 100PB cluster with 1T drives. That's just
> absurd
Check the archives for the panoply of absurdity that I’ve encountered ;)
> So, the sharp increase of per-device capacity has to be taken into account.
> Specifically as the same development is happening with S
Hi Nathan,
Any luck on your case? We're observing the same errors with 18.4.2 in the RGW
daemons with HAProxy in front and have yet to figure out the root cause.
Additionally, our logs get hit from time to time with the following:
debug 2024-10-10T09:31:42.213+ 7f51dd83d640 0 req 8849896936637
I don't think your plan will work as expected.
In step 3 you will introduce additional data movement, given the way you have
tried to accomplish this.
I suggest you set the CRUSH weight to 0 for each OSD you intend to replace; do
this for all OSDs you wish to replace whilst th
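In command form that would be something like this (osd.12 is just an example ID, not from your cluster):
  # drain one of the OSDs to be replaced
  ceph osd crush reweight osd.12 0
  # once backfill has emptied it, mark it out and remove it
  ceph osd out 12
  ceph osd purge 12 --yes-i-really-mean-it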
Hi Peter,
thanks for your comment. So it is mainly related to PG size. Unfortunately, we
need to have a reality check here:
> It was good for the intended use case, lots of small (by today's
> standards, around 1TB) OSDs on many servers working in parallel.
>
> Note: HDDs larger than 1TB are not
>
> We need to replace about 40 disks distributed over all 12 hosts backing a
> large pool with EC 8+3. We can't do it host by host as it would take way too
> long (replace disks per host and let recovery rebuild the data)
This is one of the false economies of HDDs ;)
> Therefore, we would l
I would recommend looking at stretch mode. There have been discussions on this
list about the reliability of 2-DC setups, and just using CRUSH rules to
distribute shards doesn't cut it. There are corner cases that are only
handled correctly by stretch mode if the system needs to be up wi
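For reference, enabling it looks roughly like this; the monitor names, site names and the rule name below are placeholders for your own setup:
  # tag each monitor with its data center (mon and DC names are examples)
  ceph mon set_location a datacenter=site1
  ceph mon set_location b datacenter=site2
  ceph mon set_location e datacenter=site3
  # enable stretch mode with mon.e as tiebreaker and a suitable 2-DC CRUSH rule
  ceph mon enable_stretch_mode e stretch_rule datacenter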
Hi Phong,
You should bond them. I would treat having a separate cluster network
at all as a serious cluster design bug. Reason: a single faulty NIC or
cable or switch port on the backend network can bring down the whole
cluster. This is even documented:
https://docs.ceph.com/en/latest/rados/troub
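Concretely, that just means a single public network over the bond and no cluster_network at all, e.g. (the subnet is only an example):
  # one subnet for everything; do not define cluster_network
  ceph config set global public_network 192.168.1.0/24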
>>> On Thu, 10 Oct 2024 08:53:08 +, Frank Schilder said:
> The guidelines are *not* good enough for EC pools on large
> HDDs that store a high percentage of small objects, in our
> case, files.
Arguably *nothing* is good enough for that, because it is the
worst possible scenario (A Ceph
Hi Bill
On 10.10.24 at 10:57, Bill Scales wrote:
2. Replica pools can be configured to support local reads where clients
send read I/O requests to an OSD at the same site. For erasure-coded
pools all read I/O must be sent via the primary OSD, which half the time
will be on the remote site.
Hi Frank,
On 10.10.24 at 11:13, Frank Schilder wrote:
I would like to recommend to look at stretch-mode. There have been
discussions on this list about the reliability of 2-DC set-ups and
just using crush rules to distribute shards doesn't cut it. There
are corner cases that are only handled c
> The main problem was the increase in RAM use scaling with PGs, which in
> normal operation is often fine but, as we all know, balloons in failure
> conditions.
Less so with BlueStore in my experience. I think in part this surfaces a bit of
Filestore legacy that we might re-examine with Filestor
Thanks Anthony and Wesley for your input.
Let me explain in more detail why I'm interested in the somewhat obscure
looking procedure in step 1.
What's the difference between "ceph osd reweight" and "ceph osd crush reweight"?
The difference is that command 1 only remaps shards within the same failure domain
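For reference, the two commands look like this (osd.12 is only an example):
  # override reweight: a temporary 0..1 factor, CRUSH bucket weights stay untouched
  ceph osd reweight 12 0.8
  # CRUSH reweight: changes the weight in the CRUSH map, so host/rack weights change too
  ceph osd crush reweight osd.12 0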
Hi Janne.
> To be fair, this number could just be something vaguely related to
> "spin drives have 100-200 iops" ...
It could be, but is it? Or is it just another rumor? I simply don't see how the
PG count could possibly impact IO load on a disk.
How about this guess: It could be dragged along
Cool. Glad you made it through. ;-)
Regards,
Frédéric.
- On 9 Oct 24, at 16:46, Alex Rydzewski rydzewski...@gmail.com wrote:
> Great thanks, Frédéric!
>
> It seems that --yes-i-really-mean-it helped. The cluster rebuilding now
> and I can access my data on it!
>
> On 09.10.24 15:48, Fréd
Hi,
> Is it possible to recover two data chunks out of 2 coding chunks?
Yes.
> What do you think about 2+4, is it a good idea or a bad one?
Some differences between a replica pool and an erasure-coded pool to consider:
1. If an OSD fails, there will be a lot more network traffic between the site
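For context, a 2+4 pool would be created with something like the following; the profile name, pool name, PG count and failure domain are examples only:
  # k=2 data chunks, m=4 coding chunks; any 4 shards can be lost and still recovered
  ceph osd erasure-code-profile set ec-2-4 k=2 m=4 crush-failure-domain=host
  ceph osd pool create mypool-ec 64 64 erasure ec-2-4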
I'm having lots of these warnings on all nodes:
CephNodeNetworkPacketDrops
Node ceph-06 experiences packet drop > 0.5% or > 10 packets/s on interface
bond0.
Bonds are 802.3ad with Mikrotik switches.
Physical interfaces are OK, without drops. They are Intel e1000 dual PCIe
NICs
Running with kern
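One way to narrow down where the drops are counted (interface names are examples):
  # per-interface drop counters; compare the bond against its slave NICs
  ip -s link show bond0
  ip -s link show eno1
  # 802.3ad/LACP state of the bond and each slave
  cat /proc/net/bonding/bond0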
> but simply on the physical parameter of IOPS-per-TB (a "figure of merit" that
> is widely underestimated or ignored)
hear hear!
> of HDDs, and having enough IOPS-per-TB to sustain both user and admin
> workload.
Even with SATA SSDs I twice had to expand a cluster to meet SLO long before it
If you are replacing the OSDs with the same size/weight device, I agree
with your reweight approach. I've been doing some similar work myself that
does require crush reweighting to 0 and have been in that headspace.
I did a bit of testing around this:
- Even with the lowest possible reweight an O
> Hi Anthony.
>
>> ... Bump up pg_num on pools and see how the average / P90 ceph-osd process
>> size changes?
>> Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say
>> it used to be much higher.
>
> That's not an option. What would help is a-priori information based on
For 1s I thought you were in Florida!
Yes, this was an old lesson and AFAIK nobody has intentionally pushed the
bounds in a long time because it was a very painful lesson for anybody who
ran into it.
The main problem was the increase in RAM use scaling with PGs, which in
normal operation is often fine but, as we all know, balloons in failure
conditions.
Hi all,
a hopefully simple question this time. I would like a second opinion on a
procedure for replacing a larger number of disks.
We need to replace about 40 disks distributed over all 12 hosts backing a large
pool with EC 8+3. We can't do it host by host as it would take way too long
(replace disks per host and let recovery rebuild the data).
Hi Greg,
thanks for chiming in here.
> ... presumably because the current sizing guidelines are generally good
> enough to be getting on with ...
That's exactly why I'm bringing this up with such insistence. The guidelines
are *not* good enough for EC pools on large HDDs that store a high percentage
of small objects, in our case, files.
On Wed, Oct 9, 2024 at 7:28 AM 李明 wrote:
> Hello,
>
> ceph version is 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)
>
> and the rbd info command is also slow; sometimes it needs 6 seconds. rbd
> snap create command takes 17 seconds. There is another cluster with the
>
> [... number of PGs per OSD ...]
> So it is mainly related to PG size.
Indeed, and secondarily the number of objects: many objects per PG
mean lower metadata overhead, but bigger PGs mean higher admin
workload latency.
>> Note: HDDs larger than 1TB are not really suitable for
>> significant parallel
We identified a potential RGW data loss situation on versioned buckets in
multisite settings. Please see the tracker for details:
https://tracker.ceph.com/issues/68466.
This affects Reef, Squid, and main. Earlier versions have not been tested,
though.
I manage a 4-host cluster on Ubuntu 22.04 LTS with Ceph installed through
cephadm and containers on Docker.
Last month I migrated to the latest Ceph 19.2. All went great.
Last week I upgraded one of my hosts to Ubuntu 24.04.1 LTS.
Now I get the following warning in cephadm shell -- ceph sta