[ceph-users] Re: Ceph Logging Configuration and "Large omap objects found"

2024-08-14 Thread Janek Bevendorff
Thanks. I increased the number even further and got a (literal) handful of non-debug messages. Unfortunately, none were relevant for the problem I'm trying to debug. On 13/08/2024 14:03, Eugen Block wrote: Interesting, apparently the number one provides in the 'ceph log last <n>' command is not
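
For reference, a minimal sketch of querying the cluster log this way; the count, level and channel values below are only examples, and the accepted arguments can differ between releases (check 'ceph log last -h' on your version):

  # show the last 10000 cluster-log entries at info level or above
  ceph log last 10000 info cluster
  # the "Large omap objects found" warning is also written to the mon's cluster
  # log file if file logging is enabled (under cephadm it lives below
  # /var/log/ceph/<fsid>/), so grepping there is an alternative
  grep 'Large omap object' /var/log/ceph/ceph.log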

[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Eugen Block
Hi, it looks like you're using size 2 pool(s); I strongly advise increasing that to 3 (and min_size = 2). Although it's unclear why the PGs get damaged, repairing a PG with only two replicas is difficult (which of the two is the correct one?). So to avoid that, avoid pools with size 2, except for
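
A minimal sketch of the suggested change, assuming a replicated pool named "mypool" (placeholder name):

  # raise the replica count and keep writes blocked below two copies
  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2
  # verify
  ceph osd pool get mypool size
  ceph osd pool get mypool min_size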

[ceph-users] Re: All MDS's Crashed, Failed Assert

2024-08-14 Thread Eugen Block
Hi, have you checked the MDS journal for any damage (replace {CEPHFS} with the name of your filesystem)? cephfs-journal-tool --rank={CEPHFS}:all journal inspect Quoting m...@silvenga.com: I'm looking for guidance around how to recover after all MDS continue to crash with a failed assert
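
A sketch of how that check might look, with "cephfs" standing in for the filesystem name; the export step is optional but gives you a backup before any recovery attempt:

  # inspect the MDS journal of all ranks for damage
  cephfs-journal-tool --rank=cephfs:all journal inspect
  # optionally export rank 0's journal to a file before touching anything
  cephfs-journal-tool --rank=cephfs:0 journal export /root/mds-journal-rank0.bin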

[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Eugen Block
Hi, how big are those PGs? If they're huge and are deep-scrubbed, for example, that can cause significant delays. I usually look at 'ceph pg ls-by-pool {pool}' and the "BYTES" column. Quoting Boris: Hi, currently we encounter laggy PGs and I would like to find out what is causing it. I
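
For reference, a rough way to spot oversized PGs from that listing; the pool name is a placeholder and the sort field is an assumption, since the position of the BYTES column can differ between releases:

  # list PGs of a pool and show the largest ones by the BYTES column
  ceph pg ls-by-pool mypool | sort -n -k6 | tail
  # the same listing also carries the deep-scrub timestamps, which helps to
  # correlate laggy periods with scrub activity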

[ceph-users] Re: All MDS's Crashed, Failed Assert

2024-08-14 Thread Venky Shankar
On Fri, Aug 2, 2024 at 8:27 AM wrote: > > I'm looking for guidance around how to recover after all MDS continue to > crash with a failed assert during journal replay (no MON damage). > > Context: > > So I've been working through failed MDS for the past day, likely caused by a > large snaptrim op

[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Frank Schilder
Hi Eugen, isn't every shard/replica on every OSD read and written with a checksum? Even if only the primary holds a checksum, it should be possible to identify the damaged shard/replica during deep-scrub (even for replication 1). Apart from that, it is unusual to see a virtual disk have read-er
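
If deep-scrub has already flagged an inconsistency, a hedged sketch for finding out which shard/replica is the bad one (pool and PG IDs below are placeholders):

  # PGs with recorded inconsistencies in a pool
  rados list-inconsistent-pg mypool
  # per-object, per-shard errors (read errors, checksum mismatches, ...)
  rados list-inconsistent-obj 2.1f --format=json-pretty
  # once the bad shard is known, let the primary repair the PG
  ceph pg repair 2.1f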

[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Boris
PGs are roughly 35GB. On Wed., 14 Aug 2024 at 09:25, Eugen Block wrote: > Hi, > > how big are those PGs? If they're huge and are deep-scrubbed, for > example, that can cause significant delays. I usually look at 'ceph pg > ls-by-pool {pool}' and the "BYTES" column. > > Quoting Boris: >

[ceph-users] Re: Snapshot getting stuck

2024-08-14 Thread Torkil Svensgaard
Hi guys No changes to the network. The Palo Alto firewall is outside our control and we do not have access to logs. I got an off-list suggestion to do the following, so I guess we'll try that: " Enable fstrim on your vms if using 18.2.1 ? Or look for a config setting make it false. rbd_skip_

[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Szabo, Istvan (Agoda)
Just out of curiosity I've checked my PG size, which is about 150GB; at what point are we talking about big PGs? From: Eugen Block Sent: Wednesday, August 14, 2024 2:23 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: Identify laggy PGs

[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Eugen Block
Hi Frank, you may be right about the checksums, but I just wanted to point out the risks of having size 2 pools in general. Since there was no response to the thread yet, I wanted to bump it a bit. Quoting Frank Schilder: Hi Eugen, isn't every shard/replica on every OSD read and writt

[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Eugen Block
Hi Boris, "PGs are roughly 35GB" - that's not huge. You wrote you drained one OSD which helped with the flapping, so you don't have flapping OSDs anymore at all? If you have identified problematic PGs, you can get the OSD mapping like this: ceph pg map 26.7 osdmap e14121 pg 26.7 (26.7) -> u
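
A short sketch of that workflow with a placeholder PG ID, for anyone following along:

  # map the laggy PG to its acting OSDs
  ceph pg map 26.7
  # per-OSD commit/apply latency, to see whether one of those OSDs stands out
  ceph osd perf
  # detailed PG state, including scrub and recovery information
  ceph pg 26.7 query | less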

[ceph-users] Re: Upgrading RGW before cluster?

2024-08-14 Thread Brett Niver
I don't know for sure, but RGW's use of object classes may have something to do with this recommendation. Brett On Wed, Aug 14, 2024 at 2:53 AM Eugen Block wrote: > Hi Thomas, > > I agree, from my point of view this shouldn't be an issue. And > although I usually stick to the documented process,

[ceph-users] Re: Ceph Logging Configuration and "Large omap objects found"

2024-08-14 Thread Eugen Block
Hm, then I don't see another way than to scan each OSD host for the omap message. Do you have centralized logging or some configuration management like Salt where you can target all hosts with a command? Quoting Janek Bevendorff: Thanks. I increased the number even further and got a (li
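
A hedged sketch of such a scan; the host names and Salt target are assumptions, and the log path follows the cephadm layout /var/log/ceph/<fsid>/ with file logging enabled:

  # with a configuration management tool such as Salt
  salt '*' cmd.run "grep -l 'Large omap object found' /var/log/ceph/*/ceph-osd.*.log"
  # or a plain ssh loop over the OSD hosts
  for h in host1 host2 host3; do
      ssh "$h" "grep -l 'Large omap object found' /var/log/ceph/*/ceph-osd.*.log"
  done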

[ceph-users] Re: Upgrading RGW before cluster?

2024-08-14 Thread Anthony D'Atri
> there are/were customers who had services colocated, for example MON, MGR and > RGW on the same nodes. Before cephadm when they upgraded the first MON node > they automatically upgraded the RGW as well, of course. This is one of the arguments in favor of containerized daemons. Strictly sp

[ceph-users] Re: [EXTERNAL] Re: Cephadm and the "--data-dir" Argument

2024-08-14 Thread Alex Hussein-Kershaw (HE/HIM)
Having it locked in from bootstrap seems like a fair compromise to me, especially if this is well documented and in line with other config attributes. Do feel free to reach out if you would like me to have a go at this. Thanks 🙂 From: Adam King Sent: Monday, Augu

[ceph-users] Cephadm Upgrade Issue

2024-08-14 Thread Alex Hussein-Kershaw (HE/HIM)
Hi Folks, I'm prototyping the upgrade process for our Ceph Clusters. I've adopted the Cluster following the docs, that works nicely 🙂 I then load my docker image into a locally running container registry, as I'm in a disconnected environment. I have a test Cluster with 3 VMs and no data, adopt
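
For context, a minimal sketch of how an upgrade from a local registry is usually started; the registry address and image tag are placeholders for a disconnected setup:

  # point the upgrade at the image in the local registry
  ceph orch upgrade start --image registry.local:5000/ceph/ceph:v16.2.15
  # follow progress and errors
  ceph orch upgrade status
  ceph -W cephadm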

[ceph-users] rbd du USED greater than PROVISIONED

2024-08-14 Thread Murilo Morais
Good morning everyone! I am confused about the listing of the total amount used on a volume. It says that more than the amount provisioned is being used. The image contains a snapshot. Below is the output of the "rbd du" command: user@abc2:~# rbd info osr1_volume_ssd/volume-6e5f90ac-78e9-465e-870

[ceph-users] Re: rbd du USED greater than PROVISIONED

2024-08-14 Thread Anthony D'Atri
> On Aug 14, 2024, at 10:45 AM, Murilo Morais wrote: > > Good morning everyone! > > I am confused about the listing of the total amount used on a volume. > It says that more than the amount provisioned is being used. The command shows 710GB of changes since the snapshot was taken, added to th
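
A small sketch that illustrates the point; pool and image names are hypothetical:

  # per-row USED: each snapshot row counts the data it still references, the
  # image (HEAD) row counts data written since the newest snapshot, so the
  # sum can exceed PROVISIONED
  rbd du mypool/myimage
  # list the snapshots that are holding on to the extra space
  rbd snap ls mypool/myimage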

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-14 Thread Yuri Weinstein
Still waiting to hear back: rgw - Eric, Adam E quincy-x, reef-x - Laura, Neha powercycle - Brad crimson-rados - Matan, Samuel ceph-volume - Guillaume On Tue, Aug 13, 2024 at 9:27 PM Venky Shankar wrote: > > Hi Yuri, > > On Tue, Aug 6, 2024 at 2:03 AM Yuri Weinstein wrote: > > > > Details of thi

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-14 Thread Matan Breizman
Crimson approved (Failures are known). On Wed, Aug 14, 2024 at 6:05 PM Yuri Weinstein wrote: > Still waiting to hear back: > > rgw - Eric, Adam E > quincy-x, reef-x - Laura, Neha > powercycle - Brad > crimson-rados - Matan, Samuel > ceph-volume - Guillaume > > On Tue, Aug 13, 2024 at 9:27 PM Ven

[ceph-users] Re: rbd du USED greater than PROVISIONED

2024-08-14 Thread Murilo Morais
I had forgotten that the snapshot contains data that was removed. Now everything makes sense. Thank you very much! On Wed., 14 Aug 2024 at 11:53, Anthony D'Atri wrote: > > > > > On Aug 14, 2024, at 10:45 AM, Murilo Morais wrote: > > > > Good morning everyone! > > > > I am confused abo

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Adam King
I don't think Pacific has the upgrade error-handling work, so it's a bit tougher to debug here. I think it should have printed a traceback into the logs, though. Maybe if you check `ceph log last 200 cephadm` right after it crashes there might be something. If not, you might need to do a `ceph mgr fa
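
A hedged sketch of that debugging sequence; the entry count is arbitrary, and older releases may require naming the mgr explicitly in 'ceph mgr fail':

  # recent cephadm log entries, ideally right after the upgrade stalls
  ceph log last 200 cephadm
  # fail over to the standby mgr and see whether the upgrade resumes
  ceph mgr fail
  ceph orch upgrade status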

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Alex Hussein-Kershaw (HE/HIM)
I spotted this SUSE support article: "Performing a `ceph orch restart mgr` results in endless restart loop", which sounded quite similar, so I gave it a go and did: ceph orch daemon rm mgr.raynor-sc-1 < wait a bit for it to be created > < repeat for e
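
For readers hitting the same loop, a sketch of that workaround; the daemon name below is the one from this thread and will differ on other clusters:

  # find the mgr daemon names
  ceph orch ps --daemon-type mgr
  # remove one standby mgr and wait for cephadm to recreate it on the new image
  ceph orch daemon rm mgr.raynor-sc-1 --force
  # then repeat for the remaining mgr daemons, one at a time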

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Eugen Block
A few of our customers were affected by that, but as far as I remember (I can look it up tomorrow), the actual issue popped up if they had more than two MGRs. But I believe it was resolved in a newer Pacific version (I don't have the exact version in mind). Which version did you try to upgrad

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Adam King
If you're referring to https://tracker.ceph.com/issues/57675, it got into 16.2.14, although there was another issue where running a `ceph orch restart mgr` or `ceph orch redeploy mgr` would cause an endless loop of the mgr daemons restarting, which would block all operations, that might be what we

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-14 Thread Adam Emerson
On 14/08/2024, Yuri Weinstein wrote: > Still waiting to hear back: > > rgw - Eric, Adam E Approved. (Sorry, I thought we were supposed to reply on the tracker.) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-user

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Eugen Block
Of course, I didn't really think that through. 😄 I believe we had to use the workaround to upgrade one mgr manually, as you already mentioned, and after that all went well. Thanks! Quoting Adam King: If you're referring to https://tracker.ceph.com/issues/57675, it got into 16.2.14, althou
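
A sketch of the manual-mgr-upgrade workaround mentioned here; the daemon name and image are placeholders, and the exact argument form of the redeploy command can vary between releases:

  # redeploy one standby mgr on the new image
  ceph orch daemon redeploy mgr.host1.abcdef --image quay.io/ceph/ceph:v16.2.15
  # make the upgraded mgr active, then start the regular upgrade
  ceph mgr fail
  ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15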

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-14 Thread Laura Flores
Hey @Yuri Weinstein, we've fixed a couple of issues and now need a few things rerun. 1. Can you please rerun upgrade/reef-x and upgrade/quincy-x? - Reasoning: Many jobs in those suites died due to https://tracker.ceph.com/issues/66883, which we deduced was a recent merge

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-14 Thread Brad Hubbard
On Tue, Aug 6, 2024 at 6:33 AM Yuri Weinstein wrote: > > Details of this release are summarized here: > > https://tracker.ceph.com/issues/67340#note-1 > > Release Notes - N/A > LRC upgrade - N/A > Gibba upgrade -TBD > > Seeking approvals/reviews for: > > rados - Radek, Laura (https://github.com/ce

[ceph-users] Bug with Cephadm module osd service preventing orchestrator start

2024-08-14 Thread Benjamin Huth
Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have encountered a problem with my managers. After they had been upgraded, my ceph orch module broke because the cephadm module would not load. This obviously halted the update because you can't really update without the orchestrator
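
A hedged sketch of where to look when a mgr module refuses to load; the daemon name is hypothetical, and 'cephadm logs' must be run on the host where that mgr runs:

  # health detail usually carries a MGR_MODULE_ERROR with the import error
  ceph health detail
  # list modules and their error state
  ceph mgr module ls
  # recent cephadm channel entries and the mgr's own journal
  ceph log last 100 cephadm
  cephadm logs --name mgr.host1.abcdef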