[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-23 Thread Eugen Block
Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Spencer Macphee Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi all, we took a log with setting debug_journaler=20 and managed to track the deadlock down to line https://github.com/ceph/ceph/blob/pacific/s

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-23 Thread Frank Schilder
nk Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Sunday, January 19, 2025 5:35 PM To: ceph-users@ceph.io Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Spencer Macphee Subject: [ceph-users] Re: Help needed, ceph fs down due to

[ceph-users] Re: Help needed: s3cmd set ACL command possess S3 error: 400 (InvalidArgument) in squid ceph version.

2025-01-20 Thread Saif Mohammad
Thanks Stephan

[ceph-users] Re: Help needed: s3cmd set ACL command possess S3 error: 400 (InvalidArgument) in squid ceph version.

2025-01-20 Thread Stephan Hohn
Hi Mohammad, this seems to be a bug in the current squid version. https://tracker.ceph.com/issues/69527 Cheers Stephan Am Mo., 20. Jan. 2025 um 11:56 Uhr schrieb Saif Mohammad < samdto...@gmail.com>: > Hello Community, > > We are trying to set ACL for one of the objects by s3cmd tool within t

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-19 Thread Frank Schilder
are affected. Thanks for your help and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Saturday, January 18, 2025 2:21 PM To: Frédéric Nass; ceph-users@ceph.io Cc: Dan van der Ster; Patrick

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-18 Thread Frank Schilder
Hi all, looking at the log data (see snippet at end) we suspect a classic "producer–consumer" deadlock since it seems that the same thread that is filling the purge queue at PurgeQueue.cc:L335:journaler.append_entry(bl) in function PurgeQueue::push is also responsible for processing it but the

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-15 Thread Frank Schilder
ceph-users@ceph.io Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Spencer Macphee Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi Frank, More than ever. You should open a tracker and post debug logs there so anyone can have a look. Regards

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-13 Thread Frédéric Nass
; Spencer Macphee Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Dear all, a quick update and some answers. We set up a dedicated host for running an MDS and debugging the problem. On this host we have 750G RAM, 4T swap and 4T log, both on fast SSDs. Plan is to monitor

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-13 Thread Frank Schilder
Dear all, a quick update and some answers. We set up a dedicated host for running an MDS and debugging the problem. On this host we have 750G RAM, 4T swap and 4T log, both on fast SSDs. Plan is to monitor with "perf top" the MDS becoming the designated MDS for the problematic rank and also pull

[ceph-users] Re: Help in recreating a old ceph cluster

2025-01-12 Thread Eugen Block
Hi, basically it's that easy [0] when only one or few hosts are reinstalled but the cluster is otherwise operative: ceph cephadm osd activate ... If your cluster has lost all monitors, it can get difficult. You can rebuild the mon store [1] by collecting required information from *ALL* O
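A minimal sketch of the two recovery paths referred to above, with placeholder host and OSD paths (the full mon-store rebuild procedure in the docs has additional steps):

    # Reinstalled host, cluster otherwise operative: re-adopt its existing OSDs
    ceph cephadm osd activate <hostname>

    # All monitors lost: collect cluster maps from every OSD into a fresh mon store
    # (one step of the documented rebuild; run against each OSD's data path)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --no-mon-config --op update-mon-db --mon-store-path /tmp/mon-store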

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frédéric Nass
he current state the MDS is in, but you may want to consider this move if you can. Regards, Frédéric. From: Frank Schilder Sent: Sunday, January 12, 2025 00:07 To: Eugen Block Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
10:43 PM To: Eugen Block Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi Eugen, thanks and yes, let's try one thing at a time. I will report back. Best regards, = Frank Schilder AIT Risø Campus Bygning 10

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
ceph-users@ceph.io Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Personally, I would only try one change at a time and wait for a result. Otherwise it can get difficult to tell what exactly helped and what not. I have never played with auth_service_ticket_ttl yet, so

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Eugen Block
ng 109, rum S14 From: Eugen Block Sent: Saturday, January 11, 2025 7:59 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi Frank, not sure if this already has been mentioned, but this one has 60 seconds timeout: mds_beacon_mon_down_gra

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi Frank, not sure if this already has been mentioned, but this one has 60 seconds timeout: mds_beacon_mon_down_grace ceph config help mds_beacon_mon_down_grace mds_beacon_mon_down_grace - toleran

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Eugen Block
From: Frank Schilder Sent: Saturday, January 11, 2025 12:46 PM To: Dan van der Ster Cc: Bailey Allison; ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi all, my hopes are down again. The MDS might look busy but I'm not sure its doing any

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi all, my hopes are down again. The MDS might look busy but I'm not sure its doing anything interesting. I now see a lot of these in the log (stripped the heartbeat messages): 2025-01-11T12:35:50.712

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
m S14 From: Frank Schilder Sent: Saturday, January 11, 2025 11:41 AM To: Dan van der Ster Cc: Bailey Allison; ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Hi all, new update: after sleeping after the final MDS restart the MDS is

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
regards! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Saturday, January 11, 2025 2:36 AM To: Dan van der Ster Cc: Bailey Allison; ceph-users@ceph.io Subject: [ceph-users] Re: Help needed, ceph fs down due to

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
he MDS idle yet unresponsive". Thanks for your help so far! Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dan van der Ster Sent: Saturday, January 11, 2025 3:04 AM To: Frank Schilder Cc: Bailey Allison; ceph

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Dan van der Ster
some progress with > trimming the stray items? However, I can't do 850 restarts in this fashion, > there has to be another way. > > I would be really grateful for any help regarding getting the system in a > stable state for further troubleshooting. I would really block all cl

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
ystem and trim the stray items is dearly needed. Alternatively, is there a way to do off-line trimming? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Dan van der Ster Sent: Friday, January 10, 2025 11:32 PM To: Frank Schilder Cc: Bailey Allison; cep

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Dan van der Ster
> Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Bailey Allison > Sent: Friday, January 10, 2025 10:23 PM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: Help needed, ceph fs down due

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Bailey Allison
+1 to this, and the doc mentioned. Just be aware depending on version the heartbeat grace parameter is different, I believe for 16 and below it's the one I mentioned, and it's to be set on the mon level, and for 17 and newer it is what Spencer mentioned. The doc he has provided also mentions s

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Spencer Macphee
mds_beacon_grace is, perhaps confusingly, not an MDS configuration. It's applied to MONs. As you've injected it into the MDS that is likely why the heartbeat is still failing: This has the effect of having the MDS continue to send beacons to the monitors even when its internal "heartbeat" mechanis
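A hedged example of where the two grace settings belong, per the distinction above (the values are purely illustrative):

    # mds_beacon_grace is evaluated by the monitors, so raise it there rather than injecting it into the MDS
    ceph config set mon mds_beacon_grace 300
    # the MDS-side internal heartbeat grace mentioned in the neighbouring message (Quincy and later)
    ceph config set mds mds_heartbeat_reset_grace 300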

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Spencer Macphee
You could try some of the steps here Frank: https://docs.ceph.com/en/quincy/cephfs/troubleshooting/#avoiding-recovery-roadblocks mds_heartbeat_reset_grace is probably the only one really relevant to your scenario. On Fri, Jan 10, 2025 at 1:30 PM Frank Schilder wrote: > Hi all, > > we seem to ha

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
oceed. Thanks so far and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Bailey Allison Sent: Friday, January 10, 2025 10:23 PM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Help needed, ceph f

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Bailey Allison
___ From: Bailey Allison Sent: Friday, January 10, 2025 10:05 PM To: ceph-users@ceph.io; Frank Schilder Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Frank, You mentioned previously a large number of strays on the mds rank. Are you able to che

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
sø Campus Bygning 109, rum S14 From: Bailey Allison Sent: Friday, January 10, 2025 10:05 PM To: ceph-users@ceph.io; Frank Schilder Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir Frank, You mentioned previously a large number of strays on the mds r

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Bailey Allison
Frank, You mentioned previously a large number of strays on the mds rank. Are you able to check the rank again to see how many strays there are again? We've previously had a similar issue, and once the MDS came back up we had to stat the filesystem to decrease the number of strays, and which

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
Hi all, I got the MDS up. However, after quite some time it's sitting with almost no CPU load: top - 21:40:02 up 2:49, 1 user, load average: 0.00, 0.02, 0.34 Tasks: 606 total, 1 running, 247 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Bailey Allison
Hi Frank, What is the state of the mds currently? We are probably at a point where we do a bit of hoping and waiting for it to come back up. Regards, Bailey Allison Service Team Lead 45Drives, Ltd. 866-594-7199 x868 On 1/10/25 15:51, Frank Schilder wrote: Hi all, I seem to have gotten the MD

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
Hi all, I seem to have gotten the MDS up to the point that it reports stats. Does this mean anything: 2025-01-10T20:50:25.256+0100 7f87ccd5f700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.00954s 2025-01-10T20:50:25.256+0100 7f87ccd5f700 0 mds.beacon.ceph-12 Skipping beacon

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Spencer Macphee
I had a similar issue some months ago that ended up using around 300 gigabytes of RAM for a similar number of strays. You can get an idea of the strays kicking around by checking the omapkeys of the stray objects in the cephfs metadata pool. Strays are tracked in objects: 600., 601.000
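A hedged way to size those stray directories, assuming the metadata pool is named cephfs_metadata and rank 0's stray fragments live in the usual objects 600.00000000 through 609.00000000:

    # count the dentries in each of the ten stray directories of rank 0
    for i in $(seq 0 9); do
        obj="60${i}.00000000"
        echo -n "$obj: "
        rados -p cephfs_metadata listomapkeys "$obj" | wc -l
    done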

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Frank Schilder
Hi Patrick and others, thanks for your fast reply. The problem we are in comes from forward scrub ballooning and the memory overuse did not go away even after aborting the scrub. The "official" way to evaluate strays I got from Neha was to restart the rank. I did not expect that the MDS needs

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Bailey Allison
Hi Frank, Are you able to share any logs from the mds that's crashing? And just to confirm, the rank goes into up:active before eventually OOMing? This sounds familiar-ish but I'm also recovering after a nearly 24-hour bender of another ceph-related recovery. Trying to rack my brain for simil

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-10 Thread Patrick Donnelly
Hi Frank, On Fri, Jan 10, 2025 at 12:31 PM Frank Schilder wrote: > > Hi all, > > we seem to have a serious issue with our file system, ceph version is pacific > latest. After a large cleanup operation we had an MDS rank with 100Mio stray > entries (yes, one hundred million). Today we restarted

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-17 Thread Stefan Kooman
On 17-10-2024 15:16, Nico Schottelius wrote: Stefan Kooman writes: On 16-10-2024 03:02, Harry G Coin wrote: Thanks for the notion!  I did that, the result was no change to the problem, but with the added ceph -s complaint "Public/cluster network defined, but can not be found on any host"  --

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-17 Thread Nico Schottelius
Stefan Kooman writes: > On 16-10-2024 03:02, Harry G Coin wrote: >> Thanks for the notion!  I did that, the result was no change to the >> problem, but with the added ceph -s complaint "Public/cluster >> network defined, but can not be found on any host"  -- with >> otherwise totally normal clust

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-17 Thread Stefan Kooman
On 16-10-2024 03:02, Harry G Coin wrote: Thanks for the notion!  I did that, the result was no change to the problem, but with the added ceph -s complaint "Public/cluster network defined, but can not be found on any host"  -- with otherwise totally normal cluster operations.  Go figure.  How ca

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-16 Thread Harry G Coin
Hi Frédéric, All was normal in v18; after 19.2 the problem remains even though the addresses are different: cluster_network global: fc00:1000:0:b00::/64 public_network global: fc00:1002:c7::/64 Also, after rebooting everything in sequence, it only complains about the 27 osds that are both up,

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-16 Thread Frédéric Nass
Hi Harry, Do you have a 'cluster_network' set to the same subnet as the 'public_network' like in the issue [1]? It doesn't make much sense to set up a cluster_network when it's not different from the public_network. Maybe that's what triggers the OSD_UNREACHABLE recently coded here [2] (even thoug

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-15 Thread Harry G Coin
Thanks for the notion!  I did that, the result was no change to the problem, but with the added ceph -s complaint "Public/cluster network defined, but can not be found on any host"  -- with otherwise totally normal cluster operations.  Go figure.  How can ceph -s be so totally wrong, the dashbo

[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-14 Thread Anthony D'Atri
Try failing over to a standby mgr > On Oct 14, 2024, at 9:33 PM, Harry G Coin wrote: > > I need help to remove a useless "HEALTH ERR" in 19.2.0 on a fully dual stack > docker setup with ceph using ip v6, public and private nets separated, with a > few servers. After upgrading from an error
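For reference, the suggested failover is a single command (it assumes at least one standby mgr exists):

    # demote the active mgr; a standby takes over
    ceph mgr fail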

[ceph-users] Re: Help with cephadm bootstrap and ssh private key location

2024-09-23 Thread Adam King
Cybersecurity and Information Assurance > 4 Brindabella Cct > Brindabella Business Park > Canberra Airport, ACT 2609 > > www.raytheonaustralia.com.au > LinkedIn | Twitter | Facebook | Instagram > > -Original Message- > From: Adam King > Sent: Monday, September 23, 202

[ceph-users] Re: Help with cephadm bootstrap and ssh private key location

2024-09-22 Thread Kozakis, Anestis
nce 4 Brindabella Cct Brindabella Business Park Canberra Airport, ACT 2609 www.raytheonaustralia.com.au LinkedIn | Twitter | Facebook | Instagram -Original Message- From: Adam King Sent: Monday, September 23, 2024 8:36 AM To: Kozakis, Anestis Cc: ceph-users Subject: [External] [ceph-user

[ceph-users] Re: Help with cephadm bootstrap and ssh private key location

2024-09-22 Thread Adam King
Cephadm stores the key internally within the cluster and it can be grabbed with `ceph config-key get mgr/cephadm/ssh_identity_key`. If you already have keys set up, I'd recommend passing the filepaths of those keys to the `--ssh-private-key` and `--ssh-public-key` flags the bootstrap command has
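A short sketch of both options, assuming the public key is stored under mgr/cephadm/ssh_identity_pub and using placeholder file paths:

    # pull the SSH key pair cephadm generated and stored in the cluster
    ceph config-key get mgr/cephadm/ssh_identity_key > cephadm_id
    ceph config-key get mgr/cephadm/ssh_identity_pub > cephadm_id.pub

    # or bootstrap with a pre-existing key pair instead
    cephadm bootstrap --mon-ip <mon-ip> \
        --ssh-private-key /root/.ssh/id_ed25519 \
        --ssh-public-key /root/.ssh/id_ed25519.pub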

[ceph-users] Re: Help with osd spec needed

2024-08-02 Thread Eugen Block
Hi, if you assigned the SSD to be for block.db it won't be available from the orchestrator's point of view as a data device. What you could try is to manually create a partition or LV on the remaining SSD space and then point the service spec to that partition/LV via path spec. I haven't
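A hedged sketch of that approach, with made-up VG, LV and host names:

    # carve an LV out of the free space on the shared SSD
    lvcreate -L 500G -n osd-block-extra vg_ssd

    # point an OSD service spec at it explicitly via a path
    cat > osd-ssd-extra.yaml <<'EOF'
    service_type: osd
    service_id: ssd-extra
    placement:
      hosts:
        - ceph-node1
    spec:
      data_devices:
        paths:
          - /dev/vg_ssd/osd-block-extra
    EOF
    ceph orch apply -i osd-ssd-extra.yaml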

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-07-14 Thread Olli Rajala
Hi, I believe our KL studio has hit this same bug after deleting a pool that was used only for testing. So, is there any procedure to get rid of those bad journal events and get the mds back to rw state? Thanks, --- Olli Rajala - Lead TD Anima Vitae Ltd. www.anima.fi -

[ceph-users] Re: Help with Mirroring

2024-07-12 Thread Anthony D'Atri
> Hi, > > just one question coming to mind, if you intend to migrate the images > separately, is it really necessary to set up mirroring? You could just 'rbd > export' on the source cluster and 'rbd import' on the destination cluster. That can be slower if using a pipe, and require staging sp

[ceph-users] Re: Help with Mirroring

2024-07-12 Thread Frédéric Nass
- On 11 Jul 24, at 20:50, Dave Hall kdh...@binghamton.edu wrote: > Hello. > > I would like to use mirroring to facilitate migrating from an existing > Nautilus cluster to a new cluster running Reef. Right now I'm looking at > RBD mirroring. I have studied the RBD Mirroring section of th

[ceph-users] Re: Help with Mirroring

2024-07-11 Thread Eugen Block
Hi, just one question coming to mind: if you intend to migrate the images separately, is it really necessary to set up mirroring? You could just 'rbd export' on the source cluster and 'rbd import' on the destination cluster. Quoting Anthony D'Atri: I would like to use mirroring to
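A hedged one-image example of that approach, assuming separate conf files for the two clusters (the neighbouring reply notes that piping can be slower than staging a file):

    # stream an image from the source cluster straight into the destination cluster
    rbd -c /etc/ceph/src.conf export rbd/vm-disk-1 - \
        | rbd -c /etc/ceph/dst.conf import - rbd/vm-disk-1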

[ceph-users] Re: Help with Mirroring

2024-07-11 Thread Anthony D'Atri
> > I would like to use mirroring to facilitate migrating from an existing > Nautilus cluster to a new cluster running Reef. Right now I'm looking at > RBD mirroring. I have studied the RBD Mirroring section of the > documentation, but it is unclear to me which commands need to be issued on > ea

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-06-04 Thread nbarbier
First, thanks Xiubo for your feedback ! To go further on the points raised by Sake: - How does this happen ? -> There were no preliminary signs before the incident - Is this avoidable? -> Good question, I'd also like to know how! - How to fix the issue ? -> So far, no fix nor workaround from w

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-06-04 Thread Sake Ceph
Hi Xiubo, Thank you for the explanation! This won't be an issue for us, but it made me think twice :) Kind regards, Sake > On 04-06-2024 12:30 CEST, Xiubo Li wrote: > > > On 6/4/24 15:20, Sake Ceph wrote: > > Hi, > > > > A little break into this thread, but I have some questions: > > * How d

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-06-04 Thread Xiubo Li
On 6/4/24 15:20, Sake Ceph wrote: Hi, A little break into this thread, but I have some questions: * How does this happen, that the filesystem gets into read-only mode? For a detailed explanation you can refer to the ceph PR: https://github.com/ceph/ceph/pull/55421. * Is this avoidable? * How-

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-06-04 Thread Sake Ceph
Hi, A little break into this thread, but I have some questions: * How does this happen, that the filesystem gets into read-only mode? * Is this avoidable? * How to fix the issue, because I didn't see a workaround in the mentioned tracker (or I missed it) * With this bug around, should you use c

[ceph-users] Re: Help needed please ! Filesystem became read-only !

2024-06-03 Thread Xiubo Li
Hi Nicolas, This is a known issue and Venky is working on it, please see https://tracker.ceph.com/issues/63259. Thanks - Xiubo On 6/3/24 20:04, nbarb...@deltaonline.net wrote: Hello, First of all, thanks for reading my message. I set up a Ceph version 18.2.2 cluster with 4 nodes, everythin

[ceph-users] Re: Help needed! First MDs crashing, then MONs. How to recover ?

2024-05-30 Thread Patrick Donnelly
On Tue, May 28, 2024 at 8:54 AM Noe P. wrote: > > Hi, > > we ran into a bigger problem today with our ceph cluster (Quincy, > Alma8.9). > We have 4 filesystems and a total of 6 MDs, the largest fs having > two ranks assigned (i.e. one standby). > > Since we often have the problem of MDs lagging be

[ceph-users] Re: Help with deep scrub warnings

2024-05-23 Thread Sascha Lucas
Hi, just for the archives: On Tue, 5 Mar 2024, Anthony D'Atri wrote: * Try applying the settings to global so that mons/mgrs get them. Setting osd_deep_scrub_interval at the global level instead of at osd immediately turns health to OK and removes the false warning about PGs not scrubbed in time. HTH,
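The fix described above as a hedged one-liner (interval in seconds; four weeks shown as an illustrative value):

    # apply at the global level so mons/mgrs evaluate the same interval as the OSDs
    ceph config set global osd_deep_scrub_interval 2419200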

[ceph-users] Re: Help with deep scrub warnings (probably a bug ... set on pool for effect)

2024-03-05 Thread Peter Maloney
I had the same problem as you. The only solution that worked for me is to set it on the pools:     for pool in $(ceph osd pool ls); do     ceph osd pool set "$pool" scrub_max_interval "$smaxi"     ceph osd pool set "$pool" scrub_min_interval "$smini"     ceph osd pool set "$pool" d
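A hedged reconstruction of the truncated per-pool loop, with illustrative interval values in seconds:

    smini=86400       # 1 day
    smaxi=604800      # 1 week
    dsi=1209600       # 2 weeks, assumed to be the truncated third setting
    for pool in $(ceph osd pool ls); do
        ceph osd pool set "$pool" scrub_min_interval  "$smini"
        ceph osd pool set "$pool" scrub_max_interval  "$smaxi"
        ceph osd pool set "$pool" deep_scrub_interval "$dsi"
    done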

[ceph-users] Re: Help with deep scrub warnings

2024-03-05 Thread Nicola Mori
Hi Anthony, thanks for the tips. I reset all the values but osd_deep_scrub_interval to their defaults as reported at https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ : # ceph config set osd osd_scrub_sleep 0.0 # ceph config set osd osd_scrub_load_threshold 0.5 # ceph config

[ceph-users] Re: Help with deep scrub warnings

2024-03-05 Thread Anthony D'Atri
* Try applying the settings to global so that mons/mgrs get them. * Set your shallow scrub settings back to the default. Shallow scrubs take very few resources * Set your randomize_ratio back to the default, you’re just bunching them up * Set the load threshold back to the default, I can’t ima

[ceph-users] Re: Help: Balancing Ceph OSDs with different capacity

2024-02-07 Thread Jasper Tan
Hi Anthony and everyone else We have found the issue. Because the new 20x 14 TiB OSDs were onboarded onto a single node, there was not only an imbalance in the capacity of each OSD but also between the nodes (other nodes each have around 15x 1.7TiB). Furthermore, CRUSH rule sets default failure do

[ceph-users] Re: Help: Balancing Ceph OSDs with different capacity

2024-02-07 Thread Anthony D'Atri
> I have recently onboarded new OSDs into my Ceph Cluster. Previously, I had > 44 OSDs of 1.7TiB each and was using it for about a year. About 1 year ago, > we onboarded an additional 20 OSDs of 14TiB each. That's a big difference in size. I suggest increasing mon_max_pg_per_osd to 1000 --

[ceph-users] Re: Help: Balancing Ceph OSDs with different capacity

2024-02-07 Thread Dan van der Ster
Hi Jasper, I suggest to disable all the crush-compat and reweighting approaches. They rarely work out. The state of the art is: ceph balancer on ceph balancer mode upmap ceph config set mgr mgr/balancer/upmap_max_deviation 1 Cheers, Dan -- Dan van der Ster CTO Clyso GmbH p: +49 89 215252722 |
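The same three commands, one per line for readability:

    ceph balancer on
    ceph balancer mode upmap
    # allow at most 1 PG of deviation per OSD before the balancer acts
    ceph config set mgr mgr/balancer/upmap_max_deviation 1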

[ceph-users] Re: Help on rgw metrics (was rgw_user_counters_cache)

2024-01-31 Thread Casey Bodley
On Wed, Jan 31, 2024 at 3:43 AM garcetto wrote: > > good morning, > i was struggling trying to understand why i cannot find this setting on > my reef version, is it because is only on latest dev ceph version and not > before? that's right, this new feature will be part of the squid release. we

[ceph-users] Re: Help needed with Grafana password

2023-11-10 Thread Sake Ceph
Thank you Eugen! This worked :) > On 09-11-2023 14:55 CET, Eugen Block wrote: > > > It's the '#' character, everything after (including '#' itself) is cut > off. I tried with single and double quotes which also failed. But as I > already said, use a simple password and then change it with

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
It's the '#' character, everything after (including '#' itself) is cut off. I tried with single and double quotes which also failed. But as I already said, use a simple password and then change it within grafana. That way you also don't have the actual password lying around in clear text in

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
I just tried it on a 17.2.6 test cluster, although I don't have a stack trace the complicated password doesn't seem to be applied (don't know why yet). But since it's an "initial" password you can choose something simple like "admin", and during the first login you are asked to change it an

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
I tried everything at this point, even waited an hour, still no luck. Got it working once by accident, but with a placeholder for a password. Tried with the correct password, nothing, and trying again with the placeholder didn't work anymore. So I thought to switch the manager, maybe something is

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
Usually, removing the grafana service should be enough. I also have this directory (custom_config_files/grafana.) but it's empty. Can you confirm that after running 'ceph orch rm grafana' the service is actually gone ('ceph orch ls grafana')? The directory underneath /var/lib/ceph/{fsid}/gr

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Using podman version 4.4.1 on RHEL 8.8, Ceph 17.2.7. I used 'podman system prune -a -f' and 'podman volume prune -f' to clean up files, but this leaves a lot of files behind in /var/lib/containers/storage/overlay and an empty folder /var/lib/ceph//custom_config_files/grafana.. Found those files with

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Eugen Block
What doesn't work exactly? For me it did... Quoting Sake Ceph: Too bad, that doesn't work :( On 09-11-2023 09:07 CET, Sake Ceph wrote: Hi, Well, to get promtail working with Loki, you need to set up a password in Grafana. But promtail wasn't working with the 17.2.6 release, the URL was

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Too bad, that doesn't work :( > On 09-11-2023 09:07 CET, Sake Ceph wrote: > > > Hi, > > Well, to get promtail working with Loki, you need to set up a password in > Grafana. > But promtail wasn't working with the 17.2.6 release, the URL was set to > containers.local. So I stopped using it, bu

[ceph-users] Re: Help needed with Grafana password

2023-11-09 Thread Sake Ceph
Hi, Well, to get promtail working with Loki, you need to set up a password in Grafana. But promtail wasn't working with the 17.2.6 release, the URL was set to containers.local. So I stopped using it, but forgot to click on save in KeePass :( I didn't configure anything special in Grafana, the

[ceph-users] Re: Help needed with Grafana password

2023-11-08 Thread Eugen Block
Hi, you mean you forgot your password? You can remove the service with 'ceph orch rm grafana', then re-apply your grafana.yaml containing the initial password. Note that this would remove all of the grafana configs or custom dashboards etc., you would have to reconfigure them. So before do
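A hedged sketch of that remove/re-apply cycle; the spec below is minimal and assumes a simple initial password that is changed in the Grafana UI afterwards, as suggested elsewhere in this thread:

    ceph orch rm grafana
    cat > grafana.yaml <<'EOF'
    service_type: grafana
    placement:
      count: 1
    spec:
      initial_admin_password: admin
    EOF
    ceph orch apply -i grafana.yaml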

[ceph-users] Re: help, ceph fs status stuck with no response

2023-08-14 Thread Patrick Donnelly
On Tue, Aug 8, 2023 at 1:18 AM Zhang Bao wrote: > > Hi, thanks for your help. > > I am using ceph Pacific 16.2.7. > > Before my Ceph stuck at `ceph fs status fsname`, one of my cephfs became > readonly. Probably the ceph-mgr is stuck (the "volumes" plugin) somehow talking to the read-only CephFS

[ceph-users] Re: help, ceph fs status stuck with no response

2023-08-07 Thread Patrick Donnelly
On Mon, Aug 7, 2023 at 6:12 AM Zhang Bao wrote: > > Hi, > > I have a ceph stuck at `ceph --verbose stats fs fsname`. And in the > monitor log, I can find something like `audit [DBG] from='client.431973 -' > entity='client.admin' cmd=[{"prefix": "fs status", "fs": "fsname", > "target": ["mon-mg

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-30 Thread Eugen Block
I created a tracker issue, maybe that will get some attention: https://tracker.ceph.com/issues/61861 Quoting Michel Jouvin: Hi Eugen, Thank you very much for these detailed tests that match what I observed and reported earlier. I'm happy to see that we have the same understanding of ho

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block
Hi, adding the dev mailing list, hopefully someone there can chime in. But apparently the LRC code hasn't been maintained for a few years (https://github.com/ceph/ceph/tree/main/src/erasure-code/lrc). Let's see... Quoting Michel Jouvin: Hi Eugen, Thank you very much for these detaile

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Michel Jouvin
Hi Eugen, Thank you very much for these detailed tests that match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block
Hi, I have a real hardware cluster for testing available now. I'm not sure whether I'm completely misunderstanding how it's supposed to work or if it's a bug in the LRC plugin. This cluster has 18 HDD nodes available across 3 rooms (or DCs), I intend to use 15 nodes to be able to recover if o
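For readers unfamiliar with the plugin under discussion, a hedged example of an LRC profile spread over rooms; the parameters are illustrative and not the poster's actual profile:

    # k data chunks, m coding chunks, one extra local parity chunk per l chunks
    ceph osd erasure-code-profile set lrc_example \
        plugin=lrc k=4 m=2 l=3 \
        crush-locality=room crush-failure-domain=host
    ceph osd erasure-code-profile get lrc_example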

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-26 Thread Michel Jouvin
Hi, I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are inte

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-25 Thread Justin Li
Hi Patrick, The disaster recovery process with cephfs-data-scan tool didn't fix our MDS issue. It still kept crashing. I've uploaded a detailed MDS log with below ID. The restore procedure below didn't get it working either. Should I set mds_go_bad_corrupt_dentry to false alongside with mds_ab

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-24 Thread Justin Li
Hi Patrick, Thanks for the instructions. We started the MDS recovery scan with the commands below, following the link below. The first phase, scan_extents, has finished and we're waiting on scan_inodes. We probably shouldn't interrupt the process. If this procedure fails, I'll follow your steps and let
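For context, a hedged outline of the data-scan phases the message refers to (pool name is a placeholder; the disaster-recovery documentation has the full surrounding procedure, run with the MDS stopped):

    cephfs-data-scan scan_extents cephfs_data
    cephfs-data-scan scan_inodes cephfs_data
    cephfs-data-scan scan_links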

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-24 Thread Patrick Donnelly
Hello Justin, Please do: ceph config set mds debug_mds 20 ceph config set mds debug_ms 1 Then wait for a crash. Please upload the log. To restore your file system: ceph config set mds mds_abort_on_newly_corrupt_dentry false Let the MDS purge the strays and then try: ceph config set mds mds_a
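Laid out one per line, the toggles from this message (the final command is truncated in the preview and is not reproduced here):

    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1
    # after capturing a crash log, allow the MDS to start despite the corrupt dentry
    ceph config set mds mds_abort_on_newly_corrupt_dentry false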

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Justin Li
Hi Patrick, Sorry to keep bothering you, but I found that the MDS service kept crashing even though the cluster shows the MDS is up. I attached another log of the MDS server eowyn below. Looking forward to more insights. Thanks a lot. https://drive.google.com/file/d/1nD_Ks7fNGQp0GE5Q_x8M57HldYurPhuN/view

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Justin Li
Sorry Patrick, last email was restricted as attachment size. I attached a link for you to download the log. Thanks. https://drive.google.com/drive/folders/1bV_X7vyma_-gTfLrPnEV27QzsdmgyK4g?usp=sharing Justin Li Senior Technical Officer School of Information Technology Faculty of Science, Enginee

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Justin Li
Thanks Patrick. We're making progress! After issuing the ceph config command you gave me, the cluster health shows HEALTH_WARN and the mds is back up. However, cephfs can't be mounted and shows the error below. The Ceph mgr portal also shows a 500 internal error when I try to browse the cephfs folder. I'll be u

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Patrick Donnelly
Hello Justin, On Tue, May 23, 2023 at 4:55 PM Justin Li wrote: > > Dear All, > > After a unsuccessful upgrade to pacific, MDS were offline and could not get > back on. Checked the MDS log and found below. See cluster info from below as > well. Appreciate it if anyone can point me to the right d

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Justin Li
Thanks for replying, Greg. I'll give you the detailed sequence of what I did for the upgrade below. Step 1: upgrade ceph mgr and Monitor --- reboot. Then mgr and mon are all up and running. Step 2: upgrade one OSD node --- reboot and OSDs are all up. Step 3: upgrade a second OSD node named OSD-node2. I did

[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Gregory Farnum
On Tue, May 23, 2023 at 1:55 PM Justin Li wrote: > > Dear All, > > After a unsuccessful upgrade to pacific, MDS were offline and could not get > back on. Checked the MDS log and found below. See cluster info from below as > well. Appreciate it if anyone can point me to the right direction. Thank

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-21 Thread Michel Jouvin
Hi Eugen, My LRC pool is also somewhat experimental so nothing really urgent. If you manage to do some tests that help me to understand the problem I remain interested. I propose to keep this thread for that. Zitat, I shared my crush map in the email you answered if the attachment was not su

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-18 Thread Eugen Block
Hi, I don’t have a good explanation for this yet, but I’ll soon get the opportunity to play around with a decommissioned cluster. I’ll try to get a better understanding of the LRC plugin, but it might take some time, especially since my vacation is coming up. :-) I have some thoughts about th

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-17 Thread Curt
Hi, I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-16 Thread Michel Jouvin
Hi Eugen, Yes, sure, no problem to share it. I attach it to this email (as it may clutter the discussion if inline). If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong! Cheers, Michel On 04/05/2023 at 15:07, Eugen Block wrote

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Frank Schilder
Subject: [ceph-users] Re: Help needed to configure erasure coding LRC plugin Hi, I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread or nobody reading this uses the LRC plugin. ;-) Thanks, Eugen Quoting Michel Jouvin: > Hi,

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Eugen Block
Hi, I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread or nobody reading this uses the LRC plugin. ;-) Thanks, Eugen Quoting Michel Jouvin: Hi, I had to restart one of my OSD servers today and the problem showed up again

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Michel Jouvin
Hi, I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture "ceph health detail" output showing the problem with the 2 PGs: [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down     pg 56.1 is down, acting [208,65,73,
