[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi, It's trivial to reproduce. Running 16.2.9 with max_mds=2, take a pool snapshot of the meta pool, then decrease to max_mds=1, then deep scrub each meta pg. In my test I could list and remove the pool snap, then deep-scrub again cleared the inconsistencies. https://tracker.ceph.com/issues/5638
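Dan's reproduction can be sketched roughly as follows. This is a hedged sketch, not his exact commands: the filesystem name `cephfs`, pool name `cephfs_metadata`, and snapshot name `snap1` are assumptions, and it requires `jq` for the PG listing.

```shell
# Sketch of the reproduction on a 16.2.9 test cluster.
# Assumes a CephFS named "cephfs" with metadata pool "cephfs_metadata".
ceph fs set cephfs max_mds 2                  # run with two active MDS ranks
ceph osd pool mksnap cephfs_metadata snap1    # take a pool snapshot of the meta pool
ceph fs set cephfs max_mds 1                  # decrease back to a single rank
# Deep-scrub every PG of the metadata pool; they should turn up inconsistent.
for pg in $(ceph pg ls-by-pool cephfs_metadata -f json | jq -r '.pg_stats[].pgid'); do
    ceph pg deep-scrub "$pg"
done
```

These commands need a live (test) cluster; do not run the max_mds changes on a production filesystem.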

[ceph-users] Re: cephadm permission denied when extending cluster

2022-06-24 Thread Robert Reihs
Hi, I tested with the 17.2.1 release with a non root ssh user and it worked fine. Best Robert On Thu, Jun 23, 2022 at 9:12 PM Robert Reihs wrote: > Thanks for the help, It was the same issue. > > Best > Robert > > On Thu, Jun 23, 2022 at 8:37 PM Robert Reihs > wrote: > >> Hi Adam, >> >> Yes loo

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Pascal Ehlert
Hi Dan, Thank you so much for going through the effort of reproducing this! I was just about to plan how to bring up a test cluster but it would've taken me much longer. While I totally assume this is the root cause for our issues, there is one small difference. rados lssnap does not list an

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi Pascal, I'm not sure why you don't see that snap, and I'm also not sure if you can just delete the objects directly. BTW, does your CephFS have snapshots itself (e.g. create via mkdir .snap/foobar)? Cheers, Dan On Fri, Jun 24, 2022 at 10:34 AM Pascal Ehlert wrote: > > Hi Dan, > > Thank you s

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Pascal Ehlert
Hi Dan, Just a quick addition here: I have not used the rados command to create the snapshot but "ceph osd pool mksnap $POOL $SNAPNAME" - which I think is the same internally? And yes, our CephFS has numerous snapshots itself for backup purposes. Cheers, Pascal Dan van der Ster wrote on

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific [EXT]

2022-06-24 Thread Dave Holland
Hi, I can't comment on the CephFS side but "Too many repaired reads on 2 OSDs" makes me suggest you check the hardware -- when I've seen that recently it was due to failing HDDs. I say "failing" not "failed" because the disks were giving errors on a few sectors but most I/O was working OK, so neit

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific [EXT]

2022-06-24 Thread Pascal Ehlert
Hi Dave, We have checked the hardware and it seems fine. The same OSDs host numerous other PGs which are unaffected by this issue. All of the OSDs reported as inconsistent/repair_failed belong to the same metadata pool. We did run a `ceph repair` initially, which is when the "too many rep

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi, From what I can tell, the ceph osd pool command is indeed the same as rados mksnap. But bizarrely I just created a new snapshot, changed max_mds, then removed the snap -- this time I can't manage to "fix" the inconsistency. It may be that my first test was so simple (no client IO, no fs snap
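The cleanup that worked in Dan's first test (listing the pool snap, removing it, then re-scrubbing) can be sketched like this; the pool and snapshot names are assumptions, and `<pgid>` is a placeholder for each inconsistent PG:

```shell
# List pool snapshots on the metadata pool, remove the offending one,
# then deep-scrub the inconsistent PGs again.
rados -p cephfs_metadata lssnap
rados -p cephfs_metadata rmsnap snap1
ceph pg deep-scrub <pgid>
```

Note that per the thread this only cleared the inconsistencies in the first, simpler test; with client IO and fs snapshots in play it did not.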

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Pascal Ehlert
Thanks again to everyone and you in particular, Dan. I'll follow the tracker then before trying anything else! Cheers, Pascal Dan van der Ster wrote on 24.06.22 11:41: Hi, From what I can tell, the ceph osd pool command is indeed the same as rados mksnap. But bizarrely I just created a new

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Yaarit Hatuka
Hi Matthew, Thanks for your update. How big is the cluster? Thanks for opting-in to telemetry! Yaarit On Thu, Jun 23, 2022 at 11:53 PM Matthew Darwin wrote: > Sorry. Eventually it goes away. Just slower than I was expecting. > > On 2022-06-23 23:42, Matthew Darwin wrote: > > > > I just updat

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Matthew Darwin
Thanks Yaarit, The cluster I was using is just a test cluster with a few OSDs and almost no data. Not sure why I have to re-opt in after upgrading from 17.2.0 to 17.2.1 On 2022-06-24 09:41, Yaarit Hatuka wrote: Hi Matthew, Thanks for your update. How big is the cluster? Thanks for opting-in to te

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Yaarit Hatuka
We added a new collection in 17.2.1 to indicate Rook deployments, since we want to understand their volume in the wild, which is why the module asks for re-opt-in. On Fri, Jun 24, 2022 at 9:52 AM Matthew Darwin wrote: > Thanks Yaarit, > > The cluster I was using is just a test cluster with a few OSD an

[ceph-users] Re: librbd leaks memory on crushmap updates

2022-06-24 Thread Peter Lieven
Am 23.06.22 um 12:59 schrieb Ilya Dryomov: > On Thu, Jun 23, 2022 at 11:32 AM Peter Lieven wrote: >> Am 22.06.22 um 15:46 schrieb Josh Baergen: >>> Hey Peter, >>> I found relatively large allocations in the qemu smaps and checked the contents. It contained several hundred repetitions of

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Laura Flores
Hi Matthew, About how long did the warning stay up after you ran the `ceph telemetry on` command? - Laura On Fri, Jun 24, 2022 at 9:03 AM Yaarit Hatuka wrote: > We added a new collection in 17.2.1 to indicate Rook deployments, since we > want to understand its volume in the wild, thus the modu

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Matthew Darwin
Not sure.  Long enough to try the command and write this email, so at least 10 minutes. I expected it to disappear after 30 seconds or so. On 2022-06-24 10:34, Laura Flores wrote: Hi Matthew, About how long did the warning stay up after you ran the `ceph telemetry on` command? - Laura On

[ceph-users] Re: How to remove TELEMETRY_CHANGED( Telemetry requires re-opt-in) message

2022-06-24 Thread Robert Sander
Am 24.06.22 um 16:44 schrieb Matthew Darwin: Not sure.  Long enough to try the command and write this email, so at least 10 minutes. I had that too today after upgrading my test cluster. I just ran "ceph telemetry off" and "ceph telemetry on" and the message was gone. Regards -- Robert Sand
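Robert's workaround amounts to toggling the telemetry module off and back on. A sketch (re-enabling may also require accepting the data-sharing license with `--license sharing-1-0` if the module prompts for it):

```shell
# Clear the TELEMETRY_CHANGED warning by re-opting-in:
ceph telemetry off
ceph telemetry on --license sharing-1-0
ceph health detail    # the warning should no longer appear
```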

[ceph-users] Ceph recovery network speed

2022-06-24 Thread Curt
Hello, I'm trying to understand why my recovery is so slow with only 2 pg backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I have tested the speed between machines with a few tools and all confirm 10G speed. I've tried changing various settings of priority and recovery sleep
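The "various settings of priority and recovery sleep" Curt mentions are typically adjusted like the sketch below. The values are illustrative assumptions, not recommendations from the thread; defaults differ between releases (e.g. `osd_max_backfills` defaults to 1 on recent releases):

```shell
# Common knobs for speeding up backfill on HDD-backed OSDs:
ceph config set osd osd_max_backfills 4            # concurrent backfills per OSD
ceph config set osd osd_recovery_max_active_hdd 8  # concurrent recovery ops per OSD
ceph config set osd osd_recovery_sleep_hdd 0       # remove inter-op sleep on HDDs
# Watch the effect:
ceph -s
ceph pg ls backfilling
```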

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
2 PG's shouldn't take hours to backfill in my opinion. Just 2TB enterprise HD's. Take this log entry below, 72 minutes and still backfilling undersized? Should it be that slow? pg 12.15 is stuck undersized for 72m, current state active+undersized+degraded+remapped+backfilling, last acting [34,10

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
Pool 12 is my erasure coding pool, 2+2. How can I tell if it's objects or keys recovering? Thanks, Curt On Fri, Jun 24, 2022 at 9:39 PM Stefan Kooman wrote: > On 6/24/22 19:04, Curt wrote: > > 2 PG's shouldn't take hours to backfill in my opinion. Just 2TB > enterprise > > HD's. > > > > Ta

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman wrote: > On 6/24/22 19:49, Curt wrote: > > Pool 12 is my erasure coding pool, 2+2. How can I tell if it's > > objects or keys recovering? > > ceph -s will tell you what type of recovery is going on. > > Is it a cephfs metadata pool? Or a rgw ind
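Stefan's point about `ceph -s` can be illustrated as follows; the recovery line in its `io:` section distinguishes object-data recovery from omap-key recovery (the sample figures below are made up):

```shell
# The recovery line in `ceph -s` shows what is being recovered, e.g.:
#   recovery: 3.5 MiB/s, 12 objects/s     <- object data recovering
#   recovery: 1.2k keys/s, 4 objects/s    <- omap keys (RGW index / CephFS metadata)
ceph -s
```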

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
> You wrote 2TB before, are they 2TB or 18TB? Is that 273 PGs total or per osd? Sorry, 18TB of data and 273 PGs total. > `ceph osd df` will show you toward the right how many PGs are on each OSD. If you have multiple pools, some PGs will have more data than others. > So take an average # of PGs
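The numbers in this exchange give a rough feel for why a single backfilling PG takes hours. A back-of-the-envelope sketch (integer arithmetic, assuming ~18 TiB spread evenly over 273 PGs and the observed ~4 MiB/s rate; real PGs vary in size):

```shell
# Average PG size: ~18 TiB of data across 273 PGs.
total_gib=$((18 * 1024))
pgs=273
gib_per_pg=$((total_gib / pgs))
echo "${gib_per_pg} GiB per PG"
# At ~4 MiB/s, one PG needs gib_per_pg*1024/4 seconds:
secs=$((gib_per_pg * 1024 / 4))
echo "~$((secs / 3600)) hours per PG"
```

That is roughly 67 GiB per PG, i.e. on the order of four hours each at 4 MiB/s, which matches the multi-hour backfills Curt is seeing.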

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
Nope, majority of read/writes happen at night so it's doing less than 1 MiB/s client io right now, sometimes 0. On Fri, Jun 24, 2022, 22:23 Stefan Kooman wrote: > On 6/24/22 20:09, Curt wrote: > > > > > > On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman > > wrote: > > > >

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
On Sat, Jun 25, 2022 at 3:27 AM Anthony D'Atri wrote: > The pg_autoscaler aims IMHO way too low and I advise turning it off. > > > > > On Jun 24, 2022, at 11:11 AM, Curt wrote: > > > >> You wrote 2TB before, are they 2TB or 18TB? Is that 273 PGs total or > per > > osd? > > Sorry, 18TB of data a