[ceph-users] Re: [EXTERN] Urgent help with degraded filesystem needed

2024-07-02 Thread Venky Shankar
Hi Stefan,

On Mon, Jul 1, 2024 at 2:30 PM Stefan Kooman  wrote:
>
> Hi Dietmar,
>
> On 29-06-2024 10:50, Dietmar Rieder wrote:
> > Hi all,
> >
> > finally we were able to repair the filesystem and it seems that we did
> > not lose any data. Thanks for all suggestions and comments.
> >
> > Here is a short summary of our journey:
>
> Thanks for writing this up. This might be useful for someone in the future.
>
> --- snip ---
>
> > X. Conclusion:
> >
> > If we had been aware of the bug and its mitigation, we would have
> > saved a lot of downtime and some nerves.
> >
> > Is there an obvious place that I missed where such known issues are
> > prominently made public? (The bug tracker maybe, but I think it is easy
> > to miss the important ones among all the others.)
>
>
> Not that I know of. But changes in behavior of Ceph (daemons) and/or
> Ceph kernels would indeed be good to know about. I follow the
> ceph-kernel mailing list to see what is going on with the development of
> kernel CephFS. And there is a thread about reverting the PR that Enrico
> linked to [1]; here is the last mail in that thread, from Venky to Ilya [2]:
>
> "Hi Ilya,
>
> After some digging and talking to Jeff, I figured that it's possible
> to disable async dirops from the mds side by setting
> `mds_client_delegate_inos_pct` config to 0:
>
>  - name: mds_client_delegate_inos_pct
>type: uint
>level: advanced
>desc: percentage of preallocated inos to delegate to client
>default: 50
>services:
>- mds
>
> So, I guess this patch is really not required. We can suggest this
> config update to users and document it for now. We lack tests with
> this config disabled, so I'll be adding the same before recommending
> it out. Will keep you posted."
>
> However, I have not seen any update after this. So apparently it is
> possible to disable this preallocation behavior globally by disabling it
> on the MDS. But there are (were) no MDS tests with this option disabled
> (I guess a percentage of "0" would disable it). So I'm not sure it's
> safe to disable it, or what would happen if you disable this on the MDS
> while there are clients actually using preallocated inodes. I have added
> Venky in CC, so I hope he can give us an update about the recommended
> way(s) of disabling preallocated inodes.

It's safe to disable preallocation by setting
`mds_client_delegate_inos_pct = 0'. Once disabled, the MDS will not
delegate (preallocated) inode ranges to clients, effectively disabling
async dirops. We have seen users running with this config (and using
the `wsync' mount option in the kernel driver - although setting both
isn't really required IMO, Xiubo?) and reporting stable file system
operations.
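
For anyone wanting to try it, this is roughly what that looks like (just a
sketch; the mount point and client name below are placeholders):

  # disable delegation of preallocated inode ranges on all MDS daemons
  ceph config set mds mds_client_delegate_inos_pct 0

  # optionally also mount the kernel client with synchronous dirops
  mount -t ceph :/ /mnt/cephfs -o name=admin,wsync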

As far as tests are concerned, the way forward is to have a shot at
reproducing this in our test lab. We have a tracker for this:

https://tracker.ceph.com/issues/66250

It's likely that a combination of having the MDS preallocate inodes
and delegate a percentage of those frequently to clients is causing
this bug. Furthermore, an enhancement is proposed (for the shorter
term) to not crash the MDS, but to blocklist the client that's holding
the "problematic" preallocated inode range. That way, the file system
isn't totally unavailable when such a problem occurs (the client would
have to be remounted though, but that's a lesser pain than going
through the disaster recovery steps).

HTH.

>
> Gr. Stefan
>
> [1]:
> https://github.com/gregkh/linux/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
>
> [2]:
> https://lore.kernel.org/all/20231003110556.140317-1-vshan...@redhat.com/T/
>
>
>


-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Leadership Team Meeting, 2024-07-01

2024-07-02 Thread Ernesto Puerta
Hi Cephers,

These are the topics that we discussed in our meeting:

   - [cbodley] rgw tech lead transitioning to Eric Ivancich and Adam Emerson
  - You may send your congrats to them! (offline)
   - [Zac] - last week's unfinished business -
   https://github.com/ceph/ceph/pull/58092
   - [Zac] - https://github.com/ceph/ceph/pull/58323 - "run-tox-cephadm" is
   failing -- this is beyond Zac's art. He needs help if the Reef docs are
   going to be updated.
  - Adam King is taking care of it.
   - [Zac] - CQ#5 - We stop taking CQ-related requests one week from today.
   - https://pad.ceph.com/p/ceph_quarterly_2024_07
  - Still 1 week to go
   - [Neha] 19.1.0
  - https://tracker.ceph.com/issues/66756
  - release notes PR/release highlights for announcement:
 - Neha requests leads to contribute to
https://pad.ceph.com/p/squid_release_highlights

  - [David] Last Pacific build, connected to CentOS 8 and fs upgrades
   from Pacific to Quincy/Reef.
   - [Radek] Nightly QA runs are taking longer than usual:
  - https://pulpito.ceph.com/?branch=main&suite=rados
  - [Ilya]: we didn't previously have nightly runs,
 - [Patrick] They have lower priority than integration branches.
  - "They're helpful as a baseline"... "but do they need to be run
  nightly?"
  - [yuri] Run rados main tests on weekends with higher prio (manually?)
  - [Patrick]:
  https://github.com/ceph/ceph/blob/main/qa/crontab/teuthology-cronjobs
 - 1 component / day so it's ~ weekly

Kind Regards,
Ernesto Puerta
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid 19.1.0 RC QE validation status

2024-07-02 Thread Matan Breizman
crimson-rados approved.
Failure fixes were backported to `squid` branch.

Thanks,
Matan

On Mon, Jul 1, 2024 at 5:23 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/66756#note-1
>
> Release Notes - TBD
> LRC upgrade - TBD
>
> (Reruns were not done yet.)
>
> Seeking approvals/reviews for:
>
> smoke
> rados - Radek, Laura
> rgw- Casey
> fs - Venky
> orch - Adam King
> rbd, krbd - Ilya
> quincy-x, reef-x - Laura, Neha
> powercycle - Brad
> perf-basic - Yaarit, Laura
> crimson-rados - Samuel
> ceph-volume - Guillaume
>
> Pls let me know if any tests were missed from this list.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] Urgent help with degraded filesystem needed

2024-07-02 Thread Stefan Kooman

Hi Venky,

On 02-07-2024 09:45, Venky Shankar wrote:

Hi Stefan,

On Mon, Jul 1, 2024 at 2:30 PM Stefan Kooman  wrote:


Hi Dietmar,

On 29-06-2024 10:50, Dietmar Rieder wrote:

Hi all,

finally we were able to repair the filesystem and it seems that we did
not lose any data. Thanks for all suggestions and comments.

Here is a short summary of our journey:


Thanks for writing this up. This might be useful for someone in the future.

--- snip ---


X. Conclusion:

If we had been aware of the bug and its mitigation, we would have
saved a lot of downtime and some nerves.

Is there an obvious place that I missed where such known issues are
prominently made public? (The bug tracker maybe, but I think it is easy
to miss the important ones among all the others.)



Not that I know of. But changes in behavior of Ceph (daemons) and/or
Ceph kernels would indeed be good to know about. I follow the
ceph-kernel mailing list to see what is going on with the development of
kernel CephFS. And there is a thread about reverting the PR that Enrico
linked to [1]; here is the last mail in that thread, from Venky to Ilya [2]:

"Hi Ilya,

After some digging and talking to Jeff, I figured that it's possible
to disable async dirops from the mds side by setting
`mds_client_delegate_inos_pct` config to 0:

  - name: mds_client_delegate_inos_pct
type: uint
level: advanced
desc: percentage of preallocated inos to delegate to client
default: 50
services:
- mds

So, I guess this patch is really not required. We can suggest this
config update to users and document it for now. We lack tests with
this config disabled, so I'll be adding the same before recommending
it out. Will keep you posted."

However, I have not seen any update after this. So apparently it is
possible to disable this preallocation behavior globally by disabling it
on the MDS. But there are (were) no MDS tests with this option disabled
(I guess a percentage of "0" would disable it). So I'm not sure it's
safe to disable it, or what would happen if you disable this on the MDS
while there are clients actually using preallocated inodes. I have added
Venky in CC, so I hope he can give us an update about the recommended
way(s) of disabling preallocated inodes.


It's safe to disable preallocation by setting
`mds_client_delegate_inos_pct = 0'. Once disabled, the MDS will not
delegate (preallocated) inode ranges to clients, effectively disabling
async dirops. We have seen users running with this config (and using
the `wsync' mount option in the kernel driver - although setting both
isn't really required IMO, Xiubo?) and reporting stable file system
operations.


Can this be done live, as in via `ceph config set mds
mds_client_delegate_inos_pct 0` and/or `ceph daemon mds.$hostname
config set mds_client_delegate_inos_pct 0`? Has that been tested? Or
is it safer to do this by restarting the MDS? I wonder how the MDS
handles cases where inodes are already delegated to a client and it has
to transition to full sync behavior again.
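
For reference, what I have in mind is something like the following (untested
on my side, just a sketch; the MDS name is a placeholder):

  ceph config set mds mds_client_delegate_inos_pct 0

  # check that a running MDS picked up the new value
  ceph config show mds.$hostname | grep mds_client_delegate_inos_pct
  # or via the admin socket on the MDS host
  ceph daemon mds.$hostname config get mds_client_delegate_inos_pct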




As far as tests are concerned, the way forward is to have a shot at
reproducing this in our test lab. We have a tracker for this:

 https://tracker.ceph.com/issues/66250

It's likely that a combination of having the MDS preallocate inodes
and delegate a percentage of those frequently to clients is causing
this bug. Furthermore, an enhancement is proposed (for the shorter
term) to not crash the MDS, but to blocklist the client that's holding
the "problematic" preallocated inode range. That way, the file system
isn't totally unavailable when such a problem occurs (the client would
have to be remounted though, but that's a lesser pain than going
through the disaster recovery steps).


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-07-02 Thread Yuri Weinstein
After fixing the issues identified below, we cherry-picked all PRs for 18.2.3
from this list: https://pad.ceph.com/p/release-cherry-pick-coordination.

The question to the dev leads: do you think we can proceed with the release
without rerunning suites, as they were already approved?

Please reply with your recommendations.


On Thu, Jun 6, 2024 at 4:59 PM Yuri Weinstein  wrote:

> Please see the update from Laura below for the status of this release.
>
> Dev Leads help is appreciated to expedite fixes necessary to publish it
> soon.
>
> "Hi all, we have hit another blocker with this release.
>
> Due to centos 8 stream going end of life, we only have the option of
> releasing centos 9 stream containers.
>
> However, we did not test the efficacy of centos 9 stream containers
> against the orch and upgrade suites during the initial 18.2.3 release cycle.
>
> This problem is tracked here: https://tracker.ceph.com/issues/66334
>
> What needs to happen now is:
>
> 1.  The orch team needs to fix all references to centos 8 stream in the
> orch suite
> 2.  fs, rados, etc. need to fix their respective jobs the same way in the
> upgrade suite
>
> The easiest way to tackle that is to raise a PR against main and backport
> to stable releases since this problem actually affects main and all other
> releases.
>
> Then, we will:
>
> 1.  Rerun orch and upgrade with these fixes
> 2.  Re-approve orch and upgrade
> 3.  Re-upgrade gibba and LRC
>
> Then the release will be unblocked."
>
> On Tue, Jun 4, 2024 at 3:26 PM Laura Flores  wrote:
>
>> Rados results were approved, and we successfully upgraded the gibba
>> cluster.
>>
>> Now waiting on @Dan Mick  to upgrade the LRC.
>>
>> On Thu, May 30, 2024 at 8:32 PM Yuri Weinstein 
>> wrote:
>>
>>> I reran rados on the fix https://github.com/ceph/ceph/pull/57794/commits
>>> and seeking approvals from Radek and Laura
>>>
>>> https://tracker.ceph.com/issues/65393#note-1
>>>
>>> On Tue, May 28, 2024 at 2:12 PM Yuri Weinstein 
>>> wrote:
>>> >
>>> > We have discovered some issues (#1 and #2) during the final stages of
>>> > testing that require considering a delay in this point release until
>>> > all options and risks are assessed and resolved.
>>> >
>>> > We will keep you all updated on the progress.
>>> >
>>> > Thank you for your patience!
>>> >
>>> > #1 https://tracker.ceph.com/issues/66260
>>> > #2 https://tracker.ceph.com/issues/61948#note-21
>>> >
>>> > On Wed, May 1, 2024 at 3:41 PM Yuri Weinstein 
>>> wrote:
>>> > >
>>> > > We've run into a problem during the last verification steps before
>>> > > publishing this release after upgrading the LRC to it  =>
>>> > > https://tracker.ceph.com/issues/65733
>>> > >
>>> > > After this issue is resolved, we will continue testing and publishing
>>> > > this point release.
>>> > >
>>> > > Thanks for your patience!
>>> > >
>>> > > On Thu, Apr 18, 2024 at 11:29 PM Christian Rohmann
>>> > >  wrote:
>>> > > >
>>> > > > On 18.04.24 8:13 PM, Laura Flores wrote:
>>> > > > > Thanks for bringing this to our attention. The leads have decided
>>> > > > > that since this PR hasn't been merged to main yet and isn't
>>> > > > > approved, it will not go in v18.2.3, but it will be prioritized
>>> > > > > for v18.2.4. I've already added the PR to the v18.2.4 milestone so
>>> > > > > it's sure to be picked up.
>>> > > >
>>> > > > Thanks a bunch. If you miss the train, you miss the train - fair
>>> > > > enough. Nice to know there is another one going soon and that bug
>>> > > > is going to be on it!
>>> > > >
>>> > > >
>>> > > > Regards
>>> > > >
>>> > > > Christian
>>> > > > ___
>>> > > > ceph-users mailing list -- ceph-users@ceph.io
>>> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
>>> > > >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
>>
>> --
>>
>> Laura Flores
>>
>> She/Her/Hers
>>
>> Software Engineer, Ceph Storage 
>>
>> Chicago, IL
>>
>> lflo...@ibm.com | lflo...@redhat.com 
>> M: +17087388804
>>
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2024-07-02 Thread Olli Rajala
Hi - mostly as a note to future me and if anyone else looking for the same
issue...

I finally solved this a couple of months ago. No idea what is wrong with
Ceph, but the root cause that was triggering this MDS issue was that I had
several workstations and a couple of servers where the updatedb of "locate"
was getting run by daily cron at exactly the same time every night, causing
a high momentary strain on the MDS, which then somehow screwed up the
metadata caching and flushing, creating this cumulative write I/O.

The thing to note here is that there's a difference between the "locate" and
"mlocate" packages. The default config (on Ubuntu at least) of updatedb for
"mlocate" does skip scanning cephfs filesystems, but not so for "locate",
which happily ventures onto all of your cephfs mounts :|
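
In case it helps anyone hitting the same thing: with mlocate/plocate the
cephfs skipping comes from the prune lists in /etc/updatedb.conf, so making
sure the ceph filesystem types (and/or the mount paths) are listed there
should be enough. A sketch only - check the exact fs type strings in
/proc/mounts on your clients, and note that the findutils "locate" package
configures its updatedb differently:

  # /etc/updatedb.conf (mlocate/plocate style)
  PRUNEFS="NFS nfs nfs4 ceph fuse.ceph-fuse"
  PRUNEPATHS="/tmp /var/spool /mnt/cephfs"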

---
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---


On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala  wrote:

> Hi,
>
> One thing I now noticed in the mds logs is that there's a ton of entries
> like this:
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> 694=484+210)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
> [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> 695=484+211)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> 694=484+210)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
> [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> 695=484+211)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> 694=484+210)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
> [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> 695=484+211)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> 694=484+210)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> 695=484+211)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> 694=484+210)
> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> 695=484+211)
>
> ...and after dropping the caches considerably less of those - normal,
> abnormal, typical, atypical? ...or is that just something that starts
> happening after the cache gets filled?
>
> Tnx,
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
>
>
> On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala  wrote:
>
>> Hi,
>>
>> I'm still totally lost with this issue. And now lately I've had a couple
>> of incidents where the write bw has suddenly jumped to even crazier levels.
>> See the graph here:
>> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
>>
>> The points where it drops to something manageable again are when I have
>> dropped the mds caches. Usually after the drop there is steady rise but now
>> these sudden jumps are something new and even more scary :E
>>
>> Here's a fresh 2sec level 20 mds log:
>> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
>>
>> Any help and ideas greatly appreciated. Is there any tool or procedure to
>> safely check or rebuild the mds data? ...if this behaviour could be caused
>> by some hidden issue with the data itself.
>>
>> Tnx,
>> ---
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---
>>
>>
>> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar 
>> wrote:
>>
>>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala 
>>> wrote:
>>> >
>>> > Hi Venky,
>>> >
>>> > I have indeed observed the output of the different sections of perf
>>> dump like so:
>>> > watch -n 1 ceph tell mds.`hostname` perf dump objecter
>>> > watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
>>> > ...etc...
>>> >
>>> > ...but without any proper understanding of what is a normal rate for
>>> some number to go up it's really difficult to make anything from that.
>>> >
>>> > btw - is there some convenient way to capture this kind of temporal
>>> output for others to view. Sure, I could just dump once a second to a file
>>> or sequential files but is there some tool or convention that is easy to
>>> look at and analyze?
>>>
>>> Not really - you'd have to do it yourself.
>>>
>>> >
>>> > Tnx,
>>> > ---
>>> > Olli Rajala - Lead TD
>>> > Anima Vitae Ltd.
>>> > www.anima.fi
>>> > ---
>>> >
>>> >
>>> > On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar 
>>> wrote:
>>> 

[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2024-07-02 Thread Anthony D'Atri
This was common in the NFS days, and some Linux distributions deliberately
slewed the execution time. find over an NFS mount was a sure-fire way to horque the
server. (e.g. Convex C1)
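
If you do keep it around, slewing the nightly run per host is easy enough -
something along these lines (a sketch, not what any distro ships; adjust the
window and path to taste):

  # /etc/cron.d/updatedb - delay the run by a random amount up to an hour
  30 2 * * * root sleep $(shuf -i 0-3600 -n 1) && /usr/bin/updatedb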

IMHO since the tool relies on a static index it isn't very useful, and I 
routinely remove any variant from my systems.

ymmv

> On Jul 2, 2024, at 10:20, Olli Rajala  wrote:
> 
> Hi - mostly as a note to future me and if anyone else looking for the same
> issue...
> 
> I finally solved this a couple of months ago. No idea what is wrong with
> Ceph, but the root cause that was triggering this MDS issue was that I had
> several workstations and a couple of servers where the updatedb of "locate"
> was getting run by daily cron at exactly the same time every night, causing
> a high momentary strain on the MDS, which then somehow screwed up the
> metadata caching and flushing, creating this cumulative write I/O.
> 
> The thing to note here is that there's a difference between the "locate" and
> "mlocate" packages. The default config (on Ubuntu at least) of updatedb for
> "mlocate" does skip scanning cephfs filesystems, but not so for "locate",
> which happily ventures onto all of your cephfs mounts :|
> 
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
> 
> 
> On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala  wrote:
> 
>> Hi,
>> 
>> One thing I now noticed in the mds logs is that there's a ton of entries
>> like this:
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 
>> ...and after dropping the caches considerably less of those - normal,
>> abnormal, typical, atypical? ...or is that just something that starts
>> happening after the cache gets filled?
>> 
>> Tnx,
>> ---
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---
>> 
>> 
>> On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala  wrote:
>> 
>>> Hi,
>>> 
>>> I'm still totally lost with this issue. And now lately I've had a couple
>>> of incidents where the write bw has suddenly jumped to even crazier levels.
>>> See the graph here:
>>> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
>>> 
>>> The points where it drops to something manageable again are when I have
>>> dropped the mds caches. Usually after the drop there is steady rise but now
>>> these sudden jumps are something new and even more scary :E
>>> 
>>> Here's a fresh 2sec level 20 mds log:
>>> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
>>> 
>>> Any help and ideas greatly appreciated. Is there any tool or procedure to
>>> safely check or rebuild the mds data? ...if this behaviour could be caused
>>> by some hidden issue with the data itself.
>>> 
>>> Tnx,
>>> ---
>>> Olli Rajala - Lead TD
>>> Anima Vitae Ltd.
>>> www.anima.fi
>>> ---
>>> 
>>> 
>>> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar 
>>> wrote:
>>> 
 On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala 
 wrote:
> 
> Hi Venky,
> 
> I have indeed observed the output of the different sections of perf
> dump like so:
> watch -n 1 ceph tell mds.`hostname` perf dump objecter
> watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
> ...etc...
> 
> ...but without any proper understanding of what is a normal rate for
> some number to go up it's really difficult to make anything from that.
> 
> btw - is there some convenient way to capture this kind of temporal
> output for others to view. Sure, I could just dump once a second to a file
> or sequential files but is there some tool or convention that is easy to
> look at and analyze?

[ceph-users] Ceph 16.2 vs 18.2 use case Docker/Swarm LXC

2024-07-02 Thread filip Mutterer
Would I be missing any significant features by using Ceph 16.2 instead
of 18.2 with Docker/Swarm or LXC?


I am asking because I am struggling to set it up with version 18.2 due to
confusion with the Debian 12 packages: some of them stay at version 16.2
even after adding the sources for 18.2.
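
For context, the kind of setup I had in mind looks roughly like this (just a
sketch - it assumes 18.2 packages for your Debian release actually exist on
download.ceph.com and that the release key is already installed):

  # /etc/apt/sources.list.d/ceph.list
  deb https://download.ceph.com/debian-reef/ bookworm main

  # /etc/apt/preferences.d/ceph - prefer the upstream 18.2 packages over
  # Debian 12's own 16.2 ones
  Package: *
  Pin: origin "download.ceph.com"
  Pin-Priority: 1001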

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid 19.1.0 RC QE validation status

2024-07-02 Thread Ilya Dryomov
On Mon, Jul 1, 2024 at 8:41 PM Ilya Dryomov  wrote:
>
> On Mon, Jul 1, 2024 at 4:24 PM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/66756#note-1
> >
> > Release Notes - TBD
> > LRC upgrade - TBD
> >
> > (Reruns were not done yet.)
> >
> > Seeking approvals/reviews for:
> >
> > smoke
> > rados - Radek, Laura
> > rgw- Casey
> > fs - Venky
> > orch - Adam King
> > rbd, krbd - Ilya
>
> Hi Yuri,
>
> Need reruns for rbd and krbd.
>
> After infrastructure failures are cleared in reruns, I'm prepared to
> approve as is, but here is a list of no-brainer PRs that would fix some

rbd approved.

Please do another rerun for krbd.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid 19.1.0 RC QE validation status

2024-07-02 Thread Laura Flores
The rados suite, upgrade suite, and powercycle are approved by RADOS.

Failures are summarized here:
https://tracker.ceph.com/projects/rados/wiki/SQUID#Squid-1910

@Ilya Dryomov , please see the upgrade/reef-x suite,
which had this RBD failure:

   - https://tracker.ceph.com/issues/63131 - TestMigration.Stress2: snap3,
   block 171966464~4194304 differs after migration - RBD


@Venky Shankar , please see the powercycle suite,
which had this CephFS failure:

   - https://tracker.ceph.com/issues/64572 - workunits/fsx.sh failure -
   CephFS


On Tue, Jul 2, 2024 at 1:17 PM Ilya Dryomov  wrote:

> On Mon, Jul 1, 2024 at 8:41 PM Ilya Dryomov  wrote:
> >
> > On Mon, Jul 1, 2024 at 4:24 PM Yuri Weinstein 
> wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/66756#note-1
> > >
> > > Release Notes - TBD
> > > LRC upgrade - TBD
> > >
> > > (Reruns were not done yet.)
> > >
> > > Seeking approvals/reviews for:
> > >
> > > smoke
> > > rados - Radek, Laura
> > > rgw- Casey
> > > fs - Venky
> > > orch - Adam King
> > > rbd, krbd - Ilya
> >
> > Hi Yuri,
> >
> > Need reruns for rbd and krbd.
> >
> > After infrastructure failures are cleared in reruns, I'm prepared to
> > approve as is, but here is a list of no-brainer PRs that would fix some
>
> rbd approved.
>
> Please do another rerun for krbd.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid 19.1.0 RC QE validation status

2024-07-02 Thread Ilya Dryomov
On Tue, Jul 2, 2024 at 9:13 PM Laura Flores  wrote:

> The rados suite, upgrade suite, and powercycle are approved by RADOS.
>
> Failures are summarized here:
> https://tracker.ceph.com/projects/rados/wiki/SQUID#Squid-1910
>
> @Ilya Dryomov , please see the upgrade/reef-x suite,
> which had this RBD failure:
>
>- https://tracker.ceph.com/issues/63131 - TestMigration.Stress2:
>snap3, block 171966464~4194304 differs after migration - RBD
>
>
This is known, it won't be a blocker.

Thanks,

Ilya


> @Venky Shankar , please see the powercycle suite,
> which had this CephFS failure:
>
>- https://tracker.ceph.com/issues/64572 - workunits/fsx.sh failure -
>CephFS
>
>
> On Tue, Jul 2, 2024 at 1:17 PM Ilya Dryomov  wrote:
>
>> On Mon, Jul 1, 2024 at 8:41 PM Ilya Dryomov  wrote:
>> >
>> > On Mon, Jul 1, 2024 at 4:24 PM Yuri Weinstein 
>> wrote:
>> > >
>> > > Details of this release are summarized here:
>> > >
>> > > https://tracker.ceph.com/issues/66756#note-1
>> > >
>> > > Release Notes - TBD
>> > > LRC upgrade - TBD
>> > >
>> > > (Reruns were not done yet.)
>> > >
>> > > Seeking approvals/reviews for:
>> > >
>> > > smoke
>> > > rados - Radek, Laura
>> > > rgw- Casey
>> > > fs - Venky
>> > > orch - Adam King
>> > > rbd, krbd - Ilya
>> >
>> > Hi Yuri,
>> >
>> > Need reruns for rbd and krbd.
>> >
>> > After infrastructure failures are cleared in reruns, I'm prepared to
>> > approve as is, but here is a list of no-brainer PRs that would fix some
>>
>> rbd approved.
>>
>> Please do another rerun for krbd.
>>
>> Thanks,
>>
>> Ilya
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage 
>
> Chicago, IL
>
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io