[ceph-users] Re: Ceph Squid released?

2024-04-29 Thread Robert Sander

On 4/29/24 08:50, Alwin Antreich wrote:


well it says it in the article.

The upcoming Squid release serves as a testament to how the Ceph
project continues to deliver innovative features to users without
compromising on quality. 



I believe it is more a statement of having new members and tiers and to 
sound the marketing drums a bit. :)


The Ubuntu 24.04 release notes also claim that this release comes with 
Ceph Squid:


https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Squid released?

2024-04-29 Thread Alwin Antreich
On Mon, 29 Apr 2024 at 09:06, Robert Sander 
wrote:

> On 4/29/24 08:50, Alwin Antreich wrote:
>
> > well it says it in the article.
> >
> > The upcoming Squid release serves as a testament to how the Ceph
> > project continues to deliver innovative features to users without
> > compromising on quality.
> >
> >
> > I believe it is more a statement of having new members and tiers and to
> > sound the marketing drums a bit. :)
>
> The Ubuntu 24.04 release notes also claim that this release comes with
> Ceph Squid:
>
> https://discourse.ubuntu.com/t/noble-numbat-release-notes/39890
>
> Who knows. I don't see any packages on download.ceph.com for Squid.

Cheers,
Alwin

--
croit GmbH, Web | LinkedIn | Youtube | Twitter

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Squid released?

2024-04-29 Thread Robert Sander

On 4/29/24 09:36, Alwin Antreich wrote:

Who knows. I don't see any packages on download.ceph.com for Squid.


Ubuntu has them: https://packages.ubuntu.com/noble/ceph

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] cache pressure?

2024-04-29 Thread Dietmar Rieder
there might be situations/directories for which it makes a lot of sense 
that they are watched for changes and vscode is informed about them. So 
excluding the entire CephFS might not always be a good idea. But I guess 
you need to find out.


Dietmar


On 4/27/24 16:38, Erich Weiler wrote:
Actually, should I be excluding my whole CephFS filesystem?  Like, if I 
mount it as /cephfs, should my stanza look something like:


{
    "files.watcherExclude": {
        "**/.git/objects/**": true,
        "**/.git/subtree-cache/**": true,
        "**/node_modules/*/**": true,
        "**/.cache/**": true,
        "**/.conda/**": true,
        "**/.local/**": true,
        "**/.nextflow/**": true,
        "**/work/**": true,
        "**/cephfs/**": true
    }
}

On 4/27/24 12:24 AM, Dietmar Rieder wrote:

Hi Erich,

hope it helps. Let us know.

Dietmar


On 26 April 2024 at 15:52:06 CEST, Erich Weiler wrote:


    Hi Dietmar,

    We do in fact have a bunch of users running vscode on our HPC head
    node as well (in addition to a few of our general purpose
    interactive compute servers). I'll suggest they make the mods you
    referenced! Thanks for the tip.

    cheers,
    erich

    On 4/24/24 12:58 PM, Dietmar Rieder wrote:

    Hi Erich,

    in our case the "client failing to respond to cache pressure"
    situation is/was often caused by users who have vscode
    connecting via ssh to our HPC head node. vscode makes heavy use
    of file watchers and we have seen users with > 400k watchers.
    All these watched files must be held in the MDS cache, and if you
    have multiple users running vscode at the same time it gets
    problematic.

    Unfortunately there is no global setting - at least none that we
    are aware of - for vscode to exclude certain files or
    directories from being watched. We asked the users to configure
    their vscode (Remote Settings -> Watcher Exclude) as follows:

    {
        "files.watcherExclude": {
            "**/.git/objects/**": true,
            "**/.git/subtree-cache/**": true,
            "**/node_modules/*/**": true,
            "**/.cache/**": true,
            "**/.conda/**": true,
            "**/.local/**": true,
            "**/.nextflow/**": true,
            "**/work/**": true
        }
    }

    ~/.vscode-server/data/Machine/settings.json

    To monitor and find processes with watchers you may use
    inotify-info.
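    If inotify-info is not at hand, a rough sketch that only reads
    /proc on Linux can give a similar overview (the loop just counts
    "inotify" lines in each process' fdinfo; the output format here is
    ad hoc):

    for p in /proc/[0-9]*; do
        n=$(cat "$p"/fdinfo/* 2>/dev/null | grep -c '^inotify')
        [ "$n" -gt 0 ] && echo "$n ${p##*/} $(cat "$p"/comm 2>/dev/null)"
    done | sort -rn | head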

    HTH
   Dietmar

    On 4/23/24 15:47, Erich Weiler wrote:

    So I'm trying to figure out ways to reduce the number of
    warnings I'm getting and I'm thinking about the one "client
    failing to respond to cache pressure".

    Is there maybe a way to tell a client (or all clients) to
    reduce the amount of cache it uses or to release caches
    quickly?  Like, all the time?

    I know the linux kernel (and maybe ceph) likes to cache
    everything for a while, and rightfully so, but I suspect in
    my use case it may be more efficient to more quickly purge
    the cache or to in general just cache way less overall...?

    We have many thousands of threads all doing different things
    that are hitting our filesystem, so I suspect the caching
    isn't really doing me much good anyway due to the churn, and
    is probably causing more problems than it's helping...

    -erich



    ceph-users mailing list -- ceph-users@ceph.io
    To unsubscribe send an email to ceph-users-le...@ceph.io




    ceph-users mailing list -- ceph-users@ceph.io
    To unsubscribe send an email to ceph-users-le...@ceph.io







OpenPGP_signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-29 Thread Eugen Block

The split process completed over the weekend, and the balancer did a great job:

MIN PGs | MAX PGs | MIN USE % | MAX USE %
322     | 338     | 73.3      | 75.5

Although the number of PGs per OSD differs a bit, the usage per OSD is  
quite even (which is the more important part). The new hardware also  
arrived, so there will soon be some more remapping. :-)

So I would consider this thread as closed, all good.

Quoting Eugen Block:

No, we didn’t change much, just increased the max pg per osd to  
avoid warnings and inactive PGs in case a node would fail during  
this process. And the max backfills, of course.


Quoting Frédéric Nass:


Hello Eugen,

Thanks for sharing the good news. Did you have to raise  
mon_osd_nearfull_ratio temporarily?


Frédéric.

- On 25 Apr 24, at 12:35, Eugen Block ebl...@nde.ag wrote:


For those interested, just a short update: the split process is
approaching its end; two days ago there were around 230 PGs left
(the target is 4096 PGs). So far there have been no complaints, and no
cluster impact was reported (the cluster load is quite moderate, but
still sensitive). Every now and then a single OSD (not always the same
one) reaches the 85% nearfull ratio, but that was expected since the
first nearfull OSD was the root cause of this operation. I expect the
balancer to kick in as soon as the backfill has completed or when there
are fewer than 5% misplaced objects.

Quoting Anthony D'Atri:


One can up the ratios temporarily but it's all too easy to forget to
reduce them later, or think that it's okay to run all the time with
reduced headroom.

Until a host blows up and you don't have enough space to recover into.


On Apr 12, 2024, at 05:01, Frédéric Nass
 wrote:


Oh, and yeah, considering "The fullest OSD is already at 85% usage",
the best move for now would be to add new hardware/OSDs (to avoid
reaching the backfill-toofull limit) prior to starting the PG split,
and to enable the upmap balancer before or after the split depending
on how well (or not) the PGs got rebalanced after adding the new OSDs.

BTW, what ceph version is this? You should make sure you're running
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
https://tracker.ceph.com/issues/53729

Cheers,
Frédéric.

- On 12 Apr 24, at 10:41, Frédéric Nass
frederic.n...@univ-lorraine.fr wrote:


Hello Eugen,

Is this cluster using the WPQ or the mClock scheduler? (cephadm shell
ceph daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune the osd_recovery_sleep* values as they
do have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock, then you might want to use a specific mClock profile as
suggested by Gregory (as osd_recovery_sleep* is not considered when
using mClock).
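
For reference, a rough sketch of the commands involved (the sleep and
backfill values here are only illustrative, not recommendations):

# which scheduler is in use
cephadm shell ceph daemon osd.0 config show | grep osd_op_queue

# WPQ: keep backfills low, then lower the recovery sleeps step by step
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_sleep_hdd 0.05
ceph config set osd osd_recovery_sleep_hybrid 0.025

# mClock: pick a profile instead (osd_recovery_sleep* is ignored)
ceph config set osd osd_mclock_profile high_recovery_ops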

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and
this cluster only has 240 OSDs, increasing osd_max_backfills to any value
higher than 2-3 will not help much with the recovery/backfilling speed.

Either way, you'll have to be patient. :-)

Cheers,
Frédéric.

- On 10 Apr 24, at 12:54, Eugen Block ebl...@nde.ag wrote:


Thank you for input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then reevaluate the situation and maybe increase the backfills a bit
more.

Thanks!

Quoting Gregory Orange:


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.
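
For reference, that ratio is an mgr option, so 0.1% corresponds roughly
to something like:

ceph config set mgr target_max_misplaced_ratio 0.001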

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.
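
In command form, that interrupt step is roughly (pool name is a
placeholder):

ceph osd pool ls detail                    # note the pool's current pg_num
ceph osd pool set <pool> pg_num <current>  # freeze splitting at that value
# raise pg_num again towards the final target to resume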

The above behaviour is because, while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when the misplaced objects are over the threshold. Meanwhile, checks
for the next PG to split happen much more frequently, so the balancer
never wins that race.


We didn't know how long to expect it all to take, but decided th

[ceph-users] Re: which grafana version to use with 17.2.x ceph version

2024-04-29 Thread Eugen Block

Hi,

cephadm stores a local copy of the cephadm binary in
/var/lib/ceph/{FSID}/cephadm.{DIGEST}:


quincy-1:~ # ls -lrt /var/lib/ceph/{FSID}/cephadm.*
-rw-r--r-- 1 root root 350889 26. Okt 2023  /var/lib/ceph/{FSID}/cephadm.f6868821c084cd9740b59c7c5eb59f0dd47f6e3b1e6fecb542cb44134ace8d78
-rw-r--r-- 1 root root 364715 26. Okt 2023  /var/lib/ceph/{FSID}/cephadm.7ab03136237675497d535fb1b85d1d0f95bbe5b95f32cd4e6f3ca71a9f97bf3c
-rwxr-xr-x 1 root root 366903 29. Nov 15:34 /var/lib/ceph/{FSID}/cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b


The file with the latest timestamp usually corresponds to the currently
running Ceph version. You can inspect the file to see the (hard-coded)
default container images:


quincy-1:~ # grep -E "DEFAULT_NODE_EXPORTER_IMAGE =|DEFAULT_GRAFANA_IMAGE =|DEFAULT_PROMETHEUS_IMAGE =" /var/lib/ceph/{FSID}/cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b

DEFAULT_PROMETHEUS_IMAGE = 'quay.io/prometheus/prometheus:v2.43.0'
DEFAULT_NODE_EXPORTER_IMAGE = 'quay.io/prometheus/node-exporter:v1.5.0'
DEFAULT_GRAFANA_IMAGE = 'quay.io/ceph/ceph-grafana:9.4.7'

There are more images defined, of course.
This can be helpful if you have your own container registry and need to
know which images to mirror, or if your monitoring services are not
deployed by cephadm. If you don't override these configs:


mgr/cephadm/container_image_grafana
mgr/cephadm/container_image_node_exporter
mgr/cephadm/container_image_prometheus
... etc.

then the defaults will be used (only if you deploy monitoring via cephadm).
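
If you do need to point cephadm at different images (e.g. a local
registry mirror), a sketch of how that looks (the image names here are
just the defaults shown above):

ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:9.4.7
ceph config set mgr mgr/cephadm/container_image_prometheus quay.io/prometheus/prometheus:v2.43.0
ceph orch redeploy grafana
ceph orch redeploy prometheus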

Regards,
Eugen

Quoting Adam King:


FWIW, cephadm uses `quay.io/ceph/ceph-grafana:9.4.7` as the default grafana
image in the quincy branch

On Tue, Apr 23, 2024 at 11:59 AM Osama Elswah 
wrote:


Hi,


in quay.io I can find a lot of grafana versions for ceph
(https://quay.io/repository/ceph/grafana?tab=tags). How can I find out
which version should be used when I upgrade my cluster to 17.2.x? Can I
simply take the latest grafana version? Or is there a specific grafana
version I need to use?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Squid released?

2024-04-29 Thread Josh Durgin
On Sun, Apr 28, 2024 at 11:50 PM Alwin Antreich 
wrote:

> Hi Robert,
>
> well it says it in the article.
>
> > The upcoming Squid release serves as a testament to how the Ceph project
> > continues to deliver innovative features to users without compromising on
> > quality.
>
>
> I believe it is more a statement of having new members and tiers and to
> sound the marketing drums a bit. :)
>

Yes, this is a statement about the future release - Squid is not out yet.
We expect an RC in the next couple of weeks, and assuming that goes well,
final release in a month or so.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-mirror failed to query services: (13) Permission denied

2024-04-29 Thread Ilya Dryomov
On Tue, Apr 23, 2024 at 8:28 PM Stefan Kooman  wrote:
>
> On 23-04-2024 17:44, Ilya Dryomov wrote:
> > On Mon, Apr 22, 2024 at 7:45 PM Stefan Kooman  wrote:
> >>
> >> Hi,
> >>
> >> We are testing rbd-mirroring. There seems to be a permission error with
> >> the rbd-mirror user. Using this user to query the mirror pool status gives:
> >>
> >> failed to query services: (13) Permission denied
> >>
> >> And results in the following output:
> >>
> >> health: UNKNOWN
> >> daemon health: UNKNOWN
> >> image health: OK
> >> images: 3 total
> >>   2 replaying
> >>   1 stopped
> >>
> >> So, this command: rbd --id rbd-mirror mirror pool status rbd
> >
> > Hi Stefan,
> >
> > What is the output of "ceph auth get client.rbd-mirror"?
>
> [client.rbd-mirror]
> key = REDACTED
> caps mon = "profile rbd-mirror"
> caps osd = "profile rbd"

Hi Stefan,

I went through the git history and this appears to be expected, at
least for some definition of expected.  Commit [1] clearly recognized
the problem and made the

rbd: failed to query services: (13) Permission denied

error that you ran into with "rbd mirror pool status" non-fatal.

Also, there is a comment in the respective PR [2] acknowledging that
even

caps mgr = "profile rbd"

cap (which your client.rbd-mirror user doesn't have and rbd-mirror
daemon doesn't actually need) would NOT be sufficient to resolve the
error because "our profiles don't give the average user access to see
Ceph cluster services".

[1] 
https://github.com/ceph/ceph/pull/33219/commits/1cb9e3b56932a1b00850b9cce4c65f8681dcc3cc
[2] https://github.com/ceph/ceph/pull/33219#discussion_r378436795

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph recipe for nfs exports

2024-04-29 Thread Roberto Maggi @ Debian

Hi you all,

I did the changes suggested but the situation is still the same. I set 
the squash to "all", since I want only "nobody:nogroup" ids, but I can't 
understand where the path should point.
If I understood it well, I pass raw disks (unpartitioned and thus 
unformatted) to the OSD daemon, and then I create the NFS daemons, and 
Ceph will autonomously link the NFS shares to the filesystem managed by 
the OSDs. Is that correct?



Don't I have to create any filesystem on the osd?

By the way this is the dump

root@cephstage01:~# ceph nfs export info nfs-cephfs /mnt
{
  "access_type": "RW",
  "clients": [],
  "cluster_id": "nfs-cephfs",
  "export_id": 1,
  "fsal": {
    "fs_name": "vol1",
    "name": "CEPH",
    "user_id": "nfs.nfs-cephfs.1"
  },
  "path": "/",
  "protocols": [
    4
  ],
  "pseudo": "/mnt",
  "security_label": true,
  "squash": "all",
  "transports": [
    "TCP"
  ]
}

I can mount it correctly, but when I try to write or touch any file in 
it, I get "Permission denied":


❯ sudo mount -t nfs -o nfsvers=4.1,proto=tcp 192.168.7.80:/mnt /mnt/ceph
❯ touch /mnt/ceph/pino
touch: cannot touch '/mnt/ceph/pino': Permission denied


any suggestion will be appreciated


Rob


On 4/24/24 16:05, Adam King wrote:


- Although I can mount the export I can't write on it

What error are you getting when trying to do the write? The way you set 
things up doesn't look too different from one of our integration tests 
for ingress over NFS 
(https://github.com/ceph/ceph/blob/main/qa/suites/orch/cephadm/smoke-roleless/2-services/nfs-ingress.yaml), 
and that test does a simple read/write to the export after 
creating/mounting it.


- I can't understand how to use the sdc disks for journaling


you should be able to specify a `journal_devices` section in an OSD 
spec. For example


service_type: osd
service_id: foo
placement:
  hosts:
  - vm-00
spec:
  data_devices:
    paths:
    - /dev/vdb
  journal_devices:
    paths:
    - /dev/vdc
That will make non-colocated OSDs where the devices from the 
journal_devices section are used as journal devices for the OSDs on 
the devices in the data_devices section. Although I'd recommend 
looking through 
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications 
and seeing if there are any other filtering options than the path that 
can be used first. It's possible the path a device gets can change on 
reboot, and you could end up with cephadm using a device you don't want 
it to use for this, as that device gets the path another device held 
previously.
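
For instance, a sketch of the same spec filtering on the rotational
flag instead of fixed paths (purely illustrative; check what matches
your drives first with `ceph orch device ls`):

service_type: osd
service_id: foo
placement:
  hosts:
  - vm-00
spec:
  data_devices:
    rotational: 1
  journal_devices:
    rotational: 0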


- I can't understand the concept of "pseudo path"


I don't know it at a low level either, but it seems to just be the path 
nfs-ganesha will present to the user. There is another argument to 
`ceph nfs export create`, which is just "path" rather than pseudo-path, 
that marks what actual path within the CephFS the export is mounted on. 
It's optional and defaults to "/" (so the export you made is mounted at 
the root of the fs). I think that's the one that really matters. The 
pseudo-path seems to just act as a user-facing name for the path.
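
For example, a sketch of an export where the two differ (cluster and fs
names reused from this thread; the CephFS subdirectory is made up):

ceph nfs export create cephfs --cluster-id nfs-cephfs \
    --pseudo-path /mnt --fsname vol1 --path /shares/data

Clients mounting server:/mnt would then land in /shares/data inside vol1.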


On Wed, Apr 24, 2024 at 3:40 AM Roberto Maggi @ Debian 
 wrote:


Hi you all,

I'm almost new to ceph and I'm understanding, day by day, why the
official support is so expensive :)


I'm setting up a ceph nfs network cluster whose recipe can be found
below.

###

--> cluster creation
cephadm bootstrap --mon-ip 10.20.20.81 --cluster-network 10.20.20.0/24 --fsid $FSID \
  --initial-dashboard-user adm --initial-dashboard-password 'Hi_guys' \
  --dashboard-password-noupdate --allow-fqdn-hostname --ssl-dashboard-port 443 \
  --dashboard-crt /etc/ssl/wildcard.it/wildcard.it.crt \
  --dashboard-key /etc/ssl/wildcard.it/wildcard.it.key \
  --allow-overwrite --cleanup-on-failure
cephadm shell --fsid $FSID -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
cephadm add-repo --release reef && cephadm install ceph-common

--> adding hosts and set labels
for IP in $(grep ceph /etc/hosts | awk '{print $1}') ; do ssh-copy-id -f -i /etc/ceph/ceph.pub root@$IP ; done
ceph orch host add cephstage01 10.20.20.81 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstage02 10.20.20.82 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstage03 10.20.20.83 --labels _admin,mon,mgr,prometheus,grafana
ceph orch host add cephstagedatanode01 10.20.20.84 --labels osd,nfs,prometheus
ceph orch host add cephstagedatanode02 10.20.20.85 --labels osd,nfs,prometheus
ceph orch host add cephstagedatanode03 10.20.20.86 --labels osd,nfs,prometheus

--> network setup and daemons deploy
ceph config set mon 

[ceph-users] Re: ceph recipe for nfs exports

2024-04-29 Thread Robert Sander

On 4/29/24 17:21, Roberto Maggi @ Debian wrote:

I can mount it correctly, but when I try to write or touch any file in 
it, I get "Permission denied"


Works as expected: you have set squash to "all", so every NFS client 
user is now mapped to nobody on the server.


But only root is able to write to the CephFS at first.

Set squash to "no_root_squash" to be able to write as root to the NFS 
share. Then create a directory and change its ownership to someone else.
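
A possible sequence, reusing the names from this thread (assuming the
export is recreated rather than edited with `ceph nfs export apply`;
the directory name is only an example):

ceph nfs export rm nfs-cephfs /mnt
ceph nfs export create cephfs --cluster-id nfs-cephfs --pseudo-path /mnt \
    --fsname vol1 --squash no_root_squash

# on the client, as root:
mount -t nfs -o nfsvers=4.1,proto=tcp 192.168.7.80:/mnt /mnt/ceph
mkdir /mnt/ceph/shared
chown nobody:nogroup /mnt/ceph/shared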


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Behind on Trimming...

2024-04-29 Thread Erich Weiler

Hi Xiubo,

Is there any way to possibly get a PR development release we could 
upgrade to, in order to test and see if the lock order bug per Bug 
#62123 could be the answer?  Although I'm not sure that bug has been 
fixed yet?


-erich

On 4/21/24 9:39 PM, Xiubo Li wrote:

Hi Erich,

I raised one tracker for this https://tracker.ceph.com/issues/65607.

Currently I haven't figured out what was holding the 'dn->lock' in the 
'lookup' request or elsewhere, since there is no debug log.


Hopefully we can get the debug logs, so we can push this further.

Thanks

- Xiubo

On 4/19/24 23:55, Erich Weiler wrote:

Hi Xiubo,

Never mind, I was wrong: most of the blocked ops were 12 hours old. Ugh.

I restarted the MDS daemon to clear them.

I just reset to having one active MDS instead of two, let's see if 
that makes a difference.


I am beginning to think it may be impossible to catch the logs that 
matter here.  I feel like sometimes the blocked ops are just waiting 
because of load and sometimes they are waiting because they are stuck. 
But, it's really hard to tell which, without waiting a while.  But, I 
can't wait while having debug turned on because my root disks (which 
are 150 GB large) fill up with debug logs in 20 minutes.  So it almost 
seems that unless I could somehow store many TB of debug logs we won't 
be able to catch this.


Let's see how having one MDS helps.  Or maybe I actually need like 4 
MDSs because the load is too high for only one or two.  I don't know. 
Or maybe it's the lock issue you've been working on.  I guess I can 
test the lock order fix when it's available to test.


-erich

On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again, there 
were 150 of them.  But the age of each ranged from 500 to 4300 
seconds.  So it seems as if they are eventually being processed.


I wonder if we are thinking about this in the wrong way?  Maybe I 
should be *adding* MDS daemons because my current ones are overloaded?


Can a single server hold multiple MDS daemons?  Right now I have 
three physical servers each with one MDS daemon on it.


I can still try reducing to one.  And I'll keep an eye on blocked ops 
to see if any get to a very old age (and are thus wedged).


-erich

On 4/18/24 8:55 PM, Xiubo Li wrote:

    Okay, please try setting only one active MDS.
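
    A minimal way to do that, assuming the filesystem is called
    "cephfs" here just for illustration:

    ceph fs set cephfs max_mds 1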


On 4/19/24 11:54, Erich Weiler wrote:

We have 2 active MDS daemons and one standby.

On 4/18/24 8:52 PM, Xiubo Li wrote:

    BTW, how many active MDS daemons are you using?


On 4/19/24 10:55, Erich Weiler wrote:
OK, I'm sure I caught it in the right order this time, the logs 
should definitely show when the blocked/slow requests start. 
Check out these logs and dumps:


http://hgwdev.gi.ucsc.edu/~weiler/

It's a 762 MB tarball but it uncompresses to 16 GB.

-erichll


On 4/18/24 6:57 PM, Xiubo Li wrote:

Okay, could you try this with 18.2.0 ?

    I suspect it was introduced by:

    commit e610179a6a59c463eb3d85e87152ed3268c808ff
    Author: Patrick Donnelly 
    Date:   Mon Jul 17 16:10:59 2023 -0400

        mds: drop locks and retry when lock set changes

        An optimization was added to avoid an unnecessary gather on the inode
        filelock when the client can safely get the file size without also
        getting issued the requested caps. However, if a retry of getattr
        is necessary, this conditional inclusion of the inode filelock
        can cause lock-order violations resulting in deadlock.

        So, if we've already acquired some of the inode's locks then we must
        drop locks and retry.

        Fixes: https://tracker.ceph.com/issues/62052
        Fixes: c822b3e2573578c288d170d1031672b74e02dced
        Signed-off-by: Patrick Donnelly 
        (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac)



On 4/19/24 09:54, Erich Weiler wrote:
I'm on 18.2.1.  I think I may have gotten the timing off on the 
logs and dumps so I'll try again.  Just really hard to capture 
because I need to kind of be looking at it in real time to 
capture it. Hang on, lemme see if I can get another capture...


-erich

On 4/18/24 6:35 PM, Xiubo Li wrote:


    BTW, which Ceph version are you using?



On 4/12/24 04:22, Erich Weiler wrote:
BTW - it just happened again, I upped the debugging settings 
as you instructed and got more dumps (then returned the debug 
settings to normal).


Attached are the new dumps.

Thanks again,
erich

On 4/9/24 9:00 PM, Xiubo Li wrote:


On 4/10/24 11:48, Erich Weiler wrote:
    Does that mean it could be the lock order bug
    (https://tracker.ceph.com/issues/62123) as Xiubo suggested?


    I have raised one PR to fix the lock order issue; if
    possible, please give it a try to see whether it resolves this
    issue.


Thank you!  Yeah, this issue is happening every couple days 
now. It just happened again today and I got more MDS dumps. 
If it would help, let me know and I can send them!


    Once this happens, it would be better if you could enable the MDS
    debug logs:


debug mds = 20

debug ms = 1

And then provide

[ceph-users] Re: MDS Behind on Trimming...

2024-04-29 Thread Xiubo Li


On 4/30/24 04:05, Erich Weiler wrote:

Hi Xiubo,

Is there any way to possibly get a PR development release we could 
upgrade to, in order to test and see if the lock order bug per Bug 
#62123 could be the answer?  Although I'm not sure that bug has been 
fixed yet?


I think you can get the test packages from 
https://shaman.ceph.com/builds/ceph/. But this needs a build to be 
triggered first.



-erich

On 4/21/24 9:39 PM, Xiubo Li wrote:

Hi Erich,

I raised one tracker for this https://tracker.ceph.com/issues/65607.

Currently I haven't figured out what was holding the 'dn->lock' in 
the 'lookup' request or elsewhere, since there is no debug log.


Hopefully we can get the debug logs, so we can push this further.

Thanks

- Xiubo

On 4/19/24 23:55, Erich Weiler wrote:

Hi Xiubo,

Never mind, I was wrong: most of the blocked ops were 12 hours old. Ugh.

I restarted the MDS daemon to clear them.

I just reset to having one active MDS instead of two, let's see if 
that makes a difference.


I am beginning to think it may be impossible to catch the logs that 
matter here.  I feel like sometimes the blocked ops are just waiting 
because of load and sometimes they are waiting because they are 
stuck. But, it's really hard to tell which, without waiting a 
while.  But, I can't wait while having debug turned on because my 
root disks (which are 150 GB large) fill up with debug logs in 20 
minutes.  So it almost seems that unless I could somehow store many 
TB of debug logs we won't be able to catch this.


Let's see how having one MDS helps.  Or maybe I actually need like 4 
MDSs because the load is too high for only one or two. I don't know. 
Or maybe it's the lock issue you've been working on.  I guess I can 
test the lock order fix when it's available to test.


-erich

On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again, there 
were 150 of them.  But the age of each ranged from 500 to 4300 
seconds.  So it seems as if they are eventually being processed.


I wonder if we are thinking about this in the wrong way? Maybe I 
should be *adding* MDS daemons because my current ones are overloaded?


Can a single server hold multiple MDS daemons?  Right now I have 
three physical servers each with one MDS daemon on it.


I can still try reducing to one.  And I'll keep an eye on blocked 
ops to see if any get to a very old age (and are thus wedged).


-erich

On 4/18/24 8:55 PM, Xiubo Li wrote:

Okay, please try setting only one active MDS.


On 4/19/24 11:54, Erich Weiler wrote:

We have 2 active MDS daemons and one standby.

On 4/18/24 8:52 PM, Xiubo Li wrote:

BTW, how many active MDS daemons are you using?


On 4/19/24 10:55, Erich Weiler wrote:
OK, I'm sure I caught it in the right order this time, the logs 
should definitely show when the blocked/slow requests start. 
Check out these logs and dumps:


http://hgwdev.gi.ucsc.edu/~weiler/

It's a 762 MB tarball but it uncompresses to 16 GB.

-erichll


On 4/18/24 6:57 PM, Xiubo Li wrote:

Okay, could you try this with 18.2.0 ?

I suspect it was introduced by:

commit e610179a6a59c463eb3d85e87152ed3268c808ff
Author: Patrick Donnelly 
Date:   Mon Jul 17 16:10:59 2023 -0400

    mds: drop locks and retry when lock set changes

    An optimization was added to avoid an unnecessary gather on the inode
    filelock when the client can safely get the file size without also
    getting issued the requested caps. However, if a retry of getattr
    is necessary, this conditional inclusion of the inode filelock
    can cause lock-order violations resulting in deadlock.

    So, if we've already acquired some of the inode's locks then we must
    drop locks and retry.

    Fixes: https://tracker.ceph.com/issues/62052
    Fixes: c822b3e2573578c288d170d1031672b74e02dced
    Signed-off-by: Patrick Donnelly 
    (cherry picked from commit b5719ac32fe6431131842d62ffaf7101c03e9bac)



On 4/19/24 09:54, Erich Weiler wrote:
I'm on 18.2.1.  I think I may have gotten the timing off on 
the logs and dumps so I'll try again.  Just really hard to 
capture because I need to kind of be looking at it in real 
time to capture it. Hang on, lemme see if I can get another 
capture...


-erich

On 4/18/24 6:35 PM, Xiubo Li wrote:


BTW, which Ceph version are you using?



On 4/12/24 04:22, Erich Weiler wrote:
BTW - it just happened again, I upped the debugging 
settings as you instructed and got more dumps (then 
returned the debug settings to normal).


Attached are the new dumps.

Thanks again,
erich

On 4/9/24 9:00 PM, Xiubo Li wrote:


On 4/10/24 11:48, Erich Weiler wrote:
Does that mean it could be the lock order bug 
(https://tracker.ceph.com/issues/62123) as Xiubo 
suggested?


I have raised one PR to fix the lock order issue; if 
possible, please give it a try to see whether it resolves this 
issue.


Thank you!  Yeah, this issue is happening every couple 
days now. It just happened again today and I got more MDS 
dumps. If it w