Hello Aurelien,

I'm working with Philippe WEILL and I'm Philippe too ;o)

We first encountered the problem a few months ago, and it happened again yesterday after the maintenance window.
In production, all servers and clients now run Lustre 2.15.5.

We reproduced the problem with RockyLinux 8.10 VMs running Lustre 2.15.5 (1x 
mds-mgs, 2x oss, 1x client).
We wonder whether it could be related to a misuse of the changelog mask 
(='MARK MTIME CTIME' instead of ='+MTIME +CTIME')?
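To make the question concrete, here is a toy model in plain sh (no lctl needed) of the two syntaxes as we understand them: a bare flag list REPLACES the whole mask, while '+FLAG' and '-FLAG' entries modify the current one. The starting mask in the examples is illustrative, not the real default; please correct us if this reading of the syntax is wrong.

```shell
#!/bin/sh
# Toy model of changelog_mask semantics as we understand them (NOT Lustre code):
# a bare flag list replaces the mask entirely; '+FLAG'/'-FLAG' modify it.

apply_mask() {
  # $1 = current mask, $2 = value given to changelog_mask
  mask=$1
  incremental=no
  for f in $2; do
    case $f in +*|-*) incremental=yes ;; esac
  done
  if [ "$incremental" = no ]; then
    # bare list: the previous mask is discarded entirely
    echo "$2"
    return
  fi
  for f in $2; do
    case $f in
      +*) case " $mask " in
            *" ${f#+} "*) ;;                 # already present
            *) mask="$mask ${f#+}" ;;        # add the flag
          esac ;;
      -*) new=
          for m in $mask; do
            [ "$m" = "${f#-}" ] || new="$new $m"   # drop the flag
          done
          mask=${new# } ;;
    esac
  done
  echo "$mask"
}

# 'MARK MTIME CTIME' silently drops SATTR (chmod/chgrp records) from the mask:
apply_mask "MARK SATTR MTIME CTIME" "MARK MTIME CTIME"
# '+MTIME +CTIME' keeps SATTR and everything else:
apply_mask "MARK SATTR MTIME CTIME" "+MTIME +CTIME"
```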

## Making the problem happen:

[root@test-mds-mgs ~]# lctl set_param -P mdd.lustre-MDT0000.changelog_mask='MARK MTIME CTIME'
[root@test-mds-mgs ~]# reboot
[root@test-mds-mgs ~]# mount -t lustre /dev/sdb /mnt/mgt/
[root@test-mds-mgs ~]# mount -t lustre /dev/sdc /mnt/mdt/
[root@test-mds-mgs ~]# lctl get_param mdd.lustre-MDT0000.changelog_mask
mdd.lustre-MDT0000.changelog_mask=MARK MTIME CTIME

[root@test-rbh-cl-215 lustre]# LANG=C touch aeffacer
touch: setting times of 'aeffacer': Input/output error

[root@test-mds-mgs ~]# LANG=C dmesg -T
...
[Thu Nov 21 10:54:24 2024] Lustre: Lustre: Build Version: 2.15.5
[Thu Nov 21 10:54:24 2024] LNet: Added LNI 172.20.240.172@tcp [8/256/0/180]
[Thu Nov 21 10:54:24 2024] LNet: Accept secure, port 988
[Thu Nov 21 10:54:24 2024] LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 10:54:35 2024] LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 10:54:35 2024] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 172.20.240.171@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[Thu Nov 21 10:54:35 2024] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Thu Nov 21 10:54:35 2024] Lustre: lustre-MDD0000: changelog on
[Thu Nov 21 10:55:26 2024] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
[Thu Nov 21 10:55:26 2024] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[Thu Nov 21 10:55:26 2024] LustreError: 1907:0:(llog_cat.c:543:llog_cat_current_log()) lustre-MDD0000: next log does not exist!
...

## "Solving" the problem:

[root@test-mds-mgs ~]# lctl set_param -P mdd.lustre-MDT0000.changelog_mask='+SATTR'
[root@test-mds-mgs ~]# reboot
[root@test-mds-mgs ~]# mount -t lustre /dev/sdb /mnt/mgt/
[root@test-mds-mgs ~]# mount -t lustre /dev/sdc /mnt/mdt/
[root@test-mds-mgs ~]# lctl get_param mdd.lustre-MDT0000.changelog_mask
mdd.lustre-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC

[root@test-rbh-cl-215 lustre]# touch aeffacer
[root@test-rbh-cl-215 lustre]# ll aeffacer
-rw-r--r-- 1 root root 0 21 nov.  11:03 aeffacer

[root@test-mds-mgs ~]# LANG=C dmesg -T
...
[Thu Nov 21 11:02:52 2024] Lustre: Lustre: Build Version: 2.15.5
[Thu Nov 21 11:02:52 2024] LNet: Added LNI 172.20.240.172@tcp [8/256/0/180]
[Thu Nov 21 11:02:52 2024] LNet: Accept secure, port 988
[Thu Nov 21 11:02:53 2024] LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 11:02:57 2024] LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 11:02:57 2024] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Thu Nov 21 11:02:57 2024] Lustre: lustre-MDD0000: changelog on
[Thu Nov 21 11:03:27 2024] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
[Thu Nov 21 11:03:27 2024] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
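In case it helps with debugging: the permanent settings written by `set_param -P` are stored in the params llog on the MGS and can be inspected (and, if a bad mask was recorded, deleted) there. This is from memory, so please double-check the exact syntax against the lctl man page on 2.15.x before relying on it:

```shell
# Dump the permanent parameter records (everything set with 'set_param -P'):
lctl --device MGS llog_print params

# Delete a bad permanent setting with -d, then set it again using the
# additive '+FLAG' form instead of a bare replacing list:
lctl set_param -P -d mdd.lustre-MDT0000.changelog_mask
```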

Philippe


----- Original Message -----
From: "Philippe Weill" <philippe.we...@latmos.ipsl.fr>
To: "Aurelien Degremont" <adegrem...@nvidia.com>, lustre-discuss@lists.lustre.org
Sent: Wednesday, November 20, 2024 17:44:16
Subject: Re: [lustre-discuss] Report Strange Problem on 2.15.5 with changelog_mask

On 20/11/2024 16:24, Aurelien Degremont wrote:
> Hello Philippe,
> 
> I do not see why changing the changelog mask would cause an I/O error, 
> especially as this seems transient.
> Did you happen to have any errors on your client hosts or MDS hosts at the 
> time of your testing? (see dmesg)


Hello,

No, we did not see any errors, and we have reproduced the problem with 3 Rocky Linux 8.10 VMs running a fresh 2.15.5 (1 MDS, 1 OSS, 1 client).


> 
> 
> Aurélien
> ------------------------------------------------------------------------------------------------------------------------------------
> *From:* lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
> Philippe Weill <philippe.we...@latmos.ipsl.fr>
> *Sent:* Wednesday, November 20, 2024 07:11
> *To:* lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
> *Subject:* [lustre-discuss] Report Strange Problem on 2.15.5 with 
> changelog_mask
> 
> 
> Hello,
> 
> After running the following command on our Lustre MDS
> 
> lctl set_param -P mdd.*-MDT0000.changelog_mask='MARK MTIME CTIME'
> 
> and then unmounting and remounting the MDT on the MDS,
> 
> we got errors on touch/chmod/chgrp of existing files:
> 
> root@host:~# echo foobar > /scratch/root/foobar
> root@host:~# cat /scratch/root/foobar
> foobar
> root@host:~# echo foobar2 >>  /scratch/root/foobar
> root@host:~# cat /scratch/root/foobar
> foobar
> foobar2
> root@host:~# touch /scratch/root/foobar
> touch: setting times of '/scratch/root/foobar': Input/output error
> root@host:~# chgrp group /scratch/root/foobar
> chgrp: changing group of '/scratch/root/foobar': Input/output error
> root@host:~# chmod 666 /scratch/root/foobar
> chmod: changing permissions of '/scratch/root/foobar': Input/output error
> 
> 
> We then ran the following command
> 
> lctl set_param -P mdd.*-MDT0000.changelog_mask='-MARK -MTIME -CTIME'
> 
> and activated the mask only non-persistently for our Robinhood:
> 
> lctl set_param mdd.*-MDT0000.changelog_mask='MARK MTIME CTIME'
> 
> [root@mds ~]# lctl get_param mdd.scratch-MDT0000.changelog_mask
> mdd.scratch-MDT0000.changelog_mask=MARK MTIME CTIME
> 
> and everything started to work again.
> 
> Is this a bug, or misuse on our part?
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

-- 
Weill Philippe -  Administrateur Systeme et Reseaux
CNRS/UPMC/IPSL   LATMOS (UMR 8190)
Tour 45/46 3e Etage B302|4 Place Jussieu|75252 Paris Cedex 05 -  FRANCE
Email:philippe.we...@latmos.ipsl.fr | tel:+33 0144274759
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
