Peter,
I've been reading the Baeldung pages, among others, to gain some insight
into Linux buffer cache behavior:
https://www.baeldung.com/linux/file-system-caching
and https://docs.kernel.org/admin-guide/sysctl/vm.html
As can be seen in the first image below, Lustre is having no trouble
keeping up with the dirty pages. Dirty pages never exceed 400MB
on a 64GB system, well under 1%. This dirty page data is drawn from
/proc/meminfo while dd is running. Here are some of the vm dirty settings:
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 40
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
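For reference, here is a rough sketch of how the ratio settings above translate into thresholds, and how the dirty fraction can be computed from /proc/meminfo. It assumes the ratios are applied to MemTotal; the kernel actually applies them to available memory (free plus reclaimable pages), so the real thresholds are somewhat lower. The function names and the sample numbers (64 GiB total, 400 MiB dirty) are illustrative, not captured from the system.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(meminfo_kb, background_ratio=10, dirty_ratio=40):
    """Dirty percentage plus approximate flusher thresholds.

    Assumption: ratios are applied to MemTotal, which overestimates
    the thresholds the kernel really uses.
    """
    total = meminfo_kb["MemTotal"]
    dirty = meminfo_kb.get("Dirty", 0)
    return {
        "dirty_pct": 100.0 * dirty / total,
        "background_kb": total * background_ratio // 100,
        "throttle_kb": total * dirty_ratio // 100,
    }


# Hypothetical sample: 64 GiB of RAM with 400 MiB of dirty pages.
SAMPLE = """MemTotal:       67108864 kB
MemFree:        20000000 kB
Dirty:            409600 kB
Writeback:             0 kB
"""

summary = dirty_summary(parse_meminfo(SAMPLE))
print(summary)
```

On a live system the same functions can be fed open("/proc/meminfo").read() in a loop while dd runs. With these settings, background writeback would start at roughly 10% and writers would be throttled at roughly 40%, both far above the 400MB observed.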
I am not sure what to make of your following comment. I should have
stated that the dd command used for this was
dd if=/dev/zero of=pfl.dat bs=1M count=8000
I will also point out that I came across this
behavior while debugging another problem, and I was simply using dd to
create a PFL-striped file so I could check how the file was laid out on
the OSTs. Over the course of many runs I kept noticing the pauses in
the writes, and it strikes me that the behavior is odd in that there is
typically a significant amount of inactive file pages and free memory
(second image below). I don't understand why those inactive file pages
are not evicted, or free memory used, before evicting the pfl.dat pages
that were just written. What is driving the LRU eviction here? I should
also point out that the cached memory is always well under the
50% limit that is configured as Lustre's maximum.
Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.
https://www.dropbox.com/scl/fi/5seamxgscdrat1eu2t5zn/dd_swapped.png?rlkey=oyicyq2a8eeqlgohndgalisy0&dl=0
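To make the question about free memory and inactive file pages concrete per NUMA node, here is a minimal sketch that parses the per-node counters the kernel exposes under /sys/devices/system/node/node*/meminfo. The helper names and the sample values are hypothetical; only the "Node N Field: value kB" line layout is taken from the kernel's output.

```python
import glob
import re

NODE_LINE = re.compile(r"^Node\s+(\d+)\s+([^:]+):\s+(\d+)")


def parse_node_meminfo(text):
    """Parse per-node meminfo text into {node: {field: kB}}."""
    nodes = {}
    for line in text.splitlines():
        m = NODE_LINE.match(line.strip())
        if m:
            node, field, value = int(m.group(1)), m.group(2), int(m.group(3))
            nodes.setdefault(node, {})[field] = value
    return nodes


def easily_reclaimable_kb(nodes):
    """Per node: free memory plus inactive file pages, in kB."""
    return {
        node: fields.get("MemFree", 0) + fields.get("Inactive(file)", 0)
        for node, fields in sorted(nodes.items())
    }


def read_live():
    """Concatenate the per-node meminfo files on a running Linux system."""
    paths = sorted(glob.glob("/sys/devices/system/node/node*/meminfo"))
    return "".join(open(p).read() for p in paths)


# Hypothetical sample for a two-node system.
SAMPLE = """Node 0 MemTotal:       33554432 kB
Node 0 MemFree:         8000000 kB
Node 0 Inactive(file):  6000000 kB
Node 1 MemTotal:       33554432 kB
Node 1 MemFree:          500000 kB
Node 1 Inactive(file):   250000 kB
"""

per_node = easily_reclaimable_kb(parse_node_meminfo(SAMPLE))
print(per_node)
```

Sampling read_live() once a second during the dd run would show whether the pauses coincide with one node running out of free and inactive file pages while the other still has plenty.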
On 12/6/23 14:24, [email protected] wrote:
Today's Topics:
1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
(Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
depreciated clients (Huang, Qiulan)
----------------------------------------------------------------------
Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +0000
From: Jan Andersen <[email protected]>
To: lustre <[email protected]>
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed
Are there any tools for coordinating the start and shutdown of a Lustre
filesystem, so that the OSS systems don't attempt to mount disks before the MGT
and MDT are online?
------------------------------
Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +0000
From: [email protected] (Peter Grandi)
To: list Lustre discussion <[email protected]>
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID: <[email protected]>
Content-Type: text/plain; charset=iso-8859-1
I have an OSC caching question. I am running a dd process
which writes an 8GB file. The file is on Lustre, striped
8x1M.
How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.
This is run on a system that has 2 NUMA nodes (CPU sockets).
[...] Why does lustre go to the trouble of dumping node1 and
then not use node1's memory, when there was always plenty of
free memory on node0?
What makes you think "lustre" is doing that?
Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?
Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?
Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.
------------------------------
Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +0000
From: "Bertschinger, Thomas Andrew Hjorth" <[email protected]>
To: Jan Andersen <[email protected]>, lustre <[email protected]>
Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID: <ph8pr09mb103611a4b55e420410ae14abaab...@ph8pr09mb10361.namprd09.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"
Hello Jan,
You can use the Pacemaker / Corosync high-availability software stack for this:
specifically, ordering constraints [1] can be used.
Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its
configuration is complex and difficult to get right, and it significantly
complicates system administration. One downside of Pacemaker is that it is not
easy to decouple the Pacemaker service from the Lustre services, meaning if you
stop the Pacemaker service, it will try to stop all of the Lustre services.
This might make it inappropriate for use cases that don't involve HA.
Given those downsides, if others in the community have suggestions on simpler
means to accomplish this, I'd love to see other tools that can be used here
(especially officially supported ones, if they exist).
[1]https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/constraints.html#specifying-the-order-in-which-resources-should-start-stop
- Thomas Bertschinger
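As a small illustration of the ordering constraint being asked about, the sketch below derives a start order from a dependency map using Python's standard graphlib module (Python 3.9 or later). The target names and the map itself are assumptions for illustration: the thread only states that the OSTs should wait for the MGT and MDT, and the MDT-after-MGT edge is added here as an assumption. This is not a Pacemaker configuration and does not mount anything.

```python
from graphlib import TopologicalSorter

# Each key may start only after everything in its value set is online.
# Hypothetical dependency map; adjust to the actual targets in use.
START_AFTER = {
    "MGT": set(),
    "MDT": {"MGT"},          # assumption: the MDT waits for the MGT
    "OST": {"MGT", "MDT"},   # from the question: OSTs wait for MGT and MDT
}


def start_order(deps):
    """Return one start order that satisfies every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = start_order(START_AFTER)
print(order)
```

A wrapper script could walk this list and wait for each target to report as mounted before moving on to the next one.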
------------------------------
Message: 4
Date: Wed, 6 Dec 2023 20:23:11 +0000
From: "Huang, Qiulan" <[email protected]>
To: "[email protected]" <[email protected]>
Cc: "Huang, Qiulan" <[email protected]>
Subject: [lustre-discuss] Lustre server still try to recover the lnet reply to the depreciated clients
Message-ID: <blapr09mb685012e0f741e1b98f65f8c6ce...@blapr09mb6850.namprd09.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"
Hello all,
We removed some clients two weeks ago, but the Lustre server is still
trying to handle LNet recovery replies to those clients (the error log is
posted below). They are also still listed in the exports directory.
I tried to evict the clients with the following command, but it failed with
the error "no exports found":
lctl set_param mdt.*.evict_client=10.68.178.25@tcp
Do you know how to clean up the removed, deprecated clients? Any
suggestions would be greatly appreciated.
For example:
[root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/
total 0
-r--r--r-- 1 root root 0 Dec 5 15:41 export
-r--r--r-- 1 root root 0 Dec 5 15:41 fmd_count
-r--r--r-- 1 root root 0 Dec 5 15:41 hash
-rw-r--r-- 1 root root 0 Dec 5 15:41 ldlm_stats
-r--r--r-- 1 root root 0 Dec 5 15:41 nodemap
-r--r--r-- 1 root root 0 Dec 5 15:41 open_files
-r--r--r-- 1 root root 0 Dec 5 15:41 reply_data
-rw-r--r-- 1 root root 0 Aug 14 10:58 stats
-r--r--r-- 1 root root 0 Dec 5 15:41 uuid
/var/log/messages:Dec 6 12:50:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 13:05:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:05:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 13:20:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:20:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 13:35:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:35:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 13:50:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:50:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous
similar message
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError:
3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError:
3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous
similar messages
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError:
3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError:
3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous
similar messages
/var/log/messages:Dec 6 15:02:14 mds2 kernel: LNetError:
3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.176.25@tcp) recovery failed with -111
Regards,
Qiulan
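To summarize a log like the one above, the following sketch counts LNet recovery failures per peer NID and error code. It is a minimal example: the regular expression is written against the message text shown in this log only, the sample is a three-entry excerpt of it, and the errno names are the Linux meanings of 110 and 111.

```python
import re
from collections import Counter

# Matches "peer NI (<nid>) recovery failed with -<errno>", allowing the
# line wrapping seen in the mail (\s+ also matches newlines).
FAILURE = re.compile(r"peer NI\s+\(([^)]+)\)\s+recovery failed with (-\d+)")

# Linux errno names for the two codes that appear in the log.
ERRNO_NAMES = {110: "ETIMEDOUT", 111: "ECONNREFUSED"}


def count_failures(log_text):
    """Return a Counter keyed by (nid, positive errno)."""
    counts = Counter()
    for nid, code in FAILURE.findall(log_text):
        counts[(nid, -int(code))] += 1
    return counts


# Excerpt of the log above, kept in its wrapped form.
SAMPLE = """/var/log/messages:Dec 6 13:05:17 mds2 kernel: LNetError:
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError:
3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError:
3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI
(10.67.176.25@tcp) recovery failed with -111
"""

failures = count_failures(SAMPLE)
for (nid, err), n in sorted(failures.items()):
    print(nid, ERRNO_NAMES.get(err, err), n)
```

Grouping by NID and error code makes it easier to see which peers time out and which refuse the connection.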
------------------------------
Subject: Digest Footer
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
------------------------------
End of lustre-discuss Digest, Vol 213, Issue 7
**********************************************