Re: [lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker

Laura Hild via lustre-discuss Mon, 17 Mar 2025 07:22:44 -0700

You'll notice the fix is in the system configuration rather than in Pacemaker 
itself, and what it effectively does is choose which of the possibilities to 
go with.  Just as an example (I'm definitely *not* recommending this), you 
could probably also have modified pacemaker.service to put the cluster in 
maintenance mode before it is stopped.


More importantly, /usr/lib/systemd/system/resource-agents-deps.target is owned 
by the resource-agents package, so if you're going to go with the solution you 
described, you should *not* modify the file in /usr/lib, but rather use an 
actual drop-in as described in what Oyvind linked, i.e. a .conf file whose 
first line is "[Unit]" in the 
/etc/systemd/system/resource-agents-deps.target.d/ directory.
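
A minimal sketch of such a drop-in might look like the following, reusing the 
two ordering directives from the quoted message below.  The file name 
lustre-shutdown.conf is a hypothetical placeholder (any name ending in .conf 
in that directory should do), and this exact file is untested here.

```
[Unit]
# Hypothetical path for this file:
#   /etc/systemd/system/resource-agents-deps.target.d/lustre-shutdown.conf
# Directives copied unchanged from the quoted message.
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
```

After creating the file, `systemctl daemon-reload` makes systemd re-read its 
unit files, and `systemctl cat resource-agents-deps.target` should then list 
the drop-in beneath the packaged unit.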



________________________________________
From: chenzu...@gmail.com <chenzu...@gmail.com>
Sent: Friday, 14 March 2025 23:06
To: Laura Hild
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Lustre MDT/OST Mount Failures During Virtual 
Machine Reboot with Pacemaker

Thank you for your advice.

A user named Oyvind replied on the us...@clusterlabs.org mailing list:
You need the systemd drop-in functionality introduced in RHEL 9.3
to avoid this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2184779

My understanding of the cause is as follows:
During reboot, both the system and Pacemaker try to unmount the Lustre 
resource at the same time.
If the system starts unmounting first and Pacemaker tries afterward, 
Pacemaker's unmount immediately returns success.
At that point, however, the system's unmount has not yet completed, so 
Pacemaker goes on to mount the target on the other node, which triggers this 
issue.

My current modification is as follows:
Add the following lines to the file 
`/usr/lib/systemd/system/resource-agents-deps.target`:
```
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
```

After making this modification, the issue no longer occurs during reboot.
________________________________
chenzu...@gmail.com



From: Laura Hild<mailto:l...@jlab.org>
Date: 2025-03-06 06:12
To: chenzu...@gmail.com<mailto:chenzu...@gmail.com>
CC: lustre-discuss<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre MDT/OST Mount Failures During Virtual 
Machine Reboot with Pacemaker
I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I 
virtually never try to (cleanly) reboot a host from which I have not already 
evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker 
in maintenance mode and unmounting/exporting everything manually.  If I can't 
evacuate all resources and complete a lustre_rmmod, the host is getting 
power-cycled.

So maybe I can say, my guess would be that in the host's shutdown process, 
stopping the Pacemaker service happens before filesystems are unmounted, and 
that Pacemaker doesn't want to make an assumption whether its own shut-down 
means it should standby or initiate maintenance mode, and therefore the other 
host ends up knowing only that its partner has disappeared, while the 
filesystems have yet to be unmounted.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
