[slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt
Hi,

I happen to run a small cluster that doesn't use slurmdbd but the plain text log, accounting_storage/filetxt in slurm.conf. It is running Ubuntu 16.04 LTS and the Slurm 15.08.7 provided with it.

Can someone tell me if it is normal (or at least a known bug) that with this setup, sacct -X -N node04 always returns the full list of all jobs recorded in the accounting file? Likewise, filtering for start/end time with -S/-E has no effect at all.

It is also apparent that it really takes a long time for sacct to digest the text file, even if it is reduced to less than 5 jobs and about 6M in size. To give a figure: 24 seconds user time, keeping one core of a Xeon E5-2609 v2 @ 2.5 GHz busy.

Alrighty then,

Thomas

PS: Please no pointers about better running a proper database with slurmdbd … I know that that works;-)

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt
On Thu, 30 Jan 2020 19:07:59 +0300, mercan wrote:

> Note: The filetxt plugin records only a limited subset of accounting
> information and will prevent some sacct options from proper operation.

Thank you for looking this up. But since the filetxt file does contain the start/end timestamps and the nodes the job ran on, it is strange that sacct should not be able to filter on those criteria. This is an example line of the accounting file, with obvious identifying fields blanked out:

batch 1548429637 1548429637 - - 0 1 4294536312 48 node[09-15,22] (null)

So, matching for job ID, user name (via numerical uid lookup), timestamps and the nodes should be possible; it's all there. Can someone confirm that it indeed is the case that _none_ of the filtering options of sacct are supposed to work on filetxt?

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
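As a stop-gap, one could of course filter the text log by hand. A rough sketch, assuming the start/end timestamps end up as the second and third whitespace-separated fields as in the (redacted) record above — the field positions and the log path are guesses to be verified against the actual file, not taken from Slurm documentation:

#!/bin/sh
# Hypothetical manual filter on the plain-text accounting log.
# ACCT_FILE and the field numbers are placeholders: point ACCT_FILE at the
# file configured via AccountingStorageLoc and check which columns really
# hold the start/end epoch timestamps before relying on this.
ACCT_FILE=/var/log/slurm/accounting
from=$(date -d 2019-01-25 +%s)
to=$(date -d 2019-01-26 +%s)
awk -v from="$from" -v to="$to" '$2 >= from && $3 <= to' "$ACCT_FILE"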
Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt
On Thu, 30 Jan 2020 19:03:38 +0100, "Dr. Thomas Orgis" wrote:

> batch 1548429637 1548429637 - - 0 1 4294536312 48 node[09-15,22] (null)
>
> So, matching for job ID, user name (via numerical uid lookup),
> timestamps and the nodes should be possible, it's all there.
>
> Can someone confirm that it indeed is the case that _none_ of the
> filtering options of sacct are supposed to work on filetxt?

Matching for user (-u) and job ID (-j) works, but not -N/-S/-E. So is this just the current state, and is it up to me to provide a patch to enable it if I want that behaviour?

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
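A quick summary of the observed filter behaviour with accounting_storage/filetxt (user name, job ID and dates below are placeholders, not values from the thread):

# honoured by sacct with filetxt:
sacct -X -u alice
sacct -X -j 12345
# silently ignored, returning the full recorded job list:
sacct -X -N node04
sacct -X -S 2020-01-01 -E 2020-01-31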
Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt
On Fri, 31 Jan 2020 20:57:16 -0800, Chris Samuel wrote:

> You're using a very very very old version of slurm there (15.08)

Well, that's what happens when an application gets into the mainstream and is included in the OS distribution. On this cluster, we just try to run with what Ubuntu LTS gives us. And I guess it is not atypical to keep a base OS release during the lifetime of a cluster, while this one already had two upgrades to newer LTS releases (another one would probably involve rebuilding the cluster) … You should get accustomed to people asking questions about _really_ outdated Slurm versions;-)

> should upgrade to a recent one (I'd suggest 19.05.5) to check whether
> it's been fixed in the intervening years.

But this seems to confirm my suspicion that people were not that concerned with the simple text storage … otherwise it would be better known whether this works or not. I tried running sacct from 19.05.5 against our config, without touching the running instance of Slurm, and it at least doesn't complain. The only difference I can see is that the default changed from showing all jobs in the history to showing none. Adding -u and -j works, but -N/-S/-E are still ignored.

So apparently, during the development of Slurm the journey quickly moved on to database storage before people got annoyed by long job lists, or the functionality got removed accidentally after the plain text storage went out of fashion. Well, since running a modern sacct directly on the old system works, I might get around to hacking in the missing matches. Any helpful pointers before I blindly dig into the code?

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
[slurm-users] Access to slurm job cgroups in prolog/epilog script
Hi,

I am wondering about the exact execution order of prolog scripts and plugins in Slurm, with the goal of accessing the freshly created cgroups (created by the task/cgroup plugin) in our prolog/epilog scripts, which run with PrologFlags=Alloc to ensure the traditional batch system behaviour. We want some information, namely the prepared cpuset for the job in the prolog and the statistics/counter differences in the epilog. I am aware of accounting and profiling options using Slurm and plugins, but there are reasons I want to handle cgroup information myself; maybe even to experiment with things that might go into a Slurm plugin at some point.

The job cgroups are created after the prolog scripts ran and destroyed before the epilog scripts run (correct? — looks like that). The design seems to focus on individual job steps, having things run closely coupled to the possibly multiple components (steps, tasks) of batch jobs that I only have a hazy concept of.

Is there a standard way to get the cgroup hierarchy for the job created early, before the per-node prolog script that runs as root (the slurmd user), with the final cleanup happening later, after the epilog ran? If configuration doesn't do it, I thought about modifying task/cgroup, but I suspect that the whole scope of the plugin lies between the prolog and epilog scripts. Can someone confirm that? I welcome pointers to documentation that explains in detail when which parts of a plugin are run in relation to the slot the prolog scripts get.

With https://bugs.schedmd.com/show_bug.cgi?id=9429, there seems to be a way to keep the cgroup around longer: just sabotage the cleanup phase and do it later in the epilog (as I do now on an Ubuntu 20.04 cluster with the distro-provided slurmd that suffers from this bug). But will e.g. moving the code from task_p_pre_setuid() to task_p_slurmd_reserve_resources() give me early access in the prolog? I might just try and break something, but I haven't yet found documentation on these details of the plugin API, and for once thought asking around first might be good.

I want cpuset information and at least things like per-node memory high-water marks. The desired granularity is at the job level, and it would be nice to get rid of inefficient timeseries to approximate that. The cpuset is needed in advance of user programs starting, as I hook a listener to the taskstats interface to cheaply and accurately account for user processes (kernel tasks) with command names. My profiling is somewhere between the hdf5 timeseries and the rough values you get out of sacct, with an orthogonal bit about kernel tasks (to tell the user how many python processes wasted how much memory each).

Alrighty then,

Thomas

PS: I guess lots is possible by writing a custom plugin that ties in with what my prolog/epilog scripts do, but I'd prefer a light touch first. Hacking the scripts during development is far more convenient.

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
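For the concrete kind of data in question, a minimal sketch of what an epilog could read, assuming the cgroup v1 hierarchy used by task/cgroup (/sys/fs/cgroup/<controller>/slurm/uid_<uid>/job_<jobid>) and assuming the job cgroup still exists when the epilog runs (e.g. with the deferred cleanup mentioned above); paths and variable availability need checking on the actual system, and cgroup v2 uses different file names:

#!/bin/sh
# Hypothetical epilog snippet: read per-job cgroup data before cleanup.
# Assumes cgroup v1 with the slurm/uid_<uid>/job_<jobid> layout and that
# SLURM_JOB_UID and SLURM_JOB_ID are set in the epilog environment.
jobdir="slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}"
cpuset_file="/sys/fs/cgroup/cpuset/$jobdir/cpuset.cpus"
mem_file="/sys/fs/cgroup/memory/$jobdir/memory.max_usage_in_bytes"

[ -r "$cpuset_file" ] && echo "job $SLURM_JOB_ID cpuset: $(cat "$cpuset_file")"
[ -r "$mem_file" ] && echo "job $SLURM_JOB_ID memory high-water mark: $(cat "$mem_file") bytes"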
Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram
On Mon, 06 Mar 2023 13:35:38 +0100, Stefan Staeglich wrote:

> But this fixed not the main error but might have reduced the frequency of
> occurring. Has someone observed similar issues? We will try a higher
> SuspendTimeout.

We had issues with power saving, too. We powered the idle nodes off, so resuming meant a full boot. We repeatedly observed the strange behaviour that the node is present for a while, but slurmctld only detects it as ready right when it is giving up at SuspendTimeout. But instead of fixing this possibly subtle logic error, we figured that

a) the node suspend support in Slurm was not really designed for a full power off/on, which can regularly take minutes, and

b) taking nodes out of and into production is something the cluster admin does; it is not in the scope of the batch system.

Hence I wrote a script that runs as a service on a superior admin node. It queries Slurm for idle nodes and pending jobs and then decides which nodes to drain and power down, or to bring back online (see the sketch below). This needs more knowledge about Slurm job and node states than I'd like, but it works.

Ideally, I'd like the power-saving feature of Slurm to consist of a simple interface that can communicate 1. which nodes are probably not needed in the coming x minutes/hours, depending on the job queue, with settings like keeping a minimum number of nodes idle, and 2. which nodes that are currently drained/offline it could use to satisfy user demand. I imagine that Slurm upstream is not very keen on hashing out a robust interface for that. I can see arguments for keeping this wholly internal to Slurm, but for me, taking nodes in and out of production is not directly a batch system's task.

Obviously the integration of power saving where nodes really are powered down brings complications like the strange ResumeTimeout behaviour. Also, in the case of nodes that have trouble getting back online, the method inside Slurm makes for a bad user experience: the nodes are first allocated to the job, and _then_ they are powered up. In the worst case of a defective node, Slurm will wait for the whole SuspendTimeout just to realize that it doesn't really have the resources it just promised to the job, making the job's run attempt fail needlessly.

With my external approach, bringing a node back up is handled outside slurmctld. Only after a node is back is it undrained, and jobs will be allocated on it. I use draining with a specific reason to mark nodes that are offline due to power saving. What sucks is that I have to implement part of the scheduler, in the sense that I need to match pending jobs' demands against the properties of available nodes.

Maybe the internal power saving could be made more robust, but I would rather see more separation of concerns than putting everything into one box. Things are too entangled, even with my simple concept of a 'job' not beginning to describe what Slurm has in terms of various steps as scheduling entities, which by default also use delayed allocation techniques (regarding prolog script behaviour, for example).

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
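A bare-bones sketch of the decision loop such an external service can run (not the actual script; keep_idle, the reason string and node-power-off-wrapper are made-up placeholders for site policy and the IPMI tooling):

#!/bin/sh
# Hypothetical skeleton of an external power-saving decision pass, run on a
# host that can reach slurmctld via sinfo/squeue/scontrol.
keep_idle=2
reason="powersave:off"

pending=$(squeue -h -t PENDING | wc -l)
idle_nodes=$(sinfo -h -t idle -o '%n')
idle_count=$(echo "$idle_nodes" | grep -c .)

if [ "$pending" -eq 0 ] && [ "$idle_count" -gt "$keep_idle" ]; then
    # drain surplus idle nodes with our marker reason, then power them off
    surplus=$(echo "$idle_nodes" | tail -n +"$((keep_idle + 1))")
    for n in $surplus
    do
        scontrol update NodeName="$n" State=DRAIN Reason="$reason" \
            && node-power-off-wrapper "$n"   # site-specific IPMI wrapper (placeholder)
    done
fi
# The reverse path -- booting and undraining nodes marked with $reason when
# pending demand rises -- would go here; matching pending job requirements
# against node properties is the hard part mentioned above.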
Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram
On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen wrote:

> FYI: Slurm power_save works very well for us without the issues that you
> describe below. We run Slurm 22.05.8, what's your version?

I'm sure that there are setups where it works nicely;-) For us, it didn't, and I was faced with either hunting the bug in Slurm or working around it with more control, fixing the underlying issue of the node resume script being called _after_ the job has been allocated to the node. That is too late in the case of a node boot-up failure and causes annoying delays for users, only for them to see their jobs fail.

We run 21.08.8-2, which means any debugging of this on the Slurm side would mean upgrading first (we don't upgrade just for upgrading's sake). And, as I said: the issue of the wrong timing remains unless I try deeper changes to Slurm's logic.

The other issue is that we had a kludge in place anyway to enable slurmctld to power on nodes via IPMI. The machine slurmctld runs on has no access to the IPMI network itself, so we had to build a polling communication channel to the node which has that access (and which is on another security layer, hence no ssh into it). For all I know, this communication kludge is not to blame: in the spurious failures, the nodes did boot up just fine and were ready. Only slurmctld decided to let the timeout pass first, and then recognize that the slurmd on the node is there, right that instant.

Did your power up/down script workflow work with earlier Slurm versions, too? Did you use it on bare-metal servers or mostly on cloud instances? Do you see a chance for a) fixing the internal power-saving logic to only allocate nodes to a job when these nodes are actually present (ideally, with a health check passing), or b) designing an interface between Slurm as the manager of available resources and another site-specific service responsible for off-/onlining resources that are known to Slurm, but down/drained?

My view is that Slurm's task is to distribute resources among users. The cluster manager (a person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides whether a node is currently available to Slurm or down for maintenance, for example. Power saving would just be another reason for a node being taken out of service. Maybe I hold an old-fashioned minority view …

Alrighty then,

Thomas

PS: I guess solution a) above goes against Slurm's focus on throughput and avoiding delays caused by synchronization points, while our idea here is that batch jobs where that matters should be written differently, packing more than a few seconds' worth of work into each step.

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram
On Wed, 29 Mar 2023 14:42:33 +0200, Ben Polman wrote:

> I'd be interested in your kludge, we face a similar situation where the
> slurmctld node does not have access to the ipmi network and can not ssh
> to machines that have access. We are thinking on creating a rest
> interface to a control server which would be running the ipmi commands

We settled on transient files in /dev/shm on the slurmctld side as the "API". You could call it an in-memory transactional database;-)

#!/bin/sh
# node-suspend and node-resume (symlinked) script
powerdir=/dev/shm/powersave
scontrol=$(cd "$(dirname "$0")" && pwd)/scontrol
hostlist=$1
case $0 in
  *-suspend) subdir=suspend ;;
  *-resume)  subdir=resume  ;;
esac
mkdir -p "$powerdir/$subdir" && cd "$powerdir/$subdir" &&
tmp=$(mktemp XXX.tmp) &&
$scontrol show hostnames "$hostlist" > "$tmp" &&
echo "$(date +%Y%m%d-%H%M%S) $(basename "$0") $(cat "$tmp" | tr '\n' ' ')" >> "$powerdir/log"
mv "$tmp" "${tmp%.tmp}.list"
# end

This atomically creates powersave/suspend/*.list and powersave/resume/*.list files with node names in them. On the privileged server, a script periodically looks at the directories (via ssh) and triggers the appropriate actions, including some heuristics about unclean shutdowns or spontaneous re-availability (with a thousand runs, there's a good chance of something getting stuck, even in some driver code).

#!/bin/sh
powerdir=/dev/shm/powersave
batch()
{
  ssh-wrapper-that-correctly-quotes-argument-list --host=batchhost "$@"
}
while sleep 5
do
  suspendlists=$(batch ls "$powerdir/suspend/" 2>/dev/null | grep '\.list$')
  for f in $suspendlists
  do
    hosts=$(batch cat "$powerdir/suspend/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
        node*|data*)
          echo "suspending $h"
          node-shutdown-wrapper "$h"
        ;;
        *)
          echo "malformed node name"
        ;;
      esac
    done
    batch rm -f "$powerdir/suspend/$f"
  done
  resumelists=$(batch ls "$powerdir/resume/" 2>/dev/null | grep '\.list$')
  for f in $resumelists
  do
    hosts=$(batch cat "$powerdir/resume/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
        node*)
          echo "resuming $h"
          # Assume the node _should_ be switched off. Ensure that now (in
          # case it hung during shutdown).
          if ipmi-wrapper "$h" chassis power status | grep -q 'on$'; then
            if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1
Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram
More log lines around the event:

[2022-11-18T18:37:56.437] node node355 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-11-18T18:37:56.437] requeue job JobId=2522959_9(2523030) due to failure of node node355
[2022-11-18T18:37:56.437] Requeuing JobId=2522959_9(2523030)
[2022-11-18T18:38:07.290] Node node355 now responding
[2022-11-18T18:38:07.290] node node355 returned to service
[2022-11-18T18:39:08.125] error: Nodes node355 not responding
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) WEXITSTATUS 0
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) done
[2022-11-18T18:40:10.834] sched/backfill: _start_job: Started JobId=2522959_9(2523030) in gpd on node355
[…]
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) WEXITSTATUS 0
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) done

Here, the job got requeued, so maybe this is mainly unnecessary noise. I see that the failure to start the job occurred at 18:37:56.437, while the node came back 10 seconds later. Then it went unresponsive again a minute later, came back … maybe network weirdness.

Ah, digging through my own messages to the users, I see that we had two distinct issues:

1. Jobs being aborted with NODE_FAIL because: “the batch system asks to resume nodes that aren't actually suspended. This triggered a workaround [for] issues during powerdown, doing a power cycle on them.”

2. Requeued jobs like the above.

So, point 1 was a bad interaction of the slurmctld resume script with my reboot mechanism. For $reasons, nodes could get stuck on shutdown (boot, actually), and my resume script cycled the power to ensure a fresh start if a node is powered on. I then modified it to only do that if there is no live instance of slurmd on the node. I never figured out why a resume request got issued for nodes that are not down, but it seems to be a thing that supposedly can happen.

Point 2 could be valid behaviour for nodes that just need a bit too long … or spuriously triggered by the slurmd messages for getting back into service running into congestion issues. I did not investigate this further, but rather avoided this tight coupling of reboots of possibly flaky hardware (ages 6 and up) and scheduled jobs.

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
[slurm-users] Trying to use PMIx via srun for openmpi + pmix 5.0.7, PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056 (but works with --mpi=pmix_v3?!)
[…]unch_tasks
srun: launching StepId=671133.4 on host n164, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n164, 1 tasks started
hello world from processor n164, rank 0 out of 1
srun: Received task exit notification for 1 task of StepId=671133.4 (status=0x).
srun: n164: task 0: Completed
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit

How can a PMIx 5 MPI even work with the pmix_v3 plugin? Why does it _not_ work with the pmix_v5 plugin?

I am also curious why the plugins don't link to the respective libpmix (are they using dlopen for their dependencies? Why?).

$ ldd /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so:
        linux-vdso.so.1 (0x7ffd19ffb000)
        libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x14d199e56000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x14d199c7)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x14d199b9)
        /lib64/ld-linux-x86-64.so.2 (0x14d199edd000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so:
        linux-vdso.so.1 (0x7ffd265f2000)
        libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x1553902c8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x1553900e2000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x155390002000)
        /lib64/ld-linux-x86-64.so.2 (0x15539035)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so:
        linux-vdso.so.1 (0x7ffd862b7000)
        libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x145adc36d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x145adc187000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x145adc0a7000)
        /lib64/ld-linux-x86-64.so.2 (0x145adc3f4000)

But they do have the proper RPATH set up:

$ readelf -d /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so | grep -e ^File -e PATH
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/3.2.5/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]

Which is important, since libpmix doesn't get sensible SONAME versioning (supposing they are supposed to be separate ABIs):

$ find /syssw/pmix/* -name 'libpmix.so*'
/syssw/pmix/3.2.5/lib/libpmix.so
/syssw/pmix/3.2.5/lib/libpmix.so.2.2.35
/syssw/pmix/3.2.5/lib/libpmix.so.2
/syssw/pmix/5.0.7/lib/libpmix.so
/syssw/pmix/5.0.7/lib/libpmix.so.2.13.7
/syssw/pmix/5.0.7/lib/libpmix.so.2

It's all libpmix.so.2.
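One way to probe the dlopen question (a diagnostic sketch, not an answer; the plugin paths are the ones from this installation): list the undefined PMIx symbols the plugins expect to be resolved at runtime, and check whether a libpmix file name is embedded, which would hint at an explicit dlopen().

#!/bin/sh
# Hypothetical check: undefined PMIx symbols and embedded libpmix names
# in the Slurm MPI plugins of this installation.
for p in /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
do
    echo "== $p"
    nm -D "$p" | grep ' U ' | grep -i pmix | head -n 5
    strings "$p" | grep -i 'libpmix\.so' || echo "(no embedded libpmix name found)"
done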
My mpihello program uses the 5.0.7 one, at least:

$ ldd mpihello
        linux-vdso.so.1 (0x7fff1f13f000)
        libmpi.so.40 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libmpi.so.40 (0x14ed89d95000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x14ed89baf000)
        libopen-pal.so.80 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libopen-pal.so.80 (0x14ed89a25000)
        libfabric.so.1 => /syssw/fabric/1.21.0/lib/libfabric.so.1 (0x14ed8989b000)
        libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x14ed8988d000)
        libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x14ed8986c000)
        libpsm2.so.2 => /syssw/psm2/12.0.1/lib/libpsm2.so.2 (0x14ed89804000)
        libatomic.so.1 => /sw/compiler/gcc-13.3.0/lib64/libatomic.so.1 (0x14ed897fb000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x14ed8976a000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x14ed89745000)
        libpmix.so.2 => /syssw/pmix/5.0.7/lib/libpmix.so.2 (0x14ed8951e000)
        libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x14ed894e8000)
        libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x14ed894e3000)
        libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x14ed89486000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x14ed893a6000)
        /lib64/ld-linux-x86-64.so.2 (0x14ed8a0d8000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x14ed89397000)

Can someone shed light on how the differing PMIx plugins are supposed to work? Can someone share a setup where pmix_v5 does work with openmpi 5.x?

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg