[slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt

2020-01-29 Thread Dr. Thomas Orgis
Hi,

I happen to run a small cluster that doesn't use slurmdbd but the plain
text log, i.e. accounting_storage/filetxt in slurm.conf. It runs Ubuntu
16.04 LTS with the Slurm 15.08.7 provided by that release.

Can someone tell me if it is normal (or at least a known bug) that with
this setup,

sacct -X -N node04

always returns the full list of all jobs recorded in the accounting
file? Likewise, filtering by start/end time with -S/-E has no effect at
all. It is also apparent that sacct takes a really long time to digest
the text file, even when it is reduced to less than 5 jobs and about 6M
in size. To give a figure: 24 seconds of user time, keeping one core of
a Xeon E5-2609 v2 @ 2.5 GHz busy.
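For reference, this is roughly what the setup and the failing queries
look like (the storage file location below is just an illustrative
stand-in, not necessarily our path):

# relevant slurm.conf bits
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/accounting

# each of these still prints every job ever recorded
sacct -X -N node04
sacct -X -S 2020-01-01 -E 2020-01-15
sacct -X -N node04 -S 2020-01-01T00:00:00 -E 2020-01-15T00:00:00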


Alrighty then,

Thomas

PS: Please no pointers about running a proper database with slurmdbd
instead … I know that works ;-)

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt

2020-01-30 Thread Dr. Thomas Orgis
On Thu, 30 Jan 2020 19:07:59 +0300, mercan wrote:

>   Note: The filetxt plugin records only a limited subset of accounting 
> information and will prevent some sacct options from proper operation.

Thank you for looking this up. But since the filetxt log does contain
the start/end timestamps and the nodes the job ran on, it is strange
that sacct should not be able to filter on those criteria.

This is an example line from the accounting file, with obviously
identifying fields redacted:

 batch 1548429637 1548429637   - - 0  1 4294536312 48 node[09-15,22] (null)

So matching on job ID, user name (via a numerical uid lookup),
timestamps and the nodes should be possible; it's all there.
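In fact, a crude stand-in for sacct -N can be cobbled together directly
on the text file. A sketch, where the file location and the assumption
that the node list is the only hostlist-looking field are purely
illustrative:

acct=/var/log/slurm-llnl/accounting
want=node04
while IFS= read -r line; do
  # pull out anything that looks like a node name or hostlist expression
  lists=$(printf '%s\n' "$line" | grep -o 'node\[[^]]*\]\|node[0-9][0-9]*')
  for nl in $lists; do
    # expand hostlist expressions and check for the wanted node
    if scontrol show hostnames "$nl" 2>/dev/null | grep -qx "$want"; then
      printf '%s\n' "$line"
      break
    fi
  done
done < "$acct"

Not pretty, but it shows the data is there to filter on.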

Can someone confirm that it indeed is the case that _none_ of the
filtering options of sacct are supposed to work on filetxt?


Alrighty then,

Thomas
-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt

2020-01-30 Thread Dr. Thomas Orgis
On Thu, 30 Jan 2020 19:03:38 +0100, "Dr. Thomas Orgis" wrote:

>  batch 1548429637 1548429637   - - 0  1 4294536312 48 node[09-15,22] (null)
> 
> So, matching for job ID, user name (via numerical uid lookup),
> timestamps and the nodes should be possible, it's all there.
> 
> Can someone confirm that it indeed is the case that _none_ of the
> filtering options of sacct are supposed to work on filetxt?

Matching by user (-u) and job ID (-j) works, but -N/-S/-E do not. So is
this simply the current state, and is it up to me to provide a patch to
enable that behaviour if I want it?

Alrighty then,

Thomas
-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] sacct always prints all jobs regardless of filter parameters with accounting_storage/filetxt

2020-02-02 Thread Dr. Thomas Orgis
On Fri, 31 Jan 2020 20:57:16 -0800, Chris Samuel wrote:

> You're using a very very very old version of slurm there (15.08)

Well, that's what happens when an application goes mainstream and gets
included in the OS distribution. On this cluster, we just try to run
with what Ubuntu LTS gives us. And I guess it is not atypical to keep a
base OS release for the lifetime of a cluster; this one has already had
two upgrades to newer LTS releases (another one would probably involve
rebuilding the cluster) …

You should get accustomed to people asking questions about _really_
outdated Slurm versions ;-)

> should upgrade to a recent one (I'd suggest 19.05.5) to check whether 
> it's been fixed in the intervening years.

But this seems to confirm my suspicion that people were not all that
concerned with the simple text storage … otherwise it would be better
known whether this works or not.

I tried running sacct from 19.05.5 on our config without touching the
running instance of Slurm and it at least doesn't complain.
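In case anyone wants to reproduce that: a separately built sacct can be
pointed at an existing setup via the SLURM_CONF environment variable,
something like this (install prefix and config path are just examples):

export SLURM_CONF=/etc/slurm-llnl/slurm.conf
/opt/slurm-19.05.5/bin/sacct -X -N node04 -S 2020-01-29 -E 2020-01-30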

The only difference I can see is that the default changed from showing
all jobs in the history to showing none. Adding -u and -j works, but
-N/-S/-E are still ignored. So apparently, development quickly moved on
to database storage before anyone got annoyed by the long job lists, or
the functionality was removed accidentally after plain text storage went
out of fashion.

Well, since running a modern sacct directly on the old system works, I
might get around to hacking in the missing matches. Any helpful pointers
before I blindly dig into the code?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



[slurm-users] Access to slurm job cgroups in prolog/epilog script

2021-03-08 Thread Dr. Thomas Orgis
Hi,

I am wondering about the exact execution order of prolog scripts and
plugins in Slurm, with the goal of accessing the freshly created cgroups
(created by the task/cgroup plugin) from our prolog/epilog scripts,
which run with PrologFlags=Alloc to ensure the traditional batch system
behaviour.

We want two pieces of information: the prepared cpuset for the job in
the prolog, and the statistics/counter differences in the epilog. I am
aware of the accounting and profiling options via Slurm and its plugins,
but there are reasons I want to handle the cgroup information myself,
maybe even to experiment with things that might go into a Slurm plugin
at some point.

The job cgroups are created after the prolog scripts have run and
destroyed before the epilog scripts run (correct? — it looks like that).
The design seems to focus on individual job steps, with things running
closely coupled to the possibly multiple components (steps, tasks) of
batch jobs, of which I only have a hazy concept.

Is there a standard way to have the cgroup hierarchy for the job created
early, before the per-node prolog script that runs as root (slurmd
user), with the final cleanup happening later, after the epilog has run?
If configuration doesn't do it, I thought about modifying task/cgroup,
but I suspect that the whole scope of the plugin lies between the prolog
and the epilog. Can someone confirm that? I welcome pointers to
documentation that explains in detail when which parts of a plugin are
run in relation to the slot the prolog scripts get.

With https://bugs.schedmd.com/show_bug.cgi?id=9429, there seems to be a
way to keep the cgroup around longer: just sabotage the cleanup phase
and do it later in the epilog (as I do now on an Ubuntu 20.04 cluster
with the distro-provided slurmd that suffers from this bug). But would,
e.g., moving the code from task_p_pre_setuid() to
task_p_slurmd_reserve_resources() give me early access in the prolog? I
might just try it and break something, but I haven't yet found
documentation on these details of the plugin API, and for once I thought
asking around first might also be a good idea.

I want cpuset information and at least things like per-node memory
high-water marks. The desired granularity is the job level, and it would
be nice to get rid of inefficient time series that only approximate
that. The cpuset is needed before the user programs start, as I hook a
listener onto the taskstats interface to cheaply and accurately account
for user processes (kernel tasks) with command names. My profiling sits
somewhere between the hdf5 time series and the rough values you get out
of sacct, with an orthogonal bit about kernel tasks (to tell the user
how many python processes wasted how much memory each).
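To make it concrete: in the epilog I'd like to do something at the level
of this sketch, assuming the cgroup v1 layout slurm/uid_<uid>/job_<jobid>
that task/cgroup appears to create (exact paths would need checking on
the actual system, and right now this only works because the cleanup is
delayed):

#!/bin/sh
# epilog sketch: record per-job cpuset and memory high-water mark
cg=slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
cpus=$(cat /sys/fs/cgroup/cpuset/$cg/cpuset.cpus 2>/dev/null)
maxmem=$(cat /sys/fs/cgroup/memory/$cg/memory.max_usage_in_bytes 2>/dev/null)
echo "$(date +%s) job=$SLURM_JOB_ID cpus=$cpus maxmem=$maxmem" \
  >> /var/log/slurm/job_cgroup_stats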


Alrighty then,

Thomas

PS: I guess a lot is possible by writing a custom plugin that ties in
with what my prolog/epilog scripts do, but I'd prefer a light touch
first. Hacking on the scripts during development is far more convenient.

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-27 Thread Dr. Thomas Orgis
On Mon, 06 Mar 2023 13:35:38 +0100, Stefan Staeglich wrote:

> But this fixed not the main error but might have reduced the frequency of 
> occurring. Has someone observed similar issues? We will try a higher 
> SuspendTimeout.

We had issues with power saving, too. We powered idle nodes off
completely, so resuming meant a full boot. We repeatedly observed the
strange behaviour that a node would be up for a while, but slurmctld
would only detect it as ready at the very moment it gave up after
SuspendTimeout.

But instead of fixing this possibly subtle logic error, we figured that

a) The node suspend support in Slurm was not really designed for full
   power off/on, which can regularly take minutes.

b) This functionality of taking nodes out of/into production is
   something the cluster admin does. This is not in the scope of the
   batch system.

Hence I wrote a script that runs as a service on a superior admin node.
It queries Slurm for idle nodes and pending jobs and then decides which
nodes to drain and power down, or which to bring back online.
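Stripped of the pending-job matching and the bring-back-online half, the
decision loop boils down to something like this sketch (the threshold
and node-shutdown-wrapper, standing in for the actual IPMI power-off,
are placeholders):

#!/bin/sh
# keep a minimum number of nodes idle, drain and power off the surplus
keep_idle=2
idle=$(sinfo -h -N -t idle -o '%N' | sort -u)
pending=$(squeue -h -t PD | wc -l)
n_idle=$(printf '%s\n' $idle | grep -c .)
if [ "$pending" -eq 0 ] && [ "$n_idle" -gt "$keep_idle" ]; then
  for node in $(printf '%s\n' $idle | tail -n +$((keep_idle + 1))); do
    scontrol update NodeName="$node" State=DRAIN Reason="powersave"
    node-shutdown-wrapper "$node"
  done
fi
# the reverse path undrains a node only after it booted and slurmd responds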

This needs more knowledge of Slurm job and node states than I'd like,
but it works. Ideally, I'd like the power saving feature of Slurm to
consist of a simple interface that can communicate

1. which nodes are probably not needed in the coming x minutes/hours,
   depending on the job queue, with settings like keeping a minimum number
   of nodes idle, and
2. which nodes that are currently drained/offline it could use to satisfy
   user demand.

I imagine that Slurm upstream is not very keen on hashing out a robust
interface for that. I can see arguments for keeping this wholly internal
to Slurm, but for me, taking nodes in and out of production is not
directly a batch system's task. Obviously, integrating power saving that
involves nodes really being powered down brings complications like the
strange ResumeTimeout behaviour. Also, in the case of nodes that have
trouble getting back online, the method inside Slurm makes for a bad
user experience:

The nodes are first allocated to the job, and _then_ they are powered
up. In the worst case of a defective node, Slurm will wait for the
whole SuspendTimeout just to realize that it doesn't really have the
resources it just promised to the job, making the job run attempt fail
needlessly.

With my external approach, the handling of bringing a node back up is
done outside slurmctld. Only after a node is back is it undrained, and
only then are jobs allocated on it. I use draining with a specific
reason to mark nodes that are offline due to power saving. What sucks is
that I have to reimplement part of the scheduler, in the sense that I
need to match pending jobs' demands against the properties of available
nodes.

Maybe the internal power saving could be made more robust, but I would
rather see more separation of concerns than putting everything into one
box. Things are too entangled already, and my simple concept of a 'job'
does not even begin to describe what Slurm has in terms of various steps
as scheduling entities, which by default also use delayed allocation
techniques (regarding prolog script behaviour, for example).


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis
On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen wrote:

> FYI: Slurm power_save works very well for us without the issues that you 
> describe below.  We run Slurm 22.05.8, what's your version?

I'm sure there are setups where it works nicely ;-) For us, it didn't,
and I was faced with either hunting the bug in Slurm or working around
it with more control, also fixing the underlying issue of the node
resume script being called _after_ the job has been allocated to the
node. That is too late in the case of a node bootup failure and causes
annoying delays for users, only for them to see their jobs fail.

We run 21.08.8-2, which means any debugging of this on the Slurm side
would mean upgrading first (we don't upgrade just for upgrading's sake).
And, as I said: the issue of the wrong timing remains unless I attempt
deeper changes to Slurm's logic. The other issue is that we had a kludge
in place anyway to enable slurmctld to power on nodes via IPMI. The
machine slurmctld runs on has no access to the IPMI network itself, so
we had to build a polling communication channel to the node which does
have this access (and which is on another security layer, hence no ssh
into it). For all I know, this communication kludge is not to blame: in
the spurious failures, the nodes did boot up just fine and were ready.
Only slurmctld decided to let the timeout pass first, and then
recognized that the slurmd on the node was there, right at that instant.

Did your power up/down script workflow work with earlier slurm
versions, too? Did you use it on bare metal servers or mostly on cloud
instances?

Do you see a chance for

a) fixing up the internal power saving logic so that nodes are allocated
   to a job only when they are actually present (ideally, with a health
   check passing), or
b) designing an interface between slurm as manager of available
   resources and another site-specific service responsible for off-/onlining
   resources that are known to slurm, but down/drained?

My view is that Slurm's task is to distribute resources among users.
The cluster manager (person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides
if a node is currently available to Slurm or down for maintenance, for
example. Power saving would be another reason for a node being taken
out of service.

Maybe I hold an old-fashioned minority view …


Alrighty then,

Thomas

PS: I guess solution a) above goes against Slurm's focus on throughput
and avoiding delays caused by synchronization points, while our idea here
is that batch jobs where that matters should be written differently,
packing more than a few seconds worth of work into each step.

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis
On Wed, 29 Mar 2023 14:42:33 +0200, Ben Polman wrote:

> I'd be interested in your kludge, we face a similar situation where the 
> slurmctld node
> does not have access to the ipmi network and can not ssh to machines 
> that have access.
> We are thinking on creating a rest interface to a control server which 
> would be running the ipmi commands

We settled on transient files in /dev/shm on the slurmctld side as the
"API". You could call it an in-memory transactional database ;-)

#!/bin/sh
# node-suspend and node-resume (symlinked) script

powerdir=/dev/shm/powersave
scontrol=$(cd "$(dirname "$0")" && pwd)/scontrol
hostlist=$1

case $0 in
*-suspend)
  subdir=suspend
;;
*-resume)
  subdir=resume
;;
esac

mkdir -p "$powerdir/$subdir" &&
cd "$powerdir/$subdir" &&
tmp=$(mktemp XXX.tmp) &&
$scontrol show hostnames "$hostlist" > "$tmp" &&
echo "$(date +%Y%m%d-%H%M%S) $(basename $0) $(cat "$tmp"|tr '\n' ' ')" >> 
$powerdir/log
mv "$tmp" "${tmp%.tmp}.list"
# end

This atomically creates powersave/suspend/*.list and
powersave/resume/*.list files with node names in them.

On the privileged server, a script periodically looks at these
directories (via ssh) and triggers the appropriate actions, including
some heuristics for unclean shutdowns or spontaneous re-availability
(with a thousand runs, there's a good chance of something getting stuck,
even in some driver code).

#!/bin/sh

powerdir=/dev/shm/powersave

batch()
{
  ssh-wrapper-that-correctly-quotes-argument-list --host=batchhost "$@"
}

while sleep 5
do
  suspendlists=$(batch ls "$powerdir/suspend/" 2>/dev/null | grep '.list$')
  for f in $suspendlists
  do
    hosts=$(batch cat "$powerdir/suspend/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
      node*|data*)
        echo "suspending $h"
        node-shutdown-wrapper "$h"
      ;;
      *)
        echo "malformed node name"
      ;;
      esac
    done
    batch rm -f "$powerdir/suspend/$f"
  done
  resumelists=$(batch ls $powerdir/resume/ 2>/dev/null | grep '.list$')
  for f in $resumelists
  do
    hosts=$(batch cat "$powerdir/resume/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
      node*)
        echo "resuming $h"
        # Assume the node _should_ be switched off. Ensure that now (in
        # case it hung during shutdown).
        if ipmi-wrapper "$h" chassis power status|grep -q on$; then
          if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1 


Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-04-03 Thread Dr. Thomas Orgis
Some more log lines around the event:

[2022-11-18T18:37:56.437] node node355 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-11-18T18:37:56.437] requeue job JobId=2522959_9(2523030) due to failure of node node355
[2022-11-18T18:37:56.437] Requeuing JobId=2522959_9(2523030)
[2022-11-18T18:38:07.290] Node node355 now responding
[2022-11-18T18:38:07.290] node node355 returned to service
[2022-11-18T18:39:08.125] error: Nodes node355 not responding
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) WEXITSTATUS 0
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) done
[2022-11-18T18:40:10.834] sched/backfill: _start_job: Started JobId=2522959_9(2523030) in gpd on node355
[…]
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) WEXITSTATUS 0
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) done

Here, the job got requeued. So maybe mainly unnecessary noise.

I see that the failure to start the job occurred at 18:37:56.437, while
the node came back 10 seconds later. Then it went away a minute later
and came back again … maybe network weirdness.

Ah, digging through my own messages to the users, I see that we had two
distinct issues:

1. Jobs being aborted with NODE_FAIL because: “batch system asks to
resume nodes that aren't actually suspended. This triggered a workaround
[for] issues during powerdown, doing a power cycle on them.”

2. Requeued jobs like the above.

So, point 1 was a bad interaction of the slurmctld resume script with my
reboot mechanism. For $reasons, nodes could get stuck during shutdown
(during boot, actually), and my resume script cycled the power to ensure
a fresh start if a node was already powered on. I then modified it to
only do that if there is no live instance of slurmd on the node. I never
figured out why a resume request got issued for nodes that were not
down, but it seems to be something that can supposedly happen.
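Concretely, the guard in the resume path is now along these lines
(wrapper names as in the scripts I posted earlier, so take it as a
sketch rather than the literal code):

# only force a power cycle if the node is powered but slurmd isn't reachable
if ipmi-wrapper "$h" chassis power status | grep -q 'on$'; then
  if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1; then
    echo "$h already up with a live slurmd, not touching power"
  else
    echo "$h powered but unresponsive, cycling power for a fresh start"
    ipmi-wrapper "$h" chassis power cycle
  fi
else
  ipmi-wrapper "$h" chassis power on
fi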

Point 2 could be valid behaviour for nodes that simply take a bit too
long … or it could be spuriously triggered by the slurmd back-in-service
messages running into congestion issues. I did not investigate this
further, but rather avoided this tight coupling of reboots of possibly
flaky hardware (ages 6 and up) with scheduled jobs.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



[slurm-users] Trying to use PMIx via srun for openmpi + pmix 5.0.7, PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056 (but works with --mpi=pmix_v3?!)

2025-05-06 Thread Dr. Thomas Orgis via slurm-users
unch_tasks
srun: launching StepId=671133.4 on host n164, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: Node n164, 1 tasks started
hello world from processor n164, rank 0 out of 1
srun: Received task exit notification for 1 task of StepId=671133.4 (status=0x).
srun: n164: task 0: Completed
srun: debug:  task 0 done
srun: debug:  IO thread exiting
srun: debug:  mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug:  mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit

How can a PMIx 5 MPI even work with the pmix_v3 plugin? Why does it
_not_ work with the pmix_v5 plugin? I am also curious why the plugins
don't link to the respective libpmix (are they using dlopen for their
dependencies? Why?).
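One generic way to poke at the dlopen question (nothing Slurm-specific,
just binutils, so only a rough check): if the plugin references
undefined PMIx_* symbols but, as the ldd output below shows, has no
direct libpmix dependency, the symbols must be provided at run time by
something else; a libpmix.so string inside the plugin would hint that it
dlopens the library itself.

$ nm -D --undefined-only /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so | grep -c ' PMIx_'
$ strings /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so | grep -i 'libpmix'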

$ ldd /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so:
linux-vdso.so.1 (0x7ffd19ffb000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x14d199e56000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x14d199c7)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x14d199b9)
/lib64/ld-linux-x86-64.so.2 (0x14d199edd000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so:
linux-vdso.so.1 (0x7ffd265f2000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x1553902c8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x1553900e2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x155390002000)
/lib64/ld-linux-x86-64.so.2 (0x15539035)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so:
linux-vdso.so.1 (0x7ffd862b7000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x145adc36d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x145adc187000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x145adc0a7000)
/lib64/ld-linux-x86-64.so.2 (0x145adc3f4000)

But they do have the proper RPATH set up:

$ readelf -d /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so | grep -e ^File -e PATH
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/3.2.5/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so
 0x000f (RPATH)  Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]

Which is important, since libpmix doesn't get sensible SONAME versioning
(assuming the two versions are supposed to be separate ABIs):

$ find  /syssw/pmix/* -name 'libpmix.so*'
/syssw/pmix/3.2.5/lib/libpmix.so
/syssw/pmix/3.2.5/lib/libpmix.so.2.2.35
/syssw/pmix/3.2.5/lib/libpmix.so.2
/syssw/pmix/5.0.7/lib/libpmix.so
/syssw/pmix/5.0.7/lib/libpmix.so.2.13.7
/syssw/pmix/5.0.7/lib/libpmix.so.2

It's all libpmix.so.2. My mpihello program uses the 5.0.7 one, at least:

$ ldd mpihello
linux-vdso.so.1 (0x7fff1f13f000)
libmpi.so.40 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libmpi.so.40 (0x14ed89d95000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x14ed89baf000)
libopen-pal.so.80 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libopen-pal.so.80 (0x14ed89a25000)
libfabric.so.1 => /syssw/fabric/1.21.0/lib/libfabric.so.1 (0x14ed8989b000)
libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x14ed8988d000)
libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x14ed8986c000)
libpsm2.so.2 => /syssw/psm2/12.0.1/lib/libpsm2.so.2 (0x14ed89804000)
libatomic.so.1 => /sw/compiler/gcc-13.3.0/lib64/libatomic.so.1 (0x14ed897fb000)
libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x14ed8976a000)
libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x14ed89745000)
libpmix.so.2 => /syssw/pmix/5.0.7/lib/libpmix.so.2 (0x14ed8951e000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x14ed894e8000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x14ed894e3000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x14ed89486000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x14ed893a6000)
/lib64/ld-linux-x86-64.so.2 (0x14ed8a0d8000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x14ed89397000)
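The SONAMEs themselves can be checked the same way as the RPATHs above;
given the file names, I'd expect both installs to report a plain
libpmix.so.2:

$ readelf -d /syssw/pmix/3.2.5/lib/libpmix.so.2 /syssw/pmix/5.0.7/lib/libpmix.so.2 | grep -e ^File -e SONAME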


Can someone shed light on how the differing PMIx plugins are supposed
to work? Can someone share a setup where pmix_v5 does work with openmpi
5.x?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg
