[slurm-users] job_container/tmpfs and srun.

2024-01-09 Thread Phill Harvey-Smith

Hi all,

On our setup we are using job_container/tmpfs to give each job its own 
temp space. Since our compute nodes have reasonably sized local disks, 
for tasks that do a lot of disk I/O on users' data we have asked users 
to copy their data to the local disk at the beginning of the task and 
(if needed) copy it back at the end. This saves a lot of NFS thrashing 
that would otherwise slow down both the task and the NFS servers.


However, some of our users are having problems with this: their initial 
sbatch script will create a temp directory in their private /tmp, copy 
their data to it, and then try to srun a program. The srun will fall 
over, as it doesn't seem to have access to the copied data. I suspect 
this is because the srun task is getting its own private /tmp.


So my question is: is there a way to have the srun task inherit the /tmp 
of the initial sbatch?
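
One workaround I've been considering (untested, and the path below is 
just a placeholder for whatever node-local scratch area exists outside 
the job_container mounts) is to create the job directory somewhere other 
than the per-job private /tmp, so every step on the node sees the same 
directory:

# Untested sketch: /local/scratch is a placeholder path that would
# need to exist on the node outside the private /tmp mount.
jobtmpdir=$(mktemp -d /local/scratch/job-${SLURM_JOB_ID}.XXXXXX)
export TMPDIR="${jobtmpdir}"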


I'll include a sample of the script our user is using below.

If any further information is required please feel free to ask.

Cheers.

Phill.


#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:10
#SBATCH --mem-per-cpu=3999
#SBATCH --output=script_out.log
#SBATCH --error=script_error.log

# The above options put the STDOUT and STDERR of sbatch in
# log files prefixed with 'script_'.

# Create a randomly-named directory under /tmp
jobtmpdir=$(mktemp -d)

# Register a function to try to clean up in case of job failure
cleanup_handler()
{
    echo "Cleaning up ${jobtmpdir}"
    rm -rf "${jobtmpdir}"
}
trap 'cleanup_handler' SIGTERM EXIT

# Change working directory to the temporary directory,
# bailing out if it can't be entered.
cd "${jobtmpdir}" || exit 1

# Copy the executable and input files from
# where the job was submitted to the temporary directory.
cp "${SLURM_SUBMIT_DIR}"/a.out .
cp "${SLURM_SUBMIT_DIR}"/input.txt .

# Run the executable, handling the collection of stdout
# and stderr ourselves by redirecting to file
srun ./a.out 2> task_error.log > task_out.log

# Copy output data back to the submit directory.
cp output.txt "${SLURM_SUBMIT_DIR}"
cp task_out.log "${SLURM_SUBMIT_DIR}"
cp task_error.log "${SLURM_SUBMIT_DIR}"

# Cleanup
cd "${SLURM_SUBMIT_DIR}"
cleanup_handler
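
As a quick sanity check (an untested sketch), something like the 
following in a batch script should show whether the srun step sees a 
different /tmp from the sbatch shell:

# Create a marker in the sbatch shell's /tmp, then list /tmp from
# both the sbatch shell and a job step.
touch /tmp/marker-${SLURM_JOB_ID}
ls /tmp
srun ls /tmp
# If the marker is missing from the srun output, the step has been
# given its own private /tmp.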



Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error

2024-01-09 Thread Timony, Mick
You could enable debug logging on your slurm controllers to see if that 
provides some more useful info. I'd also check your firewall settings to 
make sure you're not blocking some traffic that you shouldn't; 
iptables -F will clear your local Linux firewall.
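
As a rough check (assuming the default Slurm ports of 6817/6818/6819 for 
slurmctld/slurmd/slurmdbd), you can verify the daemons are listening and 
reachable; the hostnames below are placeholders:

ss -tlnp | grep -E '6817|6818|6819'      # on each Slurm host
# From a compute node, test the controller and dbd ports:
timeout 2 bash -c '</dev/tcp/slurm-controller/6817' && echo ctld ok
timeout 2 bash -c '</dev/tcp/slurm-dbd-host/6819' && echo dbd ok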

I'd also triple-check the UID on all the systems, and run this on all 
your compute nodes, slurm controllers, and the slurmdbd host to make 
sure it is the same! 🙂

id 5
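
If you have pdsh installed, one way to sweep that across the cluster in 
one go (the node list is a placeholder):

pdsh -w node[01-16],controller,dbd 'id 5' | dshbak -c
# Any host that reports a different name, or "no such user",
# is the one to fix.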

I'd also restart all the slurm daemons on all the systems, to make sure 
that you don't have systems running a daemon from before you created 
UID 5, as running processes often don't pick up changes like that unless 
they're restarted.
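
Assuming the stock systemd units, that would be something like:

systemctl restart slurmdbd     # on the database host
systemctl restart slurmctld    # on the controller(s)
systemctl restart slurmd       # on each compute node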


Cheers
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--


From: slurm-users on behalf of Craig Stark
Sent: Monday, January 8, 2024 5:46 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error
Subject: Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error

This ticket with SchedMD implies it's a munged issue:

https://bugs.schedmd.com/show_bug.cgi?id=1293

Is the munge daemon running on all systems? If it is, are all servers 
running a network time daemon such as chronyd or ntpd, and is the time 
in sync on all hosts?

Thanks Mick,

munge is seemingly running on all systems (systemctl status munge). I do 
get a warning about the munge file changing on disk, but I'm pretty sure 
that's from warewulf syncing files every minute. A sha256sum on the 
munge.key file on the compute nodes and host node says they're the same, 
so I think I can put that aside.
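
For what it's worth, round-tripping a credential between hosts also 
looks like a reasonable test (using one of our compute nodes):

munge -n | unmunge              # local encode/decode
munge -n | ssh sonic01 unmunge  # decode on a compute node
# A key mismatch, UID/GID mismatch, or clock skew should show up
# as a decode error here.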

The management node runs chrony and the compute nodes sync to the management 
node.
[root@kirby uber]# chronyc tracking
Reference ID: 4A06A849 (t2.time.gq1.yahoo.com)
Stratum : 3
Ref time (UTC)  : Mon Jan 08 22:26:44 2024
System time : 0.32525 seconds slow of NTP time
Last offset : -0.21390 seconds
RMS offset  : 0.55729 seconds
Frequency   : 38.797 ppm slow
Residual freq   : +0.001 ppm
Skew: 0.018 ppm
Root delay  : 0.033342984 seconds
Root dispersion : 0.000524800 seconds
Update interval : 256.8 seconds
Leap status : Normal

vs
[root@sonic01 ~]# chronyc tracking
Reference ID: C0A80102 (warewulf)
Stratum : 4
Ref time (UTC)  : Mon Jan 08 22:31:02 2024
System time : 0.00120 seconds slow of NTP time
Last offset : -0.00092 seconds
RMS offset  : 0.14737 seconds
Frequency   : 47.495 ppm slow
Residual freq   : +0.000 ppm
Skew: 0.066 ppm
Root delay  : 0.033458963 seconds
Root dispersion : 0.000283949 seconds
Update interval : 64.2 seconds
Leap status : Normal

So, the compute node is talking to the host and the host is talking to 
generic NTP sources. "date" shows the same time on the compute nodes.


[slurm-users] Beginner admin question: Prioritization within a partition based on time limit

2024-01-09 Thread Kenneth Chiu
I'm just learning about Slurm. I understand that different partitions 
can be prioritized separately and can have different max time limits. I 
was wondering whether there is a way to have finer-grained 
prioritization based on the time limit specified by a job, within a 
single partition. Or perhaps this is already happening by default? Would 
the backfill scheduler be best for this?


Re: [slurm-users] Beginner admin question: Prioritization within a partition based on time limit

2024-01-09 Thread Paul Edmon
Yeah, that's sort of the job of the backfill scheduler, as shorter jobs 
will fit better into the gaps. There are several options within the 
priority framework that you can use to dial in which jobs get which 
priority. I recommend reading through all of those and finding the 
options that will work best for the policy you want to implement.
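
As a rough sketch (not a recommendation, just pointing at the relevant 
slurm.conf knobs for the multifactor plugin), something along these 
lines weights job size into the priority, and SMALL_RELATIVE_TO_TIME 
folds the job's time limit into that size factor:

SchedulerType=sched/backfill
PriorityType=priority/multifactor
PriorityFavorSmall=YES                 # smaller jobs score higher
PriorityFlags=SMALL_RELATIVE_TO_TIME   # size factor = size / time limit
PriorityWeightJobSize=1000
PriorityWeightAge=1000
PriorityWeightFairshare=10000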


-Paul Edmon-

On 1/9/2024 10:43 AM, Kenneth Chiu wrote:
I'm just learning about Slurm. I understand that different partitions 
can be prioritized separately and can have different max time limits. I 
was wondering whether there is a way to have finer-grained 
prioritization based on the time limit specified by a job, within a 
single partition. Or perhaps this is already happening by default? Would 
the backfill scheduler be best for this?