Nielsen wrote:
>
> Hi Jeffrey,
>
> We run Slurm 24.05.5. I have now used the RockyLinux 8.10 versions of gcc and
> cmake according to the instructions in README.md, but the build is still
> failing.
>
> /Ole
>
> On 07-01-2025 20:14, Jeffrey Frey wrote:
>> https://github.com/jtfrey/snodelist/issues/1
> Can you help me out?
>
> On 07-01-2025 16:27, Jeffrey Frey via slurm-users wrote:
>> We use a tool that's compiled against the Slurm library itself so that the
>> expansion/contraction of lists is always 100% in sync with Slurm itself:
>> https://github.com/jtfrey/snodelist
We use a tool that's compiled against the Slurm library itself so that the
expansion/contraction of lists is always 100% in sync with Slurm itself:
https://github.com/jtfrey/snodelist
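For one-off conversions that don't need to link against libslurm, scontrol can do the same expansion/contraction from the shell (a hedged sketch; the node names are made up):
$ scontrol show hostnames "r00n[09-12]"
r00n09
r00n10
r00n11
r00n12
$ scontrol show hostlist r00n09,r00n10,r00n11,r00n12
r00n[09-12]
snodelist is for the cases where you want that logic inside another program, guaranteed to match the running Slurm's own list handling.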
> On Jan 7, 2025, at 10:12, Davide DelVento via slurm-users
> wrote:
>
> Wonderful. Thanks Ole for the re
It's still not technically a
swap limit per se, because it entails the sum of swap + physical RAM usage, but
it has kept our nodes from getting starved out thanks to heavy swapping and can
scale with job size, etc.
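For reference, a minimal sketch of the cgroup.conf settings behind that combined RAM+swap cap (hedged; the path and the AllowedSwapSpace value are illustrative, not our exact config):
$ grep -iE '^(ConstrainRAMSpace|ConstrainSwapSpace|AllowedSwapSpace)' /etc/slurm/cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=10
With ConstrainSwapSpace=yes the cgroup's memory+swap limit is set to the job's RAM allocation plus the AllowedSwapSpace percentage, which is why it acts on RAM+swap together rather than on swap alone.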
/*!
@signature Jeffrey Frey, Ph.D
@email f...@udel.edu
@source iPhone
*/
Adding the "--details" flag to scontrol lookup of the job:
$ scontrol --details show job 1636832
JobId=1636832 JobName=R3_L2d
:
NodeList=r00g01,r00n09
BatchHost=r00g01
NumNodes=2 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=60,mem=60G,node=2,billing=55350
Sock
If you check the source code (src/slurmctld/job_mgr.c) this error is indeed
thrown when slurmctld unpacks job state files. Tracing through
read_slurm_conf() -> load_all_job_state() -> _load_job_state():
part_ptr = find_part_record (partition);
if (part_ptr == NULL)
Have you tried
srun -N# -n# mpirun python3
Perhaps you have no MPI environment being set up for the processes? There was
no "--mpi" flag in your "srun" command and we don't know if you have a default
value for that or not.
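If no MPI plugin is being selected, something along these lines may help (hedged; whether pmi2 is the right choice depends on how your MPI stack was built):
$ srun --mpi=list
$ srun --mpi=pmi2 -N2 -n8 python3 my_mpi_script.py
The script name is hypothetical; the idea is to launch the Python tasks directly with srun and an explicit --mpi type rather than nesting mpirun inside srun.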
> On Jul 12, 2019, at 10:28 AM, Chris Samuel wrote:
> it
> executes the task as test (i.e. the output is, by default, in /home/test and
> owned by test). I guess this is a bug?
>
> @Jeffrey Sorry, slurmdUser=sudo was a typo. Thanks a lot for the
> clarifications regarding the POSIX capabilities.
>
>
> On Tue, 9 Jul 2019 at
> So, if I understand this correctly, for some reason, `srun` does not need
> root privileges on the computation node side, but `sbatch` does when
> scheduling. I was afraid doing so would mean users could do things such as
> apt install and such, but it does not seem the case.
The critical par
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT,
which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().
Those functions raise that error when a generic socket-based send/receive
operation exceeds an arbitrary time limit imposed by the caller.
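For most controller/daemon RPCs that caller-imposed limit comes from MessageTimeout in slurm.conf (a hedged pointer, not a diagnosis of your particular failure):
$ scontrol show config | grep -i MessageTimeout
MessageTimeout          = 10 sec
Raising it can paper over a slow network or an overloaded slurmctld, but it's worth working out which of the two you actually have.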
If you're on Linux and using Slurm cgroups, your job processes should be
contained in a memory cgroup. The /proc/<pid>/cgroup file indicates to which
cgroups a process is assigned, so:
$ srun [...] /bin/bash -c "grep memory: /proc/\$\$/cgroup | sed
's%^[0-9]*:memory:%/sys/fs/cgroup/memory%'"
/sys/
Just FYI, there's one minor issue with spart: for pending jobs, the
"partition" parameter can be a comma-separated list of partitions and not just
a single partition name
If the job can use more than one partition, specify their names in a comma
separated list and the one offering earliest initiation will be used.
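For example (hedged; the partition names and job ID are made up):
$ sbatch --partition=standard,idle job.sh
$ squeue -j <jobid> -o '%i %t %P'
While such a job is pending, the %P column can read "standard,idle", which is what trips up tools that assume a single partition name per job.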
Config details:
- Slurm v17.11.8
- QOS-based preemption
- Backfill scheduler (default parameters)
- QOS:
- "normal" = PreemptMode=CANCEL, GraceTime=5 minutes
- Per-stakeholder = Preempt=normal GrpTRES=
- Partitions:
- "standard" (default) = QOS=normal
- Per-stakeholder = QOS=
When users need prio
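For anyone reproducing this layout, the QOS side can be inspected with sacctmgr (a sketch; the per-stakeholder names and limits are whatever your site defines):
$ sacctmgr show qos format=Name,Preempt,PreemptMode,GraceTime,GrpTRES
That shows, per QOS, which QOSes it may preempt, its preempt mode, the grace time, and the group TRES cap referred to above.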
Also see https://slurm.schedmd.com/slurm.conf.html for
MaxArraySize/MaxJobCount.
We just went through a user-requested adjustment to MaxArraySize to bump it
from 1000 to 1; as the documentation states, since each index of an array
job is essentially "a job," you must be sure to also adjust MaxJobCount.
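Both limits can be checked on a live system (the values shown here are just the compiled-in defaults):
$ scontrol show config | grep -E 'MaxArraySize|MaxJobCount'
MaxArraySize            = 1001
MaxJobCount             = 10000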
>
> and then it will treat threads as CPUs and then it will let you start the
> number of tasks you expect
>
> Antony
>
> On Thu, 7 Feb 2019 at 18:04, Jeffrey Frey wrote:
> Your nodes are hyperthreaded (ThreadsPerCore=2). Slurm always allocates _all
> threads
Your nodes are hyperthreaded (ThreadsPerCore=2). Slurm always allocates _all
threads_ associated with a selected core to jobs. So you're being assigned
both threads on core N.
On our development-partition nodes we configure the threads as cores, e.g.
NodeName=moria CPUs=16 Boards=1 SocketsP
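Before overriding the topology it's worth seeing what the node itself reports; slurmd will print the line it would otherwise use (the output here is illustrative only):
$ slurmd -C
NodeName=moria CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=64000
The trick is then to declare ThreadsPerCore=1 and fold the threads into CoresPerSocket so every hardware thread is scheduled as its own core.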
What does ulimit tell you on the compute node(s) where the jobs are running?
The error message you cited arises when a user has reached the per-user process
count limit (e.g. "ulimit -u"). If your Slurm config doesn't limit how many
jobs a node can execute concurrently (e.g. oversubscribe), th
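A quick way to see the limit as the job itself experiences it (hedged; reuse whatever partition/launch options the failing job had):
$ srun [...] /bin/bash -c 'ulimit -u'
Compare that number with how many processes the user's jobs collectively spawn on that node.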
When in doubt, check the source:
extern int select_g_select_nodeinfo_unpack(dynamic_plugin_data_t **nodeinfo,
                                           Buf buffer,
                                           uint16_t protocol_version)
{
        dynamic_plugin_data_t *nodeinfo_ptr = NULL;
For MySQL to use a text column as a primary key, it requires a limit on how
many bytes are significant. Just check through
src/plugins/accounting_storage/mysql/accounting_storage_mysql.c and you'll see
lots of primary keys with "(20)" indexing lengths specified.
With an extant database you ma
I ran into this myself. By default Slurm allocates HT's as pairs (associated
with a single core). The only adequate way I figured out to force HT = core is
to make them full-fledged cores in the config:
NodeName=csk007 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=40
ThreadsPerCore=1 Rea
TRESBillingWeights exists only at the partition level. When a
job executes, it does so in a single partition, and the billing is calculated
using that partition's TRESBillingWeights (possibly multiplied by the job's
QOS's UsageFactor).
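A concrete (hypothetical) example of what that looks like in slurm.conf:
$ grep TRESBillingWeights /etc/slurm/slurm.conf
PartitionName=standard Nodes=r00n[01-32] TRESBillingWeights="CPU=1.0,Mem=0.25G"
PartitionName=gpu Nodes=r01g[01-04] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
A job running in "gpu" under a QOS with UsageFactor=2.0 is billed twice the weighted TRES sum computed from the gpu partition's weights; the other partition's weights never enter into it.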
> On Dec 5, 2018, at 11:33 AM, Jacob Ch
See the documentation at
https://slurm.schedmd.com/heterogeneous_jobs.html#env_var
There are *_PACK_* environment variables in the job env that describe the
heterogeneous allocation. The batch step of the job (corresponding to your
script) executes on the first node of the first part
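To see which of those variables your version actually defines, dump them near the top of the batch script (this just matches the *_PACK_* naming mentioned above):
env | grep _PACK_ | sort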
If you check the sbatch man page, there's no similar variable listed for the
job environment. You can:
(1) write/add to a spank plugin to set that in the job environment
(2) implement a patch yourself and submit it to SchedMD
(3) submit a request to SchedMD (if you have a contract) to have th
Make sure you're using RSA keys in users' accounts -- we'd started setting-up
ECDSA on-cluster keys as we built our latest cluster but libssh at that point
didn't support them. And since the Slurm X11 plugin is hard-coded to only use
~/.ssh/id_rsa, that further tied us to RSA. It would be nice
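Generating the expected key is the usual one-liner (hedged; run it as the user and mind any existing id_rsa):
$ ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa
The empty passphrase matters because the X11 plugin has to open the key non-interactively.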
You could reconfigure the partition node lists on the fly using scontrol:
$ scontrol update PartitionName=regular_part1 Nodes=
:
$ scontrol update PartitionName=regular_partN Nodes=
$ scontrol update PartitionName=maint Nodes=r00n00
Should be easy enough to write a script that finds the partitions
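As a starting point (hedged; this only prints the current mapping, the scontrol updates themselves are left to the reader):
$ scontrol -o show partition | tr ' ' '\n' | grep -E '^(PartitionName|Nodes)='
Each partition appears as a PartitionName= line followed by its Nodes= line, which is enough to decide what to pull out and hand to the maint partition.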
You're mixing up --Format and --format. The latter uses printf-like "%"
syntax:
squeue --format="%20u"
and the former:
squeue --Format="username:20"
> On Aug 29, 2018, at 11:39 AM, Mahmood Naderan wrote:
>
> Hi
> I want to make the user column larger than the default
SLURM_NTASKS is only unset when no task count flags are handed to salloc (no
--ntasks, --ntasks-per-node, etc.). Can't you then assume if it's not present
in the environment you've got a single task allocated to you? So in your
generic starter script instead of using SLURM_NTASKS itself, use a
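A minimal sketch of that fallback (the program name is hypothetical):
ntasks=${SLURM_NTASKS:-1}
srun -n "$ntasks" ./myprogram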
Check the Gaussian log file for mention of its using just 8 CPUs -- just because
there are 12 CPUs available doesn't mean the program uses all of them. It will
scale-back if 12 isn't a good match to the problem as I recall.
/*!
@signature Jeffrey Frey, Ph.D
@email f...@udel.edu
> Jeffrey: It would be very nice if you could document in detail how to
> configure opa2slurm and list all prerequisite RPMs in your README.md.
Added to the README.md: build info and usage info.
::
Jeffrey T. Frey, Ph.D.
Systems Programmer V
mgt-devel packages.
>
> So, since you seem to use the same version as me, I'm not sure why you have
> these linking problems :/
>
>
> Best
> Marcus
>
> On 06/14/2018 09:17 AM, Ole Holm Nielsen wrote:
>> Hi Jeffrey,
>>
>> On 06/13/2018 10:35 PM,
> Thanks Jeffrey for that tool, for me it is working. I changed a little bit
> the CMakeLists.txt such that slurm can be found also in non standard install
> paths ;)
>
> replaced
> SET (SLURM_PREFIX "/usr/local" CACHE PATH "Directory in which SLURM is
> installed.")
> with
> FIND_PATH(SLURM_PR
Intel's OPA doesn't include the old IB net discovery library/API; instead, they
have their own library to enumerate nodes, links, etc. I've started a rewrite
of ye olde "ib2slurm" utility to make use of Intel's new enumeration library.
https://gitlab.com/jtfrey/opa2slurm
E.g.
$ opa
Every cluster I've ever managed has this issue. Once cgroup support arrived in
Linux, the path we took (on CentOS 6) was to use the 'cgconfig' and 'cgred'
services on the login node(s) to setup containers for regular users and
quarantine them therein. The config left 4 CPU cores unused by regu
I had to figure this one out myself a few months ago. See
src/common/plugstack.c:
/*
* Do not load static plugin options table in allocator context.
*/
if (stack->type != S_TYPE_ALLOCATOR)
        plugin->opts = plugin_get_sym(p, "spank_options");
In
Don't propagate the submission environment:
srun --export=NONE myprogram
> On Dec 19, 2017, at 8:37 AM, Yair Yarom wrote:
>
>
> Thanks for your reply,
>
> The problem is that users are running on the submission node e.g.
>
> module load tensorflow
> srun myprogram
>
> So they get the tens
> Also FWIW, in setting-up the 17.11 on CentOS 7, I encountered these minor
> gotchas:
>
> - Your head/login node's sshd MUST be configured with "X11UseLocalhost no" so
> the X11 TCP port isn't bound to the loopback interface alone
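A quick check of the effective setting (hedged; the path assumes the stock sshd_config location):
$ grep -i X11UseLocalhost /etc/ssh/sshd_config
X11UseLocalhost no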
Anyone who read this, please ignore. Red herring thanks to slu
FWIW, though the "--x11" flag is available to srun in 17.11.0, neither the man
page nor the built-in --help mention its presence or how to use it.
Also FWIW, in setting-up the 17.11 on CentOS 7, I encountered these minor
gotchas:
- Your head/login node's sshd MUST be configured with "X11UseLoc
• GraceTime: Specifies a time period for a job to execute after it is selected
to be preempted. This option can be specified by partition or QOS using the
slurm.conf file or database respectively. This option is only honored if
PreemptMode=CANCEL. The GraceTime is specified in seconds and the default value is zero.