Hello list,
My cluster usually has a pretty heterogeneous job load and spends a lot of the
time memory bound. Occasionally I have users who submit 100k+ short,
low-resource jobs. Despite having several thousand free cores and enough RAM
to run the jobs, the backfill scheduler would never back
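For a workload like this the usual suspects are the backfill knobs in
slurm.conf. A rough sketch, with values that are only illustrative guesses:

  SchedulerType=sched/backfill
  SchedulerParameters=bf_continue,bf_interval=30,bf_max_job_test=5000,bf_max_job_user=200

bf_max_job_test caps how many queued jobs a backfill cycle will even look at,
which is often why a flood of small jobs never gets reached.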
Hi,
Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL
nodes that can get rebooted when their memory or cluster modes are changed by
users. I never heard any complaints when running Slurm v16.05.10, but I've
seen a number of issues since our upgrade a couple months
> On Oct 10, 2018, at 12:07 PM, Noam Bernstein
> wrote:
>
>
> slurmd -C confirms that indeed slurm understands the architecture, so that’s
> good. However, removing the CPUs entry from the node list doesn’t change
> anything. It still drains the node. If I just remove _everything_ having t
On 11/10/18 01:27, Christopher Benjamin Coffey wrote:
That is interesting. It is disabled in 17.11.10:
Yeah, I seem to remember seeing a commit that disabled it in 17.11.x.
I don't think it's meant to work before 18.08.x (which is what the
website will be talking about).
All the best,
Chris
> On Oct 10, 2018, at 11:40 AM, Eli V wrote:
>
> Don't think you need CPUs in slurm.conf for the node def, just
> Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 for example, and the
> slurmctld does the math for # cpus. Also slurmd -C on the nodes is
> very useful to see what's being autodetected.
Don't think you need CPUs in slurm.conf for the node def, just
Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 for example, and the
slurmctld does the math for # cpus. Also slurmd -C on the nodes is
very useful to see what's being autodetected.
On Wed, Oct 10, 2018 at 11:34 AM Noam Bernstein
wrote:
>
Hi all - I’m new to slurm, and in many ways it’s been very nice to work with,
but I’m having an issue trying to properly set up thread/core/socket counts on
nodes. Basically, if I don’t specify anything except CPUs, the node is
available, but doesn’t appear to know about cores and hyperthreading
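A concrete node line along the lines Eli V suggests, with invented hostnames
and counts; running slurmd -C on the node prints the detected values in this
same format to copy from:

  NodeName=node[001-016] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN

With Sockets/CoresPerSocket/ThreadsPerCore given, slurmctld derives the CPU
count itself, so a separate CPUs= entry is not needed.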
That is interesting. It is disabled in 17.11.10:
static bool _enable_pack_steps(void)
{
    bool enabled = false;
    char *sched_params = slurm_get_sched_params();

    if (sched_params && strstr(sched_params, "disable_hetero_steps"))
        enabled = false;
    else if (
I got this same error when testing on older updates (17.11?). Try the
Slurm-18.08 branch or master. I'm testing 18.08 now and get this:
[slurm@trek6 mpihello]$ srun -phyper -n3 --mpi=pmi2 --pack-group=0-2
./mpihello-ompi2-rhel7 | sort
srun: job 643 queued and waiting for resources
srun: job 64
Hi Christopher,
We hit some problems at LANL trying to use this SLURM feature.
At the time, I think SchedMD said there would need to be fixes
to the SLURM PMI2 library to get this to work.
What version of SLURM are you using?
Howard
--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor
Hello,
Thank you for your useful replies. It is certainly not anywhere near as
difficult as I initially thought. We should be able to start some tests later
this week.
Best regards,
David
From: slurm-users on behalf of Roche
Ewan
Sent: 10 October 2018 08:07
To: S
Hi,
I have set up Slurm and enabled X11 forwarding (native).
I connect to a node from a submission node:
srun --ntasks-per-node=1 --mem 100 --x11 --pty bash
I am connected to the node.
In the debug logs, I can see that the X11 setup is OK:
...
[2018-10-10T09:27:48.142] [131.extern] X11 forwarding
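For reference, native X11 forwarding (17.11 and later) relies on the extern
step, so slurm.conf needs something like:

  PrologFlags=x11

plus working SSH/X authentication between the submission host and the compute
node; the exact requirements beyond PrologFlags are an assumption here, so
check the docs for your version.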
On 10/10/18 05:07, Christopher Benjamin Coffey wrote:
Yet, we get an error: " srun: fatal: Job steps that span multiple
components of a heterogeneous job are not currently supported". But
the docs seem to indicate it should work?
Which version of Slurm are you on? It was disabled by default i
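Judging by the _enable_pack_steps() snippet quoted earlier in the thread, the
switch lives in SchedulerParameters; on a version where it is supported,
turning it on would look something like the line below (the flag name is
inferred from that snippet's disable_hetero_steps counterpart, so treat it as
an assumption):

  SchedulerParameters=enable_hetero_steps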
Hello David,
for this use case we have two partitions - serial and parallel (the default).
Our lua looks like:
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- As the default partition is set later by SLURM we need to set it here
    -- using the same logic
    if job_desc.partitio
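The quoted lua is cut off; below is a self-contained sketch of the same idea,
with the partition names and the single-task threshold being assumptions
rather than the actual site policy:

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- As the default partition is set later by Slurm, apply the routing
    -- logic here only when the user did not request a partition.
    if job_desc.partition == nil then
        -- Assumption: single-task jobs go to "serial", everything else
        -- to the default "parallel" partition.
        if job_desc.num_tasks == 1 then
            job_desc.partition = "serial"
        else
            job_desc.partition = "parallel"
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end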