The line that you list from your slurm.conf shows the "course" partition being set as the default partition. On our system, though, the sinfo command marks the default partition with a * at the end, and your output doesn't show that, so I'm wondering if you've got another partition that is getting defined as the default.
I believe that fragmentation only happens on routers when passing traffic from
one subnet to another. Since this traffic was all on a single subnet, there was
no router involved to fragment the packets.
Mike
On 11/26/18 1:49 PM, Kenneth Roberts wrote:
D’oh!
The compute nodes had different MTU o
Wes,
You didn't list the Slurm command that you used to get your interactive session. In particular, did you ask Slurm for access to all 14 cores?
Also note that since Matlab is using threads to distribute work among cores, you don't want to ask for multiple tasks (-n or --ntasks), as that will give you multiple tasks rather than multiple CPUs for a single task. Use -c or --cpus-per-task instead.
If you want to detect lost DIMMs or anything like that, use a Node Health Check script. I recommend and use this one: https://github.com/mej/nhc
It has an option to generate a configuration file that will watch way more than you probably need, but if you want to know if something on your nodes has changed, it will catch it.
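As a rough sketch of how it hooks into Slurm (paths and the interval are assumptions, adjust for your install):
# run once on a healthy node to generate a baseline /etc/nhc/nhc.conf
nhc-genconf
# slurm.conf: have slurmd run the check periodically
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300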
Andreas,
Look again. I just looked and a commit to the source code was posted to
the bug yesterday afternoon. It looks like that patch applies to the
cgroup plugin. It won't show up until the next release, but at least
there is a fix available.
Mike Robbert
On 1/15/19 11:43 PM, Henkel, Andreas wrote:
Colas,
You need to use the legacy spec file from the contribs directory:
ls -l slurm-18.08.5/contribs/slurm.spec-legacy
-rw-r--r-- 1 mrobbert mrobbert 38574 Jan 30 11:59
slurm-18.08.5/contribs/slurm.spec-legacy
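Something like this should work to build with it (this assumes the tree is already unpacked as in the listing above; adjust the version for your download):
# swap the legacy spec in for the default one, repack, and build the RPMs
cp slurm-18.08.5/contribs/slurm.spec-legacy slurm-18.08.5/slurm.spec
tar cjf slurm-18.08.5.tar.bz2 slurm-18.08.5
rpmbuild -ta slurm-18.08.5.tar.bz2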
Mike
On 2/11/19 9:26 AM, Colas Rivière wrote:
> Hello,
>
> I'm trying to update Slurm
I was curious what startup method other sites are using with Intel MPI?
According to the documentation srun with Slurm's PMI is the recommended
way ( https://slurm.schedmd.com/mpi_guide.html#intel_srun ).
Intel has supposedly supported PMI-2 since their 2017 release, and that is what SchedMD suggested we use in a recent bug report to them.
Samuel wrote:
On Monday, 29 April 2019 8:47:49 AM PDT Michael Robbert wrote:
Intel has supposedly supported PMI-2 since their 2017 release and that
is what SchedMD suggested we use in a recent bug report to them, but I
found that it no longer works in Intel MPI 2019. I opened a bug report
with
The more flexible way to do this is with QoS (PreemptType=preempt/qos). You'll need to have accounting enabled, and you'll probably want qos listed in AccountingStorageEnforce. Once you do that, you create a "shared" QoS for the scavenger jobs and a QoS for each group that buys into resources. Assign the appropriate QoS to each group's associations.
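A rough sketch of the pieces (QoS names, priorities, and the preempt mode are only examples):
# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE
AccountingStorageEnforce=associations,limits,qos

# sacctmgr: a low-priority scavenger QoS plus an owner QoS allowed to preempt it
sacctmgr add qos scavenger Priority=10
sacctmgr add qos group_a Priority=100 Preempt=scavenger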
It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf.
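For example, the node definition would look something like this (the socket/core counts here are placeholders for whatever your hardware actually has):
NodeName=node003 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN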
Mike
From: slurm-users on behalf of Robert Kudyba
Reply-To: Slurm User Community
at 1:43 PM Michael Robbert wrote:
It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf.
Nice find. node003 has hyper-threading enabled but
Manuel,
You may want to instruct your users to use '-c' or '--cpus-per-task' to define the number of CPUs that they need. Please correct me if I'm wrong, but I believe that will restrict the job to a single node, whereas '-n' or '--ntasks' is really for multi-process jobs, which can be spread among multiple nodes.
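In batch-script terms, the difference is roughly this (the numbers are just examples):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8   # 8 CPUs for one task, kept on a single node
# versus
#SBATCH --ntasks=8          # 8 separate tasks, which may be spread across nodes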
You’re on the right track with the DRAIN state. The more specific answer is in the “Reason=” description on the last line.
It looks like your node has less memory than what you’ve defined for it in slurm.conf.
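A quick way to compare the two (the node name is just an example):
slurmd -C                                                 # run on the node; prints the RealMemory it detects
scontrol show node node01 | grep -iE 'RealMemory|Reason'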
Mike
From: slurm-users on behalf of Joakim Hove
Reply-To: Slurm User
You have defined both of your partitions with “Default=YES”, but Slurm can have only one default partition. You can see from the * on the compute partition in your sinfo output that Slurm selected that one as the default. When you use srun or sbatch, it will only look at the default partition unless you explicitly request a different one.
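So either keep Default=YES on only one PartitionName line in slurm.conf, or have the jobs request the other partition explicitly, e.g. (partition and script names are placeholders):
sbatch --partition=other job.sh
srun --partition=other --pty bash -i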
Those files in /run/systemd/generator.late/ look like they came from older System V init scripts. Can you check to make sure you don't have a slurm service script in /etc/init.d/?
Also, note that there is a difference between the "slurm" service and the "slurmd" service. The former was the older n
Peter,
I believe that the answer to your database question is that you don't have two
MySQL/MariaDB servers running at the same time. The only way that I know of to
run MySQL/MariaDB in an active-active setup, which is what you appear to be
describing, is with replication. The other setup is to
We saw something that sounds similar to this. See this bug report:
https://bugs.schedmd.com/show_bug.cgi?id=10196
SchedMD never found the root cause. They thought it might have something to do
with a timing problem on Prolog scripts, but the thing that fixed it for us was
to set GraceTime=0 on
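GraceTime is set per partition (or per QoS); in slurm.conf the partition form looks something like this (the partition name and node list are placeholders):
PartitionName=preemptable Nodes=node[01-10] GraceTime=0 PreemptMode=REQUEUE State=UP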
I haven't tried a configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a DNS lookup of n26 from the login node?
The way that non-interactive batch jobs are started may not require that, but I
believe that it is required for interactive jobs.
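Something like this from the login node would tell you (n26 is the node name from your output):
getent hosts n26    # checks /etc/hosts as well as DNS
nslookup n26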
Mike Robbert
I think that you want to use the output of slurmd -C, but if that isn’t telling you the truth then you may not have built Slurm with the correct libraries. I believe that you need to build with hwloc in order to get the most accurate details of the CPU topology. Make sure you have hwloc-devel installed when you build.
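For reference, slurmd -C prints a line you can paste straight into slurm.conf; the values below are only an example of the shape of the output, not your hardware:
$ slurmd -C
NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257000
UpTime=12-03:45:10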
I’m wondering if others in the Slurm community have any tips or best practices
for the development and testing of Lua job submit plugins. Is there anything
that can be done prior to deployment on a production cluster that will help to
ensure the code is going to do what you think it does or at t
I can confirm that we do preemption based on partition for one of our clusters. I will say that we are not using time-based partitions; ours are always up, and they are based on group node ownership. I wonder if Slurm is refusing to preempt a job in a DOWN partition. Maybe try leaving the partition up and see if preemption works then.
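If you want to test that quickly, you can flip the partition state by hand (the partition name is a placeholder):
scontrol update PartitionName=owner State=UP
# ...and back again when you're done
scontrol update PartitionName=owner State=DOWN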
I doubt that it is a problem with your script and suspect that there is some weird interaction with scancel on interactive jobs. If you wanted to get to the bottom of that, I’d suggest disabling the prolog and testing by manually cancelling some interactive jobs.
Another suggestion is to try a compl
e could look at the number of GPUs passed?), but where do I set up that function and where do I call it?
Thanks,
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556
On Wed, Aug 25
It looks like it could be some kind of network problem, possibly DNS. Can you ping and do DNS resolution for the host involved?
What does slurmctld.log say? How about slurmd.log on the node in question?
Mike
From: slurm-users on behalf of Durai Arasan
Date: Thursday, January 20, 2022 at 08
They moved Arbiter2 to Github. Here is the new official repo:
https://github.com/CHPC-UofU/arbiter2
Mike
On 2/7/22, 06:51, "slurm-users" wrote:
Hi,
I've just noticed that the repository https://gitlab.chpc.utah.edu/arbiter2
seems to be down. Does someone know more?
Thank you!
Best,
Stefan
Am D
Jim,
I’m glad you got your problem solved. Here is an additional tip that will make it easier to fix in the future. You don’t need to put scontrol into a loop; the NodeName parameter will take a node range expression. So, you can use NodeName=sjc01enadsapp[01-08]. A SysAdmin in training saw me do
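For the record, the one-liner looks something like this (the State value is only an example):
scontrol update NodeName=sjc01enadsapp[01-08] State=RESUME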
Don’t forget about munge. You need to have munged running with the same key as the rest of the cluster in order to authenticate.
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
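A quick sanity check for munge between two hosts (the hostname is a placeholder):
systemctl status munge
munge -n | ssh node01 unmunge    # should decode cleanly if the keys match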
Have you looked at the High Throughput Computing Administration Guide: https://slurm.schedmd.com/high_throughput.html
In particular, for this problem it may help to look at the SchedulerParameters. I believe that the scheduler defaults are very conservative and it will stop looking for jobs to run pretty quickly.
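The knobs live in slurm.conf under SchedulerParameters; the values below are only an illustration of the kind of tuning the guide describes, not a recommendation:
SchedulerParameters=default_queue_depth=1000,bf_max_job_test=2000,bf_continue,batch_sched_delay=10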
I believe that the error you need to pay attention to for this issue is this line:
Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
It looks like your compute node's clock is a full day ahead of your controller node: Dec. 2 instead of Dec. 1. The clocks on all of your machines need to be kept in sync.
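A couple of quick checks on each machine (chrony is an assumption; your distro may use ntpd instead):
timedatectl        # current time and whether NTP is active
chronyc tracking   # offset from the time source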
ensorsTemp=n/s
>
> Whereas this command shows only one node on which the job is running:
>
> (base) [nousheen@nousheen slurm]$ squeue -j
>   JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
>     109     debug SRBD-4 nousheen
George,
I haven't tested or used this, but why won't afterany do what you want?
afterany:job_id[:jobid...]
This job can begin execution after the specified
jobs have terminated.
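For example (the job ID and script name are placeholders):
sbatch --dependency=afterany:12345 reaper.sh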
Mike
On 1/19/18 11:09 AM, Hwa, George wrote:
I have a “reaper” job that harve
I think that the piece you may be missing is --pty, but I also don't
think that salloc is necessary.
The simplest command that I typically use is:
srun -N1 -n1 --pty bash -i
Mike
On 3/9/18 10:20 AM, Andy Georges wrote:
Hi,
I am trying to get interactive jobs to work from the machine we
Mahmood,
You need to put all the options to srun before the executable that you
want to run, which in this case is /bin/bash. So, it should look more like:
srun -l -a em1 -p IACTIVE --mem=4GB --pty -u /bin/bash
The way you have it, most of your srun options are being interpreted as bash arguments.