On 06/13/2018 01:59 PM, Prentice Bisbal wrote:
In my environment, we have several partitions that are 'general
access', with each partition providing different hardware resources
(IB, large mem, etc). Then there are other partitions that are for
specific departments/projects. Mo
The problem was elsewhere. When I confirmed it was the reservation (with the
help of this list/you), I wanted to break something.
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 06/15/2018 01:26 PM, Ryan Novosielski wrote:
That’s great news — this is is
Slurm-users,
I'm still learning Slurm, so I have what I think is a basic question.
Can you restart slurmd on nodes where jobs are running, or will that
kill the jobs? I ran into the same problem as described here:
https://bugs.schedmd.com/show_bug.cgi?id=3535
I believe the best way to fix th
51 PM, Chris Harwell wrote:
It is possible, but double-check your config for timeouts first.
On Fri, Jul 27, 2018, 15:31 Prentice Bisbal wrote:
Slurm-users,
I'm still learning Slurm, so I have what I think is a basic
question.
Can you
Yesterday I upgraded from 18.08.3 to 18.08.4. After the upgrade, I found
that batch scripts named "batch" are being rejected. Simply changing the
script name fixes the problem. For example:
$ sbatch batch
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Unspecified error
$ mv batch nobatchy
$ sbatch nobatchy
Submitted batch job 172174
I hope this helps.
Ahmet M.
On 19.12.2018 at 21:54, Prentice Bisbal wrote:
Once I saw that, I understood what the problem was,
Yesterday I upgraded from 18.08.3 to 18.08.4. After the upgrade, I
found that batch scripts na
raded to
18.08.
--
Prentice
On 6/18/18 7:28 AM, Bjørn-Helge Mevik wrote:
Prentice Bisbal writes:
if job_desc.pn_min_mem > 65536 then
    slurm.user_msg("NOTICE: Partition switched to mque due to memory requirements.")
    job_desc.partition = 'mque'
    job_desc.qos
It appears that 'slurmd -C' is not returning the correct information for
some of the systems in my very heterogeneous cluster.
For example, take the node dawson081:
[root@dawson081 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=dawson081 CPUs=32 Boards=1 SocketsPerBoard=4
uration (2 sockets), so the
'correct' physical configuration had been causing those errors.
Prentice
On 1/17/19 3:09 PM, Prentice Bisbal wrote:
It appears that 'slurmd -C' is not returning the correct information
for some of the systems in my very heterogeneous cluster.
For example
From https://slurm.schedmd.com/topology.html:
Note that compute nodes on switches that lack a common parent switch
can be used, but no job will span leaf switches without a common
parent (unless the TopologyParam=TopoOptional option is used). For
example, it is legal to remove the line "Switch
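For reference, a minimal topology.conf sketch (switch and node names here are purely hypothetical):
SwitchName=leaf1 Nodes=node[001-032]
SwitchName=leaf2 Nodes=node[033-064]
SwitchName=spine Switches=leaf[1-2]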
And a follow-up question: Does topology.conf need to be on all the
nodes, or just the slurm controller? It's not clear from that web page.
I would assume only the controller needs it.
Prentice
On 1/17/19 4:49 PM, Prentice Bisbal wrote:
From https://slurm.schedmd.com/topology.html:
Note
very
heterogeneous cluster.
I may have to provide a larger description of my hardware/situation to
the list and ask for suggestions on how to best handle the problem.
Prentice
On Jan 17, 2019, at 4:52 PM, Prentice Bisbal wrote:
And a follow-up question: Does topology.conf need to be on all the
s promised a warning about that in the
future in a conversation with SchedMD.
> On Jan 17, 2019, at 4:52 PM, Prentice Bisbal wrote:
>
> And a follow-up question: Does topology.conf need to be on all
the nodes, or just the slurm controlle
Ryan,
Thanks for looking into this. I hadn't had a chance to revisit the
documentation since posing my question. Thanks for doing that for me.
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 1/18/19 2:58 PM, Ryan Novosielski wrote:
ing to address,
and different possible approaches, and then get this list's feedback.
Prentice
On 1/18/19 11:53 AM, Kilian Cavalotti wrote:
On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal wrote:
Note that if you care about node weights (eg. NodeName=whatever001 Weight=2,
etc. in slurm
Slurm Users,
I would like your input on the best way to configure Slurm for a
heterogeneous cluster I am responsible for. This e-mail will probably be
a bit long to include all the necessary details of my environment so
thanks in advance to those of you who read all of it!
The cluster I supp
ed to use the resources.
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 1/22/19 3:38 PM, Prentice Bisbal wrote:
Slurm Users,
I would like your input on the best way to configure Slurm for a
heterogeneous cluster I am responsible for. This e-mail
d the logic of the feature/constraint system to be
quite elegant for meeting complex needs of heterogeneous systems.
Best,
Cyrus
On 1/22/19 2:49 PM, Prentice Bisbal wrote:
I left out a *very* critical detail: One of the reasons I'm looking
at revamping my Slurm configuration is that my us
. Any job that needed
that license would stay queued while other jobs that used a different
file system would keep humming along.
Anyway, feel free to ping off-list too if there are other ideas that
you'd like to spitball about.
Best,
Cyrus
On 1/23/19 9:00 AM, Prentice Bisbal wrote:
Cyrus
How does one assign a QOS to a partition? This is mentioned in several
different places in the Slurm documentation, but nowhere does it explain
exactly how to do this.
You can assign a QOS to a partition in slurm.conf like this:
PartitionName=mypartition Nodes=node[001-100] QOS=myqos
But that do
= PARTITION_DEBUG) then
    slurm.log_info("::slurm_job_submit partition DEBUG. Original QOS: %s, new QOS: %s", job_desc.qos, QOS_DEBUG)
    job_desc.qos = QOS_DEBUG
    slurm.log_user("Setting QoS=%s for this job.", QOS_DEBUG)
end
[...]
Hope this helps.
Miguel
On 29 Jan 2019, at 16:27, Prentice Bisbal
Can anyone see an error in this conditional in my job_submit.lua?
if ( job_desc.user_id == 28922 or job_desc.user_id == 41266 ) and
   ( job_desc.partition == 'general' or job_desc.partition == 'interruptible' ) then
    job_desc.qos = job_desc.partition
    return slurm.SUCCESS
e
all
work fine. I really hope someone on this list has more sensitive
eyeballs than I do.
Prentice
On 2/5/19 8:12 AM, Marcus Wagner wrote:
Hmm...,
no, I was wrong. IT IS 'user_id'.
Now I'm a bit dazzled
Marcus
On 2/4/19 11:27 PM, Prentice Bisbal wrote:
Can anyone see an
it the output to a specific user as below:
if job_desc.user_name == "mercan" then
    slurm.log_user("job_desc.user_id=")
    slurm.log_user(job_desc.user_id)
    slurm.log_user("job_desc.partition=")
    slurm.log_user(job_desc.partition)
end
Ahmet M.
On 5.02.20
if for the username the debug entry is
set to 1. It then prints the debug message.
So you can use something like
debug("This is a test message")
and only the users, whose debug flag is set, see this message.
As long as you use "debug" for debugging messages, the "normal" u
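A minimal sketch of such a helper, assuming it lives in job_submit.lua (the user table is hypothetical):
local debug_users = { mercan = 1 }   -- per-user debug flags (hypothetical usernames)

local function debug(job_desc, msg)
    -- only show the message to users whose debug flag is set
    if debug_users[job_desc.user_name] == 1 then
        slurm.log_user("%s", msg)
    end
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    debug(job_desc, "This is a test message")
    return slurm.SUCCESS
end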
Also, make sure no 3rd party packages installed software that installs
files in the systemd directories. The legacy spec file still checks for
systemd files to be present:
if [ -d /usr/lib/systemd/system ]; then
    install -D -m644 etc/slurmctld.service \
        $RPM_BUILD_ROOT/usr/lib/systemd/system/s
user_id, I knew my problem had to be elsewhere.
Prentice
On 2/4/19 5:27 PM, Prentice Bisbal wrote:
Can anyone see an error in this conditional in my job_submit.lua?
if ( job_desc.user_id == 28922 or job_desc.user_id == 41266 ) and
( job_desc.partition == 'general' o
--ntasks-per-node is meant to be used in conjunction with the --nodes
option. From https://slurm.schedmd.com/sbatch.html:
*--ntasks-per-node*=
Request that /ntasks/ be invoked on each node. If used with the
*--ntasks* option, the *--ntasks* option will take precedence and
the *--ntasks-
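A hedged example of the intended pairing (script name hypothetical):
# 2 nodes x 8 tasks per node = 16 tasks total
sbatch --nodes=2 --ntasks-per-node=8 myjob.sh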
I just set this up a couple of weeks ago myself. Creating two partitions
is definitely the way to go. I created one partition, "general" for
normal, general-access jobs, and another, "interruptible" for
general-access jobs that can be interrupted, and then set PriorityTier
accordingly in my slu
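A hedged slurm.conf sketch of that layout (node range and PriorityTier values are hypothetical):
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=general       Nodes=node[001-100] Default=YES PriorityTier=10
PartitionName=interruptible Nodes=node[001-100] PriorityTier=1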
On 2/20/19 12:08 AM, Marcus Wagner wrote:
Hi Prentice,
On 2/19/19 2:58 PM, Prentice Bisbal wrote:
--ntasks-per-node is meant to be used in conjunction with --nodes
option. From https://slurm.schedmd.com/sbatch.html:
*--ntasks-per-node*=
Request that /ntasks/ be invoked on each node
On 2/22/19 9:53 AM, Patrice Peterson wrote:
Hello,
it's a little inconvenient that the title tag of all SLURM doc pages only says
"Slurm Workload Manager". I usually have tabs to many SLURM doc pages
open and it's difficult to differentiate between them all.
Would it be possible to change the
On 2/22/19 12:54 AM, Chris Samuel wrote:
On Thursday, 21 February 2019 8:20:36 AM PST נדב טולדו wrote:
Yeah I have. Before I installed pbis and introduced lsass.so, the slurm module
worked well. Is there any way to debug?
I am seeing in syslog that the slurm module is adopting into the job contex
On 3/7/19 4:39 PM, Tim Wickberg wrote:
-- docs - change HTML title to include the page title or man page name.
Thanks for this change!
--
Prentice
Slurm is trying to kill the job that is exceeding its time limit, but
the job doesn't die, so Slurm marks the node down because it sees this
as a problem with the node. Increasing the value for GraceTime or
KillWait might help:
*GraceTime*
Specifies, in units of seconds, the preemption
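A hedged sketch of where those knobs live (values hypothetical); note that KillWait is a global slurm.conf setting, while GraceTime is set per partition (or per QOS) and applies to preemption:
KillWait=60
PartitionName=general Nodes=node[001-100] GraceTime=120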
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading
SLURM, which I understand.
As a system admin who manages clusters myself, I don't understand this.
Our job is
On 3/21/19 11:49 AM, Ryan Novosielski wrote:
On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLURM, which I
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 3/21/19 12:21 PM, Loris Bennett wrote:
Hi Ryan,
Ryan Novosielski writes:
On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20
On 3/21/19 1:56 PM, Ryan Novosielski wrote:
On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote:
Our last cluster only hit around 2.5 million jobs after
around 6 years, so database conversion was never an issue. For sites
with a higher-throughput things may be different, but I would hope tha
On 3/21/19 4:40 PM, Reuti wrote:
On 21.03.2019 at 16:26, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLURM, which I
understand
Slurm-users,
My users here have developed a GUI application which serves as a GUI
interface to various physics codes they use. From this GUI, they can
submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to
18.08.6-2, and a user has reported a problem when submitting Slurm jobs
t
On 3/21/19 6:56 PM, Reuti wrote:
On 21.03.2019 at 23:43, Prentice Bisbal wrote:
Slurm-users,
My users here have developed a GUI application which serves as a GUI interface
to various physics codes they use. From this GUI, they can submit jobs to
Slurm. On Tuesday, we upgraded Slurm from
On 3/22/19 12:40 PM, Reuti wrote:
On 22.03.2019 at 16:20, Prentice Bisbal wrote:
On 3/21/19 6:56 PM, Reuti wrote:
On 21.03.2019 at 23:43, Prentice Bisbal wrote:
Slurm-users,
My users here have developed a GUI application which serves as a GUI interface
to various physics codes they use
ugh I confess I am unable to
think of any reasonable environmental setting that might cause the
observed symptoms.
On Fri, Mar 22, 2019 at 11:23 AM Prentice Bisbal wrote:
On 3/21/19 6:56 PM, Reuti wrote:
> Am 21.03.2019 um 23:43 schrieb
Chris,
I use that -x switch all the time in other situations. Don't know why I
didn't think of using it in this one. Thanks for reminding me of that.
Prentice
On 3/22/19 1:18 PM, Christopher Samuel wrote:
On 3/21/19 3:43 PM, Prentice Bisbal wrote:
#!/bin/tcsh
Old school script
prior. They told me that they actually plan to
update SLURM but not until late 2019 because they have other things to
do before that. Also, I'm the only one asking for heterogeneous jobs...
Rafael.
On Thu, Mar 21, 2019 at 22:19, Prentice Bisbal wrote:
This is the first place I've had regularly scheduled maintenance, too,
and boy does it make life easier. In most of my previous jobs, it was a
small enough environment that it wasn't necessary.
On 3/22/19 1:57 PM, Christopher Samuel wrote:
On 3/22/19 10:31 AM, Prentice Bisbal wro
On 3/23/19 2:16 PM, Sharma, M D wrote:
Hi folks,
By default slurm allocates the whole node for a job (even if it
specifically requested a single core). This is usually taken care of
by adding SelectType=select/cons_res along with an appropriate
parameter such as SelectTypeParameters=CR_Core_
On 3/25/19 8:09 AM, Mahmood Naderan wrote:
Hi
Is it possible to submit a multinode mpi job with the following config:
Node1: 16 cpu, 90GB
Node2: 8 cpu, 20GB
?
Regards,
Mahmood
Yes:
sbatch -n 24 -w Node1,Node2
That will allocate 24 cores (tasks, technically) to your job, and only
use Node
On 3/27/19 11:25 AM, Christopher Samuel wrote:
On 3/27/19 8:07 AM, Prentice Bisbal wrote:
sbatch -n 24 -w Node1,Node2
That will allocate 24 cores (tasks, technically) to your job, and
only use Node1 and Node2. You did not mention any memory requirements
of your job, so I assumed memory is
On 3/28/19 1:25 PM, Reuti wrote:
Hi,
On 22.03.2019 at 16:20, Prentice Bisbal wrote:
On 3/21/19 6:56 PM, Reuti wrote:
On 21.03.2019 at 23:43, Prentice Bisbal wrote:
Slurm-users,
My users here have developed a GUI application which serves as a GUI interface
to various physics codes
the dev stated that they’d rather keep that warning than fixing the issue, so
I’m not sure if that’ll be enough to convince them.
Anyone else as disappointed by this as I am? I get that it's too late to
add something like this to 17.11 or 18.08, but it seems like SchedMD
isn't even interested i
15:33, Prentice Bisbal wrote:
the dev stated that they’d rather keep that warning than fixing the issue, so
I’m not sure if that’ll be enough to convince them.
Anyone else as disappointed by this as I am? I get that it's too late to add
something like this to 17.11 or 18.08, but it seems li
Sergi,
I'm working with Bill on this project. Is all the hardware
identification/mapping and task affinity working as expected/desired
with the Power9? I assume your answer implies "yes", but I just want to
make sure.
Prentice
On 4/16/19 10:37 AM, Sergi More wrote:
Hi,
We have a Power9 cl
.
Getting POWERAI running correctly (bugs since fixed in newer release)
and apps properly built and linked to ESSL was the long march.
regards,
s
On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal wrote:
Sergi,
I'm working with Bill on this proje
Mahmood,
What do you see as the problem here? To me, there is no problem and the
scheduler is working exactly has it should. The reason "Resources" means
that there are not enough computing resources available for your job to
run right now, so the job is sitting in the queue in the pending sta
n specified.
I thought that Slurm would dynamically handle that in order to put
more jobs in the running state.
Regards,
Mahmood
On Wed, Apr 17, 2019 at 7:54 PM Prentice Bisbal wrote:
Mahmood,
What do you see as the problem here? To me, there i
Slurm-users,
Is there a way to increase a job's priority based on the resources or
constraints it has requested?
For example, we have a very heterogeneous cluster here: Some nodes only
have 1 Gb Ethernet, some have 10 Gb Ethernet, and others have DDR IB. In
addition, we have some large memory
Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
On Apr 18, 2019, at 5:20 PM, Prentice Bisbal wrote:
Slurm-users,
Is there a way to increase a job's prio
ble nodes they fit in, and the larger or more feature-rich nodes
have a kind of soft reservation either for large jobs or for busy times.
Cheers,
Chris
- Original Message -
From: "Prentice Bisbal"
To: slurm-users@lists.schedmd.com
Sent: Friday, April 19, 2019 11:27:08 AM
Subjec
smaller memory jobs to go only to low memory nodes and keep
large memory nodes free from trash jobs.
The disadvantage is that large mem nodes would wait idle if only low
mem jobs are in the queue.
cheers,
P
- Original Message -----
From: "Prentice Bisbal"
To: slurm-users@li
On 4/23/19 2:47 AM, Mahmood Naderan wrote:
Hi,
How can I change the job distribution policy? Since some nodes are
running non-slurm jobs, it seems that the dispatcher isn't aware of
system load. Therefore, it assumes that the node is free.
I want to change the policy based on the system load
Here's how we handle this here:
Create a separate partition named debug that also contains that node.
Give the debug partition a very short timelimit, say 30 - 60 minutes.
Long enough for debugging, but too short to do any real work. Make the
priority of the debug partition much higher than t
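A hedged slurm.conf sketch of that setup (node names, limits, and tiers hypothetical):
PartitionName=debug   Nodes=node001       MaxTime=00:30:00   PriorityTier=100
PartitionName=general Nodes=node[001-100] MaxTime=2-00:00:00 PriorityTier=10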
We're running nscd on all nodes, with an extremely stable list of
users/accounts, so I think we should be good here.
Don't bet on it. I've had issues in the past with nscd in similar
situations to this. There's a reason that daemon has a "paranoid" option.
Hostname should be completely local
I see two separate, unrelated problems here:
Problem 1:
Warning: untrusted X11 forwarding setup failed: xauth key data not
generated
What have you done to investigate this xauth problem further?
I know there have been discussions about this problem in the past on
this mailing list. Did you
In this case, I would run LINPACK on each generation of node (either the
full node or just one core), and then somehow normalize performance. I
would recommend using the performance of a single core of the slowest
node as your basis for normalization so it has a multiplier of 1, and
then the n
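A worked example of that normalization (the GFLOPS numbers are invented for illustration):
slowest core:  10 GFLOPS  ->  multiplier 10/10 = 1.0
mid-gen core:  20 GFLOPS  ->  multiplier 20/10 = 2.0
newest core:   35 GFLOPS  ->  multiplier 35/10 = 3.5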
I have a strange issue:
sreport is showing 100% utilization for our cluster every day since June
18. What is interesting about this is June 18th was our last maintenance
outage, when all the nodes were rebooted, including our slurm server
which runs both slurmdbd and slurmctld. Has anyone else
Slurm users,
I have created a partition named 'general' that should allow the QOSes
'general' and 'debug':
PartitionName=general Default=YES AllowQOS=general,debug Nodes=.
However, when I try to request that QOS, I get an error:
$ salloc -p general -q debug -t 00:30:00
salloc: error: Job submi
I should add that I still get this error even when I remove the
"AllowQOS" attribute from the partition definition altogether:
$ salloc -p general -q debug -t 00:30:00
salloc: error: Job submit/allocate failed: Invalid qos specification
Prentice
On 7/15/19 2:22 PM, Prentice Bi
wrote:
On 7/15/19 11:22 AM, Prentice Bisbal wrote:
$ salloc -p general -q debug -t 00:30:00
salloc: error: Job submit/allocate failed: Invalid qos specification
what does:
scontrol show part general
say?
Also, does the user you're testing as have access to that QOS?
All the best,
Chris
sacctmgr. Right now I'd say if you did
sacctmgr show user withassoc that the QoS you're attempting to
use is NOT listed as part of the association.
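A hedged example of both the check and the fix (username and QOS name are placeholders):
sacctmgr show user someuser withassoc format=User,Account,QOS
sacctmgr modify user someuser set QOS+=debug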
On Mon, Jul 15, 2019 at 2:53 PM Prentice Bisbal wrote:
Slurm users,
I have created a part
Account=unix QOS=debug
I will go stand in the corner now...
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 7/15/19 3:18 PM, Prentice Bisbal wrote:
That explanation makes perfect sense, but after adding debug to my
list of QOSes in my associations
Remove this line:
#SBATCH --nodes=1
Slurm assumes you're requesting the whole node. --ntasks=1 should be
adequate.
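A hedged sketch of the suggested request (assuming a GRES named "gpu" and a placeholder program name):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
srun ./my_gpu_program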
On 11/7/19 4:19 PM, Mike Mosley wrote:
Greetings all:
I'm attempting to configure the scheduler to schedule our GPU boxes
but have run into a bit of a snag.
I have a box wi
Is there any way to see how much a job used the GPU(s) on a cluster
using sacct or any other slurm command?
--
Prentice
/19 1:48 PM, Ryan Novosielski wrote:
Do you mean akin to what some would consider "CPU efficiency" on a CPU job? "How
much... used" is a little vague.
From: slurm-users on behalf of Prentice
Bisbal
Sent: Thursday, November 14,
Thanks!
Prentice
On 11/15/19 6:58 AM, Janne Blomqvist wrote:
On 14/11/2019 20.41, Prentice Bisbal wrote:
Is there any way to see how much a job used the GPU(s) on a cluster
using sacct or any other slurm command?
We have created
https://github.com/AaltoScienceIT/ansible-role-sacct_gpu/ as a
Does Slurm provide an option to allow developers to submit jobs right
from their own PCs?
Yes. They just need to have the relevant Slurm packages installed, and
the necessary configuration file(s).
Prentice
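A hedged sketch of what a submit-only host typically needs (package names vary by distribution; paths assume the defaults):
yum install slurm munge                             # client commands + munge auth
scp controller:/etc/slurm/slurm.conf /etc/slurm/
scp controller:/etc/munge/munge.key /etc/munge/     # must be the same key cluster-wide
systemctl enable --now munge
sbatch test_job.sh                                  # submission should now work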
On 12/11/19 11:39 PM, Victor (Weikai) Xie wrote:
Hi,
We are trying to setup a tiny
On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote:
The simplest is probably to just have a separate partition that will
only allow job times of 1 hour or less.
This is how our Univa queues used to work, by overlapping the same
hardware. Univa shows available “slots” to the users and we had a
The binaries are probably compiled with the RPATH switch enabled.
https://en.wikipedia.org/wiki/Rpath
Prentice
On 2/18/20 6:18 PM, Dean Schulze wrote:
The ./configure --prefix=... value gets into the Makefiles, which is
no surprise. But it is also getting into the slurmd binary and .o
files.
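A quick, hedged way to confirm an embedded RPATH/RUNPATH in the built binary (path assumed):
readelf -d /usr/sbin/slurmd | grep -Ei 'rpath|runpath'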
Last week I upgraded from Slurm 18.08 to Slurm 19.05. Since that time,
several users have reported to me that they can't submit jobs without
specifying a memory requirement. In a way, this is intended - my
job_submit.lua script checks to make sure that --mem or --mem-per-node
is specified, and
I believe you can do this with QOS, by assigning group limits to the QOS.
Prentice
On 4/8/20 1:38 PM, Renata Maria Dart wrote:
Hi, is there a way to specify a certain number of nodes for a
partition without specifying exactly which nodes to use? For
instance, if I have 100 hosts and would lik
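A hedged sketch of that approach (QOS name, node cap, and node range are hypothetical):
sacctmgr add qos cap40
sacctmgr modify qos cap40 set GrpTRES=node=40
# slurm.conf:
PartitionName=shared Nodes=node[001-100] QOS=cap40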
Slurm-users,
Is there anything special needed to build Slurm 19.05.5 on CentOS 7 with
PMIx support? We have the pmix and pmix-devel packages that come with
CentOS 7 installed, and built the RPMs with the following rpmbuild command:
rpmbuild -ta --with munge --with pam --with lua --with pmix
We are in the process of upgrading to CentOS 7, and have built Slurm
19.05.5 and OpenMPI 4.0.3 for CentOS 7. When I submit a job that launches
using srun, the job appears to be running according to squeue (state =
R), but the program doesn't do anything. I'm testing with a simple
Hello, World progra
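A hedged first diagnostic for this kind of hang (program name hypothetical):
srun --mpi=list                       # shows which PMI/PMIx plugins this Slurm build provides
srun --mpi=pmix -n 2 ./hello_world    # launch with an explicit pmix plugin if listed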
out needing hackish
extra utilities to create and manage cluster-specific passphraseless
key pairs for every single user! :-)
There's a great cookbook online that tells you step-by-step how to set
it up: https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication
HTH!
Michael
-
Ole Holm Nielsen wrote:
Hi Prentice,
Could you kindly elaborate on this statement? Is host-based security
safe inside a compute cluster compared to user-based SSH keys?
Thanks,
Ole
On 09-06-2020 21:26, Prentice Bisbal wrote:
Host-based security is not considered as safe as user-based security,
so
I maintain a very heterogeneous cluster (different processors, different
amounts of RAM, etc.) I have a user reporting the following problem.
He's running the same job multiple times with different input
parameters. The jobs run fine unless they land on specific nodes. He's
specifying --mem=2G
of NJ | Office of Advanced Research Computing - MSB
C630, Newark
`'
On Jul 2, 2020, at 09:53, Prentice Bisbal wrote:
I maintain a very heterogeneous cluster (different processors,
different amounts of RAM, etc.) I have a user reporting the following
problem.
He's running t
^C
--- google.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2026ms
I guess that is related to slurm and srun.
Any idea for that?
Regards,
Mahmood
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
=0 ExtSensorsTemp=n/s
Reason=Reboot ASAP [root@2020-08-06T10:29:22]
Any thoughts as to how to cancel the reboot?
Mike Hanby
mhanby @ uab.edu
Systems Analyst III - Enterprise
IT Research Computing Services
The University of Alabama at Birmingham
--
Prentice Bisbal
Lead Softwa
Max,
You didn't quote the original e-mail so I'm not sure what the original
problem was, or who "you" is.
--
Prentice
On 8/12/20 6:55 AM, Max Quast wrote:
I am also trying to use ucx with slurm/PMIx and get the same error.
Also mpirun with "--mca pml ucx" works fine.
Used versions:
Ubu
Yes, you can do this using Slurm's QOS facility to limit the number of
nodes used simultaneously. For the high-priority partition, you can use
the GrpTRES setting to limit how many nodes or CPUs a QOS can use.
--
Prentice
On 8/17/20 1:13 PM, Gerhard Strangar wrote:
Hello,
I'm wondering if it'
y to try and find
that balance between good utilisation, effective use of the system and reaching
the desired science/research/development outcomes.
Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 10/20/20 3:01 AM, SJTU wrote:
Hi,
We reserved compute node resources in Slurm for specific users and hope they
will make good use of them. But in some cases users forget the '--reservation'
parameter in their job scripts and end up competing with other users outside the reserved nodes.
Is there a recommended
--
Prentic
node, but I didn’t anticipate ntp and munge issues.
Thanks,
Gard
*From: *slurm-users on behalf
of Prentice Bisbal
*Reply-To: *Slurm User Community List
*Date: *Tuesday, October 27, 2020 at 12:22 PM
*To: *"slurm-users@lists.schedmd.com"
*Subject: *Re: [slurm-users] [External] Mun
up a node on a different subnet. I figured it be simple to point
slurm to the new node, but I didn’t anticipate ntp and munge issues.
Thanks,
Gard
*From: *slurm-users
on behalf of Prentice Bisbal
I know I'm very late to this thread, but were/are you using the
--allusers flag to sacct? If not, sacct only returns results for the
user running the command (not sure if this is the case for root - I
never need to run sacct as root). This minor detail tripped me up a few
days ago when I was ex
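A hedged example (dates and fields are illustrative):
sacct --allusers --starttime=2020-12-01 --endtime=2020-12-31 \
      --format=JobID,User,Account,Elapsed,State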
We generally upgrade within 1-2 maintenance windows of a new release
coming out, so within a couple of months of the release being available.
For minor updates, we update at the next maintenance window. At one
point, we were stuck several release behind. Getting all caught up
wasn't that bad. I
Slurm users,
I'm planning on moving slurmctld and slurmdbd to a new host. I know how
to dump the MySQL DB from the old server and import it to the new
slurmdbd host, and I know how to copy the job state directories to the
new host. I plan on doing this during our next maintenance window when
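A hedged sketch of the dump/restore step (database name assumed to be the default slurm_acct_db):
mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql
# on the new slurmdbd host:
mysql slurm_acct_db < slurm_acct_db.sql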
`'
On Jan 15, 2021, at 13:44, Prentice Bisbal wrote:
Slurm users,
I'm planning on moving slurmctld and slurmdbd to a new host. I
know how to dump the MySQL DB from the old server and import it
to the new slurmdbd host, and I know ho
ronment variables into the script or is it a problem due to my
distribution (CentOS-7)???
Thanks.
--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads paye...@umd.edu
5825 University Research Park (301) 405-6135
University of Maryland
College Park, MD 20740