Re: [slurm-users] Checking memory requirements in job_submit.lua

2018-06-14 Thread Prentice Bisbal
On 06/13/2018 01:59 PM, Prentice Bisbal wrote: In my environment, we have several partitions that are 'general access', with each partition providing different hardware resources (IB, large mem, etc). Then there are other partitions that are for specific departments/projects. Mo

Re: [slurm-users] How to check if there's a reservation

2018-06-15 Thread Prentice Bisbal
he problem was elsewhere. When I confirmed it was reservation (with the help of this list/you), I wanted to break something. Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 06/15/2018 01:26 PM, Ryan Novosielski wrote: That’s great news — this is is

[slurm-users] restart slurmd on nodes w/ running jobs?

2018-07-27 Thread Prentice Bisbal
Slurm-users, I'm still learning Slurm, so I have what I think is a basic question. Can you restart slurmd on nodes where jobs are running, or will that kill the jobs? I ran into the same problem as described here: https://bugs.schedmd.com/show_bug.cgi?id=3535 I believe the best way to fix th

Re: [slurm-users] restart slurmd on nodes w/ running jobs?

2018-07-30 Thread Prentice Bisbal
51 PM, Chris Harwell wrote: It is possible, but double check your config for timeouts first. On Fri, Jul 27, 2018, 15:31 Prentice Bisbal wrote: Slurm-users, I'm still learning Slurm, so I have what I think is a basic question. Can you

[slurm-users] 18.08.4 - batch scripts named "batch" getting rejected.

2018-12-19 Thread Prentice Bisbal
Yesterday I upgraded from 18.08.3 to 18.08.4. After the upgrade, I found that batch scripts named "batch" are being rejected. Simply changing the script name fixes the problem. For example: $ sbatch batch sbatch: error: ERROR: A time limit must be specified sbatch: error: Batch job submission f

Re: [slurm-users] 18.08.4 - batch scripts named "batch" getting rejected.

2018-12-20 Thread Prentice Bisbal
error $ mv batch nobatchy $ sbatch nobatchy Submitted batch job 172174 I hope this helps. Ahmet M. 19.12.2018 21:54 tarihinde Prentice Bisbal yazdı: Once I saw that, I understood what the problem was, Yesterday I upgraded from 18.08.3 to 18.08.4. After the upgrade, I found that batch scripts na

Re: [slurm-users] Checking memory requirements in job_submit.lua

2019-01-14 Thread Prentice Bisbal
raded to 18.08. -- Prentice On 6/18/18 7:28 AM, Bjørn-Helge Mevik wrote: Prentice Bisbal writes: if job_desc.pn_min_mem > 65536 then     slurm.user_msg("NOTICE: Partition switched to mque due to memory requirements.")     job_desc.partition = 'mque'     job_desc.qos
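Reconstructed as a complete handler, the rule quoted above would look roughly like this in job_submit.lua (a sketch: the 65536 MB threshold and the 'mque' partition name come from the thread, while the qos assignment is truncated in the archive and assumed here to also be 'mque'):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- pn_min_mem is in MB; note it is per-CPU rather than per-node
        -- if the user requested --mem-per-cpu
        if job_desc.pn_min_mem ~= nil and job_desc.pn_min_mem > 65536 then
            slurm.user_msg("NOTICE: Partition switched to mque due to memory requirements.")
            job_desc.partition = 'mque'
            job_desc.qos = 'mque'
        end
        return slurm.SUCCESS
    end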

[slurm-users] 'slurmd -C' not returning correct information

2019-01-17 Thread Prentice Bisbal
It appears that 'slurmd -C' is not returning the correct information for some of the systems in my very heterogeneous cluster. For example, take the node dawson081: [root@dawson081 ~]# slurmd -C NodeName=dawson081 slurmd: Considering each NUMA node as a socket CPUs=32 Boards=1 SocketsPerBoard=4

Re: [slurm-users] 'slurmd -C' not returning correct information

2019-01-17 Thread Prentice Bisbal
uration (2 sockets), so the 'correct' physical configuration had been causing those errors. Prentice On 1/17/19 3:09 PM, Prentice Bisbal wrote: It appears that 'slurmd -C' is not returning the correct information for some of the systems in my very heterogeneous cluster. For example

[slurm-users] Topology configuration questions:

2019-01-17 Thread Prentice Bisbal
From https://slurm.schedmd.com/topology.html: Note that compute nodes on switches that lack a common parent switch can be used, but no job will span leaf switches without a common parent (unless the TopologyParam=TopoOptional option is used). For example, it is legal to remove the line "Switch

Re: [slurm-users] Topology configuration questions:

2019-01-17 Thread Prentice Bisbal
And a follow-up question: Does topology.conf need to be on all the nodes, or just the slurm controller? It's not clear from that web page. I would assume only the controller needs it. Prentice On 1/17/19 4:49 PM, Prentice Bisbal wrote: From https://slurm.schedmd.com/topology.html: Note

Re: [slurm-users] Topology configuration questions:

2019-01-18 Thread Prentice Bisbal
very heterogeneous cluster. I may have to provide a larger description of my hardware/situation to the list and ask for suggestions on how to best handle the problem. Prentice On Jan 17, 2019, at 4:52 PM, Prentice Bisbal wrote: And a follow-up question: Does topology.conf need to be on all the

Re: [slurm-users] Topology configuration questions:

2019-01-18 Thread Prentice Bisbal
s promised a warning about that in the future in a conversation with SchedMD. > On Jan 17, 2019, at 4:52 PM, Prentice Bisbal wrote: > > And a follow-up question: Does topology.conf need to be on all the nodes, or just the slurm controlle

Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Prentice Bisbal
Ryan, Thanks for looking into this. I hadn't had a chance to revisit the documentation since posing my question. Thanks for doing that for me. Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 1/18/19 2:58 PM, Ryan Novosielski wrote:

Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Prentice Bisbal
ing to address, and different possible approaches, and then get this list's feedback. Prentice On 1/18/19 11:53 AM, Kilian Cavalotti wrote: On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal wrote: Note that if you care about node weights (eg. NodeName=whatever001 Weight=2, etc. in slurm

[slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Prentice Bisbal
Slurm Users, I would like your input on the best way to configure Slurm for a heterogeneous cluster I am responsible for. This e-mail will probably be a bit long to include all the necessary details of my environment so thanks in advance to those of you who read all of it! The cluster I supp

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Prentice Bisbal
ed to use the resources. Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 1/22/19 3:38 PM, Prentice Bisbal wrote: Slurm Users, I would like your input on the best way to configure Slurm for a heterogeneous cluster I am responsible for. This e-mail

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-23 Thread Prentice Bisbal
d the logic of the feature/constraint system to be quite elegant for meeting complex needs of heterogeneous systems. Best, Cyrus On 1/22/19 2:49 PM, Prentice Bisbal wrote: I left out a a *very* critical detail: One of the reasons I'm looking at revamping my Slurm configuration is that my us

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-28 Thread Prentice Bisbal
. Any job that needed that license would stay queued while other jobs that used a different file system would keep humming along. Anyway, feel free to ping off-list too if there are other ideas that you'd like to spitball about. Best, Cyrus On 1/23/19 9:00 AM, Prentice Bisbal wrote: Cyrus

[slurm-users] Assigning a QOS to a partition?

2019-01-29 Thread Prentice Bisbal
How does one assign a QOS to a partition? This is mentioned several different places in the Slurm documentation, but nowhere does it explain exactly how to do this. You can assign a QOS to a partition in slurm.conf like this: PartitionName=mypartition Nodes=node[001-100] QOS=myqos But that do
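For reference, the two pieces involved (a sketch; the partition line is the one from the post, and the sacctmgr limit is an invented example):

    # slurm.conf
    PartitionName=mypartition Nodes=node[001-100] QOS=myqos

    # one-time setup in the accounting database
    sacctmgr add qos myqos
    sacctmgr modify qos myqos set MaxTRESPerUser=cpu=64

With QOS= set on the partition, the limits of that QOS apply to every job in the partition; users do not have to request it with -q.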

Re: [slurm-users] Assigning a QOS to a partition?

2019-01-30 Thread Prentice Bisbal
= PARTITION_DEBUG) then   slurm.log_info("::slurm_job_submit partition DEBUG. Original QOS: %s, new QOS: %s", job_desc.qos, QOS_DEBUG)   job_desc.qos=QOS_DEBUG   slurm.log_user("Setting QoS=%s for this job.", QOS_DEBUG) end [...] Hope this helps. Miguel On 29 Jan 2019, at 16:27, Prentice Bisbal

[slurm-users] Error in job_submit.lua conditional?

2019-02-04 Thread Prentice Bisbal
Can anyone see an error in this conditional in my job_submit.lua?     if ( job_desc.user_id == 28922 or job_desc.user_id == 41266 ) and ( job_desc.partition == 'general' or job_desc.partition == 'interruptible' ) then     job_desc.qos = job_desc.partition     return slurm.SUCCESS     e
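For context, the same conditional embedded in a complete handler (a sketch; the UIDs and partition names are the ones from the post):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if ( job_desc.user_id == 28922 or job_desc.user_id == 41266 ) and
           ( job_desc.partition == 'general' or job_desc.partition == 'interruptible' ) then
            -- give the job a QOS named after its partition
            job_desc.qos = job_desc.partition
        end
        return slurm.SUCCESS
    end

As the follow-ups in this thread show, the conditional itself was fine and the real problem lay elsewhere.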

Re: [slurm-users] Error in job_submit.lua conditional?

2019-02-06 Thread Prentice Bisbal
all work fine. I really hope someone on this list has more sensitive eyeballs than I do. Prentice On 2/5/19 8:12 AM, Marcus Wagner wrote: Hmm..., no, I was wrong. IT IS 'user_id'. Now I'm a bit dazzled Marcus On 2/4/19 11:27 PM, Prentice Bisbal wrote: Can anyone see an

Re: [slurm-users] Error in job_submit.lua conditional?

2019-02-06 Thread Prentice Bisbal
it the output to a specific user as below: if job_desc.user_name == "mercan" then     slurm.log_user("job_desc.user_id=")     slurm.log_user(job_desc.user_id)     slurm.log_user("job_desc.partition=")     slurm.log_user(job_desc.partition) end Ahmet M. On 5.02.20

Re: [slurm-users] Error in job_submit.lua conditional?

2019-02-06 Thread Prentice Bisbal
f for username the debug entry is set to 1. It then prints the debug message. So you can use something like debug("This is a test message") and only the users, whose debug flag is set, see this message. As long as you use "debug" for debugging messages, the "normal" u

Re: [slurm-users] Does latest slurm version still work on CentOS 6?

2019-02-11 Thread Prentice Bisbal
Also, make sure no 3rd party packages installed software that installs files in the systemd directories. The legacy spec file still checks for systemd files to be present: if [ -d /usr/lib/systemd/system ]; then    install -D -m644 etc/slurmctld.service $RPM_BUILD_ROOT/usr/lib/systemd/system/s

Re: [slurm-users] Error in job_submit.lua conditional?

2019-02-13 Thread Prentice Bisbal
user_id, I knew my problem had to be elsewhere. Prentice On 2/4/19 5:27 PM, Prentice Bisbal wrote: Can anyone see an error in this conditional in my job_submit.lua?     if ( job_desc.user_id == 28922 or job_desc.user_id == 41266 ) and ( job_desc.partition == 'general' o

Re: [slurm-users] Strange error, submission denied

2019-02-19 Thread Prentice Bisbal
--ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html: *--ntasks-per-node*= Request that /ntasks/ be invoked on each node. If used with the *--ntasks* option, the *--ntasks* option will take precedence and the *--ntasks-
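A typical pairing of the two options in a batch script (a sketch; the program name is a placeholder):

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=8    # 4 x 8 = 32 tasks total
    srun ./my_mpi_program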

Re: [slurm-users] Priority access for a group of users

2019-02-19 Thread Prentice Bisbal
I just set this up a couple of weeks ago myself. Creating two partitions is definitely the way to go. I created one partition, "general" for normal, general-access jobs, and another, "interruptible" for general-access jobs that can be interrupted, and then set PriorityTier accordingly in my slu
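A sketch of what that looks like in slurm.conf (node lists, tier values, and preemption settings are illustrative; partition-level PreemptMode assumes PreemptType=preempt/partition_prio is set globally):

    PartitionName=general       Nodes=node[001-100] PriorityTier=10
    PartitionName=interruptible Nodes=node[001-100] PriorityTier=1 PreemptMode=requeue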

Re: [slurm-users] Strange error, submission denied

2019-02-20 Thread Prentice Bisbal
On 2/20/19 12:08 AM, Marcus Wagner wrote: Hi Prentice, On 2/19/19 2:58 PM, Prentice Bisbal wrote: --ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html: *--ntasks-per-node*= Request that /ntasks/ be invoked on each node

Re: [slurm-users] SLURM docs: HTML title should be same as page title

2019-02-22 Thread Prentice Bisbal
On 2/22/19 9:53 AM, Patrice Peterson wrote: Hello, it's a little inconvenient that the title tag of all SLURM doc pages only says "Slurm Workload Manager". I usually have tabs to many SLURM doc pages open and it's difficult to differentiate between them all. Would it be possible to change the

Re: [slurm-users] pam_slurm_adopt with pbis-open pam modules

2019-02-22 Thread Prentice Bisbal
On 2/22/19 12:54 AM, Chris Samuel wrote: On Thursday, 21 February 2019 8:20:36 AM PST נדב טולדו wrote: Yeah I have, before I installed pbis and introduced lsass.so the slurm module worked well. Is there any way to debug? I am seeing in syslog that the slurm module is adopting into the job contex

Re: [slurm-users] [slurm-announce] Slurm versions 18.08.6 is now available, as well as 19.05.0pre2, and Slurm on GCP update

2019-03-07 Thread Prentice Bisbal
On 3/7/19 4:39 PM, Tim Wickberg wrote: -- docs - change HTML title to include the page title or man page name. Thanks for this change! -- Prentice

Re: [slurm-users] Slurm cannot kill a job which time limit exhausted

2019-03-19 Thread Prentice Bisbal
Slurm is trying to kill the job that is exceeding its time limit, but the job doesn't die, so Slurm marks the node down because it sees this as a problem with the node. Increasing the value for GraceTime or KillWait might help: *GraceTime* Specifies, in units of seconds, the preemption
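Both of the quoted parameters live in slurm.conf. A sketch with illustrative values (UnkillableStepTimeout is added here as an assumption; it is the setting that governs how long slurmd waits on a step that ignores SIGKILL before flagging the node):

    KillWait=60                 # seconds between SIGTERM and SIGKILL at the time limit
    UnkillableStepTimeout=180   # seconds before a stuck step is declared unkillable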

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand. As a system admin who manages clusters myself, I don't understand this. Our job is

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/21/19 11:49 AM, Ryan Novosielski wrote: On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 3/21/19 12:21 PM, Loris Bennett wrote: Hi Ryan, Ryan Novosielski writes: On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20

Re: [slurm-users] Database Tuning w/SLURM

2019-03-21 Thread Prentice Bisbal
On 3/21/19 1:56 PM, Ryan Novosielski wrote: On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote: Our last cluster only hit around 2.5 million jobs after around 6 years, so database conversion was never an issue. For sites with a higher-throughput things may be different, but I would hope tha

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/21/19 4:40 PM, Reuti wrote: Am 21.03.2019 um 16:26 schrieb Prentice Bisbal : On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand

[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-21 Thread Prentice Bisbal
Slurm-users, My users here have developed a GUI application which serves as a GUI interface to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs t

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Prentice Bisbal
On 3/21/19 6:56 PM, Reuti wrote: Am 21.03.2019 um 23:43 schrieb Prentice Bisbal: Slurm-users, My users here have developed a GUI application which serves as a GUI interface to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Prentice Bisbal
On 3/22/19 12:40 PM, Reuti wrote: Am 22.03.2019 um 16:20 schrieb Prentice Bisbal : On 3/21/19 6:56 PM, Reuti wrote: Am 21.03.2019 um 23:43 schrieb Prentice Bisbal: Slurm-users, My users here have developed a GUI application which serves as a GUI interface to various physics codes they use

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Prentice Bisbal
ugh I confess I am unable to think of any reasonable environmental setting that might cause the observed symptoms. On Fri, Mar 22, 2019 at 11:23 AM Prentice Bisbal wrote: On 3/21/19 6:56 PM, Reuti wrote: > Am 21.03.2019 um 23:43 schrieb

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Prentice Bisbal
Chris, I use that -x switch all the time in other situations. Don't know why I didn't think of using it in this one. Thanks for reminding me of that. Prentice On 3/22/19 1:18 PM, Christopher Samuel wrote: On 3/21/19 3:43 PM, Prentice Bisbal wrote: #!/bin/tcsh Old school script

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-22 Thread Prentice Bisbal
prior. They told me that they actually plan to update SLURM but not until late 2019 because they have other things to do before that. Also, I'm the only one asking for heterogeneous jobs... Rafael. On Thu, Mar 21, 2019 at 22:19, Prentice Bisbal wrote

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-22 Thread Prentice Bisbal
This is the first place I've had regularly scheduled maintenance, too, and boy does it make life easier. In most of my previous jobs, it was a small enough environment that it wasn't necessary. On 3/22/19 1:57 PM, Christopher Samuel wrote: On 3/22/19 10:31 AM, Prentice Bisbal wro

Re: [slurm-users] strange resource allocation issue - thoughts?

2019-03-27 Thread Prentice Bisbal
On 3/23/19 2:16 PM, Sharma, M D wrote: Hi folks, By default slurm allocates the whole node for a job (even if it specifically requested a single core). This is usually taken care of by adding SelectType=select/cons_res along with an appropriate parameter such as SelectTypeParameters=CR_Core_

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Prentice Bisbal
On 3/25/19 8:09 AM, Mahmood Naderan wrote: Hi Is it possible to submit a multinode mpi job with the following config: Node1: 16 cpu, 90GB Node2: 8 cpu, 20GB ? Regards, Mahmood Yes: sbatch -n 24 -w  Node1,Node2 That will allocate 24 cores (tasks, technically) to your job, and only use Node
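As a batch script, the same request might read (a sketch; the binary name is a placeholder):

    #!/bin/bash
    #SBATCH -n 24
    #SBATCH -w Node1,Node2
    srun ./my_mpi_program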

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Prentice Bisbal
On 3/27/19 11:25 AM, Christopher Samuel wrote: On 3/27/19 8:07 AM, Prentice Bisbal wrote: sbatch -n 24 -w  Node1,Node2 That will allocate 24 cores (tasks, technically) to your job, and only use Node1 and Node2. You did not mention any memory requirements of your job, so I assumed memory is

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-04-01 Thread Prentice Bisbal
On 3/28/19 1:25 PM, Reuti wrote: Hi, Am 22.03.2019 um 16:20 schrieb Prentice Bisbal : On 3/21/19 6:56 PM, Reuti wrote: Am 21.03.2019 um 23:43 schrieb Prentice Bisbal: Slurm-users, My users here have developed a GUI application which serves as a GUI interface to various physics codes

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-03 Thread Prentice Bisbal
the dev stated that they’d rather keep that warning than fixing the issue, so I’m not sure if that’ll be enough to convince them. Anyone else as disappointed by this as I am? I get that it's too late to add something like this to 17.11 or 18.08, but it seems like SchedMD isn't even interested i

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-04 Thread Prentice Bisbal
15:33 schrieb Prentice Bisbal : the dev stated that they’d rather keep that warning than fixing the issue, so I’m not sure if that’ll be enough to convince them. Anyone else as disappointed by this as I am? I get that it's too late to add something like this to 17.11 or 18.08, but it seems li

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Prentice Bisbal
Sergi, I'm working with Bill on this project. Is all the hardware identification/mapping and task affinity working as expected/desired with the Power9? I assume your answer implies "yes", but I just want to make sure. Prentice On 4/16/19 10:37 AM, Sergi More wrote: Hi, We have a Power9 cl

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Prentice Bisbal
. Getting POWERAI running correctly (bugs since fixed in newer release) and apps properly built and linked to ESSL was the long march. regards, s On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal wrote: Sergi, I'm working with Bill on this proje

Re: [slurm-users] Pending with resource problems

2019-04-17 Thread Prentice Bisbal
Mahmood, What do you see as the problem here? To me, there is no problem and the scheduler is working exactly as it should. The reason "Resources" means that there are not enough computing resources available for your job to run right now, so the job is sitting in the queue in the pending sta

Re: [slurm-users] Pending with resource problems

2019-04-17 Thread Prentice Bisbal
n specified. I thought that slurm will dynamically handle that in order to put more jobs in running state. Regards, Mahmood On Wed, Apr 17, 2019 at 7:54 PM Prentice Bisbal wrote: Mahmood, What do you see as the problem here? To me, there i

[slurm-users] Increasing job priority based on resources requested.

2019-04-18 Thread Prentice Bisbal
Slurm-users, Is there a way to increase a job's priority based on the resources or constraints it has requested? For example, we have a very heterogeneous cluster here: Some nodes only have 1 Gb Ethernet, some have 10 Gb Ethernet, and others have DDR IB. In addition, we have some large memory

Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-19 Thread Prentice Bisbal
Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Apr 18, 2019, at 5:20 PM, Prentice Bisbal wrote: Slurm-users, Is there a way to increase a job's prio

Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-19 Thread Prentice Bisbal
ble nodes they fit in, and the larger or more feature-rich nodes have a kind of soft reservation either for large jobs or for busy times. Cheers, Chris - Original Message - From: "Prentice Bisbal" To: slurm-users@lists.schedmd.com Sent: Friday, April 19, 2019 11:27:08 AM Subjec

Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-22 Thread Prentice Bisbal
smaller memory jobs to go only to low memory nodes and keep large memory nodes free from trash jobs. The disadvantage is that large mem nodes would wait idle if only low mem jobs are in the queue. cheers, P - Original Message ----- From: "Prentice Bisbal" To: slurm-users@li
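The approach described here hangs off the per-node Weight parameter in slurm.conf: nodes with lower weights are allocated first, so small jobs fill the small nodes before touching the big-memory ones. A sketch with invented names and sizes:

    NodeName=lowmem[01-50] RealMemory=64000   Weight=1
    NodeName=bigmem[01-04] RealMemory=1024000 Weight=100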

Re: [slurm-users] Job dispatching policy

2019-04-23 Thread Prentice Bisbal
On 4/23/19 2:47 AM, Mahmood Naderan wrote: Hi, How can I change the job distribution policy? Since some nodes are running non-slurm jobs, it seems that the dispatcher isn't aware of system load. Therefore, it assumes that the node is free. I want to change the policy based on the system load

Re: [slurm-users] Limit concurrent gpu resources

2019-04-24 Thread Prentice Bisbal
Here's how we handle this here: Create a separate partition named debug that also contains that node. Give the debug partition a very short time limit, say 30-60 minutes. Long enough for debugging, but too short to do any real work. Make the priority of the debug partition much higher than t
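A sketch of that layout in slurm.conf (partition names, node lists, and numbers are illustrative):

    PartitionName=general Nodes=node[001-100] MaxTime=2-00:00:00 PriorityTier=1 Default=YES
    PartitionName=debug   Nodes=node001      MaxTime=00:30:00   PriorityTier=10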

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Prentice Bisbal
We're running nscd on all nodes, with an extremely stable list of users/accounts, so I think we should be good here. Don't bet on it. I've had issues in the past with nscd in similar situations to this.  There's a reason that daemon has a "paranoid" option. Hostname should be completely local

Re: [slurm-users] Job dispatching policy

2019-04-29 Thread Prentice Bisbal
I see two separate, unrelated problems here: Problem 1: Warning: untrusted X11 forwarding setup failed: xauth key data not generated What have you done to investigate this xauth problem further? I know there have been discussions about this problem in the past on this mailing list. Did you

Re: [slurm-users] [External] Proposal for new TRES - "Processor Performance Units"....

2019-06-21 Thread Prentice Bisbal
In this case, I would run LINPACK on each generation of node (either the full node or just one core), and then somehow normalize performance. I  would recommend using the performance of a single core of the slowest node as your basis for normalization so it has a multiplier of 1, and then the n
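A worked example of that normalization (GFLOPS figures invented for illustration):

    oldest core:  10 GFLOPS  ->  10/10 = 1.0 units per core-hour
    mid-gen core: 18 GFLOPS  ->  18/10 = 1.8 units per core-hour
    newest core:  25 GFLOPS  ->  25/10 = 2.5 units per core-hour

so an hour on a new core would be charged as 2.5 "processor performance units" against the same allocation.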

[slurm-users] Slurm showing 100% utilization since last maintenance window

2019-07-11 Thread Prentice Bisbal
I have a strange issue: sreport is showing 100% utilization for our cluster every day since June 18. What is interesting about this is June 18th was our last maintenance outage, when all the nodes were rebooted, including our slurm server which runs both slurmdbd and slurmctld. Has anyone else

[slurm-users] Invalid qos specification

2019-07-15 Thread Prentice Bisbal
Slurm users, I have created a partition named general that should allow the QOSes 'general' and 'debug': PartitionName=general Default=YES AllowQOS=general,debug Nodes=. However, when I try to request that QOS, I get an error: $ salloc -p general -q debug  -t 00:30:00 salloc: error: Job submi

Re: [slurm-users] Invalid qos specification

2019-07-15 Thread Prentice Bisbal
I should add that I still get this error even when I remove the "AllowQOS" attribute from the partition definition altogether: $ salloc -p general -q debug  -t 00:30:00 salloc: error: Job submit/allocate failed: Invalid qos specification Prentice On 7/15/19 2:22 PM, Prentice Bi

Re: [slurm-users] [External] Re: Invalid qos specification

2019-07-15 Thread Prentice Bisbal
wrote: On 7/15/19 11:22 AM, Prentice Bisbal wrote: $ salloc -p general -q debug  -t 00:30:00 salloc: error: Job submit/allocate failed: Invalid qos specification what does: scontrol show part general say? Also, does the user you're testing as have access to that QOS? All the best, Chris

Re: [slurm-users] [External] Re: Invalid qos specification

2019-07-15 Thread Prentice Bisbal
sacctmgr. Right now I'd say if you did sacctmgr show user withassoc that the QoS you're attempting to use is NOT listed as part of the association. On Mon, Jul 15, 2019 at 2:53 PM Prentice Bisbal wrote: Slurm users, I have created a part

Re: [slurm-users] [External] Re: Invalid qos specification

2019-07-15 Thread Prentice Bisbal
Account=unix QOS=debug I will go stand in the corner now... Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 7/15/19 3:18 PM, Prentice Bisbal wrote: That explanation makes perfect sense, but after adding debug to my list of QOSes in my associations
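For anyone hitting the same "Invalid qos specification", the fix boils down to making the QOS part of the user's association, along these lines (a sketch; 'someuser' is a placeholder, while 'unix' and 'debug' come from the thread):

    sacctmgr show assoc user=someuser format=user,account,qos
    sacctmgr modify user someuser where account=unix set qos+=debug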

Re: [slurm-users] [External] Scheduling GPUS

2019-11-11 Thread Prentice Bisbal
Remove this line: #SBATCH --nodes=1 Slurm assumes you're requesting the whole node. --ntasks=1 should be adequate. On 11/7/19 4:19 PM, Mike Mosley wrote: Greetings all: I'm attempting to  configure the scheduler to schedule our GPU boxes but have run into a bit of a snag. I have a box wi
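A minimal script along those lines (a sketch; it assumes the GRES is named 'gpu' and the binary is a placeholder):

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1
    ./my_gpu_program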

[slurm-users] Get GPU usage from sacct?

2019-11-14 Thread Prentice Bisbal
Is there any way to see how much a job used the GPU(s) on a cluster using sacct or any other slurm command? -- Prentice

Re: [slurm-users] [External] Re: Get GPU usage from sacct?

2019-11-14 Thread Prentice Bisbal
/19 1:48 PM, Ryan Novosielski wrote: Do you mean akin to what some would consider "CPU efficiency" on a CPU job? "How much... used" is a little vague. From: slurm-users on behalf of Prentice Bisbal Sent: Thursday, November 14,

Re: [slurm-users] [External] Re: Get GPU usage from sacct?

2019-11-15 Thread Prentice Bisbal
Thanks! Prentice On 11/15/19 6:58 AM, Janne Blomqvist wrote: On 14/11/2019 20.41, Prentice Bisbal wrote: Is there any way to see how much a job used the GPU(s) on a cluster using sacct or any other slurm command? We have created https://github.com/AaltoScienceIT/ansible-role-sacct_gpu/ as a

Re: [slurm-users] [External] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-13 Thread Prentice Bisbal
Does Slurm provide an option to allow developers submit jobs right from their own PCs? Yes. They just  need to have the relevant Slurm packages installed, and the necessary configuration file(s). Prentice On 12/11/19 11:39 PM, Victor (Weikai) Xie wrote: Hi, We are trying to setup a tiny

Re: [slurm-users] [External] Re: Partition question

2019-12-19 Thread Prentice Bisbal
On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote: The simplest is probably to just have a separate partition that will only allow job times of 1 hour or less. This is how our Univa queues used to work, by overlapping the same hardware. Univa shows available “slots” to the users and we had a

Re: [slurm-users] [External] Re: Why does the make install path get hard coded into the slurmd binary?

2020-02-19 Thread Prentice Bisbal
The binaries are probably compiled with the RPATH switch enabled. https://en.wikipedia.org/wiki/Rpath Prentice On 2/18/20 6:18 PM, Dean Schulze wrote: The ./configure --prefix=... value gets into the Makefiles, which is no surprise.  But it is also getting into the slurmd binary and .o files.
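An easy way to check a binary for a baked-in RPATH, using standard binutils rather than anything Slurm-specific:

    readelf -d slurmd | grep -E 'RPATH|RUNPATH'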

[slurm-users] 19.05 not recognizing DefMemPerCPU?

2020-03-23 Thread Prentice Bisbal
Last week I upgraded from Slurm 18.08 to Slurm 19.05. Since that time, several users have reported to me that they can't submit jobs without specifying a memory requirement. In a way, this is intended - my job_submit.lua script checks to make sure that --mem or --mem-per-cpu is specified, and

Re: [slurm-users] [External] Question about partition and node allocation

2020-04-08 Thread Prentice Bisbal
I believe you can do this with QOS, by assigning group limits to the QOS. Prentice On 4/8/20 1:38 PM, Renata Maria Dart wrote: Hi, is there a way to specify a certain number of nodes for a partition without specifying exactly which nodes to use? For instance, if I have 100 hosts and would lik

[slurm-users] Building Slurm on CentOS 7 with PMIx support

2020-04-24 Thread Prentice Bisbal
Slurm-users, Is there anything special needed to build Slurm 19.05.5 on CentOS 7 with PMIx support? We have the pmix and pmix-devel packages that come with CentOS 7 installed, and built the RPMs with the following rpmbuild command: rpmbuild -ta --with munge --with pam --with lua  --with pmix
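A quick sanity check on a finished build is to ask Slurm which MPI plugins it has:

    srun --mpi=list    # 'pmix' should appear if PMIx support was compiled in

If it is missing, pointing the build at the PMIx install explicitly (e.g. rpmbuild --define '_with_pmix --with-pmix=/usr' ...) is a commonly suggested fix, though the exact knob depends on the slurm.spec shipped with your version.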

[slurm-users] jobs not running with srun

2020-04-24 Thread Prentice Bisbal
We are in the process of upgrading to CentOS 7, and have built Slurm 19.05.5 and OpenMPI 4.0.3 for CentOS 7. When I submit a job that launches using srun, the job appears to be running according to squeue (state = R), but the program doesn't do anything. I'm testing with a simple Hello, World progra

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-09 Thread Prentice Bisbal
out needing hackish extra utilities to create and manage cluster-specific passphraseless key pairs for every single user! :-) There's a great cookbook online that tells you step-by-step how to set it up: https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication HTH! Michael -

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-10 Thread Prentice Bisbal
le Holm Nielsen wrote: Hi Prentice, Could you kindly elaborate on this statement?  Is host-based security safe inside a compute cluster compared to user-based SSH keys? Thanks, Ole On 09-06-2020 21:26, Prentice Bisbal wrote: Host-based security is not considered as safe as user-based security, so

[slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
I maintain a very heterogeneous cluster (different processors, different amounts of RAM, etc.) I have a user reporting the following problem. He's running the same job multiple times with different input parameters. The jobs run fine unless they land on specific nodes. He's specifying --mem=2G

Re: [slurm-users] [External] Re: Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
of NJ     | Office of Advanced Research Computing - MSB C630, Newark `' On Jul 2, 2020, at 09:53, Prentice Bisbal wrote: I maintain a very heterogeneous cluster (different processors, different amounts of RAM, etc.) I have a user reporting the following problem. He's running t

Re: [slurm-users] [External] Re: Internet connection loss with srun to a node

2020-08-03 Thread Prentice Bisbal
^C --- google.com ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2026ms I guess that is related to slurm and srun. Any idea for that? Regards, Mahmood -- Prentice Bisbal Lead Software Engineer Research Computing Princeton Plasma Physics Laboratory http://www.pppl.gov

Re: [slurm-users] [External] Cancel "reboot ASAP" for a node

2020-08-07 Thread Prentice Bisbal
=0 ExtSensorsTemp=n/s Reason=Reboot ASAP [root@2020-08-06T10:29:22] Any thoughts as to how to cancel the reboot? Mike Hanby mhanby @ uab.edu Systems Analyst III - Enterprise IT Research Computing Services The University of Alabama at Birmingham -- Prentice Bisbal Lead Softwa

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-12 Thread Prentice Bisbal
Max, You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is. -- Prentice On 8/12/20 6:55 AM, Max Quast wrote: I am also trying to use ucx with slurm/PMIx and get the same error.  Also mpirun with "--mca pml ucx" works fine. Used versions: Ubu

Re: [slurm-users] [External] Limit nodes of a partition without managing users

2020-08-17 Thread Prentice Bisbal
Yes, you can do this using Slurm's QOS facility to limit the number of nodes used simultaneously: for the high-priority partition, you can use the GrpTRES setting to limit how many nodes or CPUs a QOS can use. -- Prentice On 8/17/20 1:13 PM, Gerhard Strangar wrote: Hello, I'm wondering if it'
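A sketch of such a QOS and its attachment to the partition (names and limits are illustrative):

    sacctmgr add qos highprio
    sacctmgr modify qos highprio set GrpTRES=cpu=256,node=8

    # slurm.conf
    PartitionName=highprio Nodes=node[001-020] QOS=highprio PriorityTier=10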

Re: [slurm-users] [External] Re: Simple free for all cluster

2020-10-22 Thread Prentice Bisbal
y to try and find that balance between good utilisation, effective use of the system and reaching the desired science/research/development outcomes. Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA -- Prentice Bisbal Lead Software Engineer Research Computing Princeton Plasma Physics Laboratory http://www.pppl.gov

Re: [slurm-users] [External] Limit usage outside reservation

2020-10-22 Thread Prentice Bisbal
On 10/20/20 3:01 AM, SJTU wrote: Hi, We reserved compute node resource on SLURM for specific users and hope they will make good use of it. But in some cases users forgot the '--reservation' parameter in job scripts, competing with other users outside the reserved nodes. Is there a recommended

Re: [slurm-users] [External] Munge thinks clocks aren't synced

2020-10-27 Thread Prentice Bisbal

Re: [slurm-users] [External] Munge thinks clocks aren't synced

2020-10-29 Thread Prentice Bisbal
node, but I didn’t anticipate ntp and munge issues. Thanks, Gard *From: *slurm-users on behalf of Prentice Bisbal *Reply-To: *Slurm User Community List *Date: *Tuesday, October 27, 2020 at 12:22 PM *To: *"slurm-users@lists.schedmd.com" *Subject: *Re: [slurm-users] [External] Mun

Re: [slurm-users] [External] Munge thinks clocks aren't synced

2020-10-29 Thread Prentice Bisbal
up a node on a different subnet. I figured it'd be simple to point slurm to the new node, but I didn't anticipate ntp and munge issues. Thanks, Gard *From: *slurm-users on behalf of Prentice Bisbal

Re: [slurm-users] [External] Re: can't lengthen my jobs log

2020-12-04 Thread Prentice Bisbal
I know I'm very late to this thread, but were/are you using the --allusers flag to sacct? If not, sacct only returns results for the user running the command (not sure if this is the case for root - I never need to run sacct as root). This minor detail tripped me up a few days ago when I was ex

Re: [slurm-users] [External] Slurm Upgrade Philosophy?

2020-12-23 Thread Prentice Bisbal
We generally upgrade within 1-2 maintenance windows of a new release coming out, so within a couple of months of the release being available. For minor updates, we update at the next maintenance window. At one point, we were stuck several releases behind. Getting all caught up wasn't that bad. I

[slurm-users] Moving Slurmctld and slurmdbd to a new host

2021-01-15 Thread Prentice Bisbal
Slurm users, I'm planning on moving slurmctld and slurmdbd to a new host. I know how to dump the MySQL DB from the old server and import it to the new slurmdbd host, and I know how to copy the job state directories to the new host. I plan on doing this during our next maintenance window when
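The database half of the move is a standard MySQL dump and restore (a sketch; it assumes the default database name slurm_acct_db):

    # on the old host, with slurmdbd stopped
    mysqldump slurm_acct_db > slurm_acct_db.sql

    # on the new host
    mysql -e 'CREATE DATABASE slurm_acct_db'
    mysql slurm_acct_db < slurm_acct_db.sql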

Re: [slurm-users] [External] Re: Moving Slurmctld and slurmdbd to a new host

2021-01-19 Thread Prentice Bisbal
`' On Jan 15, 2021, at 13:44, Prentice Bisbal wrote: Slurm users, I'm planning on moving slurmctld and slurmdbd to a new host. I know how to dump the MySQL DB from the old server and import it to the new slurmdbd host, and I know ho

Re: [slurm-users] [External] Re: Using "Environment Modules" in a SLURM script

2021-01-25 Thread Prentice Bisbal
ronment variables into the script or is it a problem due to my distribution (CentOS-7)??? Thanks. -- Tom Payerle DIT-ACIGS/Mid-Atlantic Crossroads paye...@umd.edu 5825 University Research Park   (301) 405-6135 University of Maryland College Park, MD 20740
