[slurm-users] Determine usage for a QOS?

2018-08-19 Thread Christopher Samuel
Hi folks, After an extended hiatus (I forgot to resubscribe after going away for a few weeks) I'm back.. ;-) We are using QOS's for projects which have been granted a fixed set of time for higher priority work which works nicely, but have just been asked the obvious question "how much time do we

Re: [slurm-users] Determine usage for a QOS?

2018-08-19 Thread Christopher Samuel
Hi Paul, On 20/08/18 11:36, Paul Edmon wrote: I don't really have enough experience with QoS's to give a slicker method but you could use squeue --qos to poll the QoS and then write a wrapper to do the summarization.  It's hacky but it should work. I was thinking sacct -q ${QOS} to pull info
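A sketch of what such a summary wrapper might look like (the QOS name, dates and field list are placeholders, not taken from the thread):

    # Sum the CPU time charged against one QOS over a period.
    sacct -a -X --qos=highprio --starttime=2018-07-01 --endtime=2018-08-01 \
          --format=JobID,User,Account,Elapsed,AllocCPUS,CPUTimeRAW --noheader \
        | awk '{sum += $NF} END {printf "%.1f core-hours used\n", sum/3600}'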

Re: [slurm-users] how can users start their worker daemons using srun?

2018-08-28 Thread Christopher Samuel
On 29/08/18 09:10, Priedhorsky, Reid wrote: This is surprising to me, as my interpretation is that the first run should allocate only one CPU, leaving 35 for the second srun, which also only needs one CPU and need not wait. Is this behavior expected? Am I missing something? That's odd - and I

Re: [slurm-users] ubuntu 16.04 > 18.04

2018-09-13 Thread Christopher Samuel
On 13/09/18 03:44, A wrote: Thinking about upgrading to Ubuntu 18.04 on my workstation, where I am running a single node slurm setup. Any issues any one has run across in the update? If you are using slurmdbd that's too large a jump, you'll need to upgrade to an intermediate version first. Th

Re: [slurm-users] swap size

2018-09-23 Thread Christopher Samuel
On 24/09/18 00:46, Raymond Wan wrote: Hmm, I'm way out of my comfort zone but I am curious about what happens.  Unfortunately, I don't think I'm able to read kernel code, but someone here (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel) seems to suggest

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-10-24 Thread Christopher Samuel
On 24/10/18 9:37 pm, Chris Samuel wrote: We're on 17.11.7 (for the moment, starting to plan upgrade to 18.08.x). From the NEWS file in 17.11.x (in this case for 17.11.10): -- Fix pam_slurm_adopt to honor action_adopt_failure. Could explain why this isn't something we see consistently, and w

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-10-24 Thread Christopher Samuel
On 25/10/18 2:29 pm, Christopher Samuel wrote: Could explain why this isn't something we see consistently, and why we're both seeing it currently. This seems to be a handy way to find any processes that are not properly constrained by Slurm cgroups on compute nodes (at le

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Christopher Samuel
On 7/11/18 7:35 am, Brian Andrus wrote: I am able to submit using account=projectB on cluster3. ??? Since 'projectB' is a child of account 'DevOps', which is only associated with cluster1 and cluster2, shouldn't I be denied the ability to run using that account on cluster3? What does this sa

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Christopher Samuel
On 7/11/18 1:57 pm, Brian Andrus wrote: Ah. I thought I had set that. So I did and now it is: AccountingStorageEnforce = associations,limits But I am still able to request and get resources on cluster3 using projectA as my account.. Heck, I just tried using a fake account (account=asdas) and

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Christopher Samuel
On 7/11/18 2:44 pm, Brian Andrus wrote: Ah just scontrol reconfigure doesn't actually make it take effect. Restarting slurmctld did it. Phew! Glad to hear that's sorted out.. :-) -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] constraints question

2018-11-11 Thread Christopher Samuel
Hi Doug, On 12/11/18 8:34 am, Douglas Jacobsen wrote: I think you'll need to update to 18.08 to get this working, constraint arithmetic and knl were not compatible until that release. Thanks! That's planned for us today (though we're not using constraints) and from the sound of it Tina should

Re: [slurm-users] new user simple question re sacct output line2

2018-11-14 Thread Christopher Samuel
On 15/11/18 12:38 am, Matthew Goulden wrote: sacct output including the default headers is three lines, What is line 2 documenting? Most fields are blank. Ah, well it can be more than 3 lines.. ;-) [csamuel@farnarkle2 tmp]$ sbatch --wrap hostname Submitted batch job 1740982 When I use sacct

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-21 Thread Christopher Samuel
On 22/11/18 5:41 am, Ryan Novosielski wrote: You can see, both of the above are examples of jobs that have allocated CPU numbers that are very different from the ultimate CPU load (the first one using way more than allocated, though they’re in a cgroup so theoretically isolated from the other us

Re: [slurm-users] About x11 support

2018-11-21 Thread Christopher Samuel
On 22/11/18 5:04 am, Mahmood Naderan wrote: The idea is to have a job manager that find the best node for a newly submitted job. If the user has to manually ssh to a node, why one should use slurm or any other thing? You are in a really really unusual situation - in 15 years I've not come ac

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Christopher Samuel
On 22/11/18 12:38 am, Douglas Duckworth wrote: We are setting TmpFS=/scratchLocal in /etc/slurm/slurm.conf on nodes and controller. However $TMPDIR value seems to be /tmp not /scratchLocal. As a result users are writing to /tmp which we do not want. Our solution to that was to use a plugin th
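The plugin itself is not shown in the thread; as an illustration only, a task prolog can have a similar effect, because any line it prints in the form "export NAME=value" is added to the environment of the tasks it precedes:

    #!/bin/bash
    # TaskProlog sketch (an assumption, not the plugin referred to above):
    # create a per-job directory on local scratch and point TMPDIR at it.
    dir="/scratchLocal/${USER}/${SLURM_JOB_ID}"
    mkdir -p "$dir"
    echo "export TMPDIR=$dir"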

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Samuel
On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote: This seems to be related to jobs that can't start due to in our case: AssocGrpMemRunMinutes, and AssocGrpCPURunMinutesLimit Must be a bug relating to GrpTRESRunLimit it seems. Do you mean can't start due to not enough time, or can't star

Re: [slurm-users] Segmentation fault when launching mpi jobs using Intel MPI

2019-02-06 Thread Christopher Samuel
On 2/6/19 9:06 AM, Bob Smith wrote: Any ideas on what is going on? Any reason you're not using "srun" to launch your code? https://slurm.schedmd.com/mpi_guide.html All the best, Chris

Re: [slurm-users] Analyzing a stuck job

2019-02-14 Thread Christopher Samuel
On 2/14/19 8:02 AM, Mahmood Naderan wrote: One job is in RH state which means JobHoldMaxRequeue. The output file, specified by --output shows nothing suspicious. Is there any way to analyze the stuck job? This happens when a job fails to start for MAX_BATCH_REQUEUE times (which is 5 at the mo

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Christopher Samuel
On 2/14/19 12:22 AM, Marcus Wagner wrote: CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905 That's different to what you put in your config in the original email though. There you had: CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 This config

Re: [slurm-users] Reservation with memory

2019-02-15 Thread Christopher Samuel
On 2/15/19 7:17 AM, Arnaud Renard URCA wrote: Does any of you have a solution to consider memory when creating a reservation ? I don't think memory is currently supported for reservations via TRES, it's certainly not listed in the manual page for scontrol either in 18.08 or in master (which

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Christopher Samuel
On 2/22/19 3:54 PM, Aaron Jackson wrote: Happy to answer any questions about our setup. If folks are interested in a mailing list where this discussion would be decidedly on-topic then I'm happy to add people to the Beowulf list where there's a lot of other folks with expertise in this are

Re: [slurm-users] pmix and ucx versions compatibility with slurm

2019-02-26 Thread Christopher Samuel
On 2/26/19 5:13 AM, Daniel Letai wrote: I couldn't find any documentation regarding which api from pmix or ucx Slurm is using, and how stable those api are. There is information about PMIx at least on the SchedMD website: https://slurm.schedmd.com/mpi_guide.html#pmix For UCX I'd suggest test

[slurm-users] Slurm message aggregation

2019-03-04 Thread Christopher Samuel
Hi folks, Anyone here tried Slurm's message aggregation (MsgAggregationParams in slurm.conf) at all? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-03-04 Thread Christopher Samuel
On 2/26/19 5:49 AM, Marcus Wagner wrote: If I remember right, there was a discussion lately in this list regarding the JobAcctGatherType, yet I do not remember the outcome It used to be that SchedMD would strongly recommend the non-group way of gathering information, but that never really wor

Re: [slurm-users] Slurm message aggregation

2019-03-05 Thread Christopher Samuel
On 3/5/19 6:58 AM, Paul Edmon wrote: We tried it once back when they first introduced it and shelved it after we found that we didn't really need it. Thanks Paul. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] How to enable QOS correctly?

2019-03-05 Thread Christopher Samuel
On 3/5/19 7:37 AM, Matthew BETTINGER wrote: Every time we attempt this no one can submit a job, slurm says waiting on resources I believe. We have accounting enabled and everyone is a member of the default qos group "normal". Is it also their default QOS? Do you still have the slurmctld l
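One way to check that (a sketch; the username is a placeholder):

    # Show the user's associations, including the QOS list and default QOS.
    sacctmgr show assoc where user=alice format=Cluster,Account,User,Partition,QOS,DefaultQOS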

Re: [slurm-users] What is the 2^32-1 values in "stepd_connect to .4294967295 failed" telling you

2019-03-08 Thread Christopher Samuel
On 3/8/19 12:25 AM, Kevin Buckley wrote: error: stepd_connect to .1 failed: No such file or directory error: stepd_connect to .4294967295 failed: No such file or directory We can imagine why a job that got killed in step 0 might still be looking for the .1 step but the .2^32-1 is beyond our i

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Christopher Samuel
On 3/19/19 5:31 AM, Peter Steinbach wrote: For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job $ sbatch --wrap="sleep 10 && hostname" -c 3 Can you share the output for "scontrol show job [that job id]" once you submit this please? Also please share "scontrol show node

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-20 Thread Christopher Samuel
On 3/20/19 9:09 AM, Peter Steinbach wrote: Interesting enough, if I add Cores=0-1 and Cores=2-3 to the gres.conf file, everything stops working again. :/ Should I send around scontrol outputs? And yes, I watched out to set the --mem flag for the job submission this time. Well there you've sa

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-20 Thread Christopher Samuel
On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand. Do be aware that Slurm 17.11 will stop being maintained once 19.05 is released in May. So basically my heterogeneous job that only

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Christopher Samuel
On 3/21/19 9:21 AM, Loris Bennett wrote: Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of the ancient packages your distro may provide (other softwar

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread Christopher Samuel
On 3/21/19 6:55 AM, David Baker wrote: it currently one of the highest priority jobs in the batch partition queue What does squeue -j 359323 --start say? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Christopher Samuel
On 3/21/19 3:43 PM, Prentice Bisbal wrote: #!/bin/tcsh Old school script debugging trick - make that line: #!/bin/tcsh -x and then you'll see everything the script is doing. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-22 Thread Christopher Samuel
On 3/22/19 10:31 AM, Prentice Bisbal wrote: Most HPC centers have scheduled downtime on a regular basis. That's not my experience before now; where I've worked in Australia we scheduled maintenance only when we absolutely had to, but there could be delays to them if there were critica

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 8:07 AM, Prentice Bisbal wrote: sbatch -n 24 -w  Node1,Node2 That will allocate 24 cores (tasks, technically) to your job, and only use Node1 and Node2. You did not mention any memory requirements of your job, so I assumed memory is not an issue and didn't specify any in my comman
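As a sketch, the same request expressed as a batch script (node names, the memory request and the program are placeholders; the original command requested no memory at all):

    #!/bin/bash
    #SBATCH -n 24
    #SBATCH -w Node1,Node2
    #SBATCH --mem-per-cpu=2G    # illustrative only; not part of the original command
    srun ./my_mpi_program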

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 8:39 AM, Mahmood Naderan wrote: mpirun pw.x -imos2.rlx.in You will need to read the documentation for this: https://slurm.schedmd.com/heterogeneous_jobs.html Especially note both of these: IMPORTANT: The ability to execute a single application across more th

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 11:29 AM, Mahmood Naderan wrote: Thank you very much. you are right. I got it. Cool, good to hear. I'd love to hear whether you get heterogenous MPI jobs working too! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Christopher Samuel
On 4/11/19 8:27 AM, Randall Radmer wrote: I guess my next question is, are there any negative repercussions to setting "Delegate=yes" in slurmd.service? This was Slurm bug 5292 and was fixed last year: https://bugs.schedmd.com/show_bug.cgi?id=5292 # Commit cecb39ff087731d2 adds Delegate=yes

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-15 Thread Christopher Samuel
On 4/15/19 8:15 AM, Peter Steinbach wrote: We had a feeling that cgroups might be more optimal. Could you point us to documentation that suggests cgroups to be a requirement? Oh it's not a requirement, just that without it there's nothing to stop a process using GPUs outside of its allocation

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Christopher Samuel
On 4/15/19 3:03 PM, Andy Riebs wrote: Run "slurmd -Dvv" as root on one of the compute nodes and it will show you what it thinks is the socket/core/thread configuration. In fact: slurmd -C will tell you what it discovers in a way that you can use in the configuration file. All the best, Ch

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Christopher Samuel
On 4/16/19 1:15 AM, Ran Du wrote: And another question is: how to apply for a number of cards that cannot be divided exactly by 8? For example, to apply for 10 GPU cards, 8 cards on one node and 2 cards on another node? There are new features coming in 19.05 for GPUs to better support them

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Christopher Samuel
On 4/26/19 7:29 AM, Riebs, Andy wrote: In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement. Worth trying: srun /bin/hostname Just in case there's something weird in the path that causes it to hit a network

Re: [slurm-users] Issue with x11

2019-05-14 Thread Christopher Samuel
On 5/14/19 4:00 PM, Mahmood Naderan wrote: srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays. What does this say? echo $DISPLAY All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-14 Thread Christopher Samuel
On 5/14/19 5:09 PM, Mahmood Naderan wrote: Should I modify that parameter on compute-0-0 too? No, but you'll need to logout of rocks7 and ssh back into it. Or are you on the console of the system itself? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 7:32 AM, Tina Friedrich wrote: Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. Cool, Tim has ripped out all the libssh code (which caused me issues at ${JOB-1} because it didn't play nicely with SSH keep alive messages) and replaced it with native handling

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 11:36 AM, Mahmood Naderan wrote: I really like to know why x11 is not so friendly? For example, slurm works with MPI. Why not with X11?! Because MPI support is fundamental, X11 support is nice to have. I suspect 19.05 will make your life an awful lot easier! All the best, Chris --

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 8:53 AM, Mahmood Naderan wrote: Can I ask what is the expected release date for 19? It seems that rc1 has been released in May? Sometime in May hopefully! -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 1:04 AM, Alan Orth wrote: but now we get a handful of nodes drained every day with reason "Kill task failed". In ten years of using SLURM I've never had so many problems as I'm having now. :\ We see "kill task failed" issues but as Marcus says that's not related to X11 support, wh

Re: [slurm-users] Slurm stopped obeying QOS limits - cant figure out why...

2019-05-22 Thread Christopher Samuel
On 5/22/19 6:34 AM, Aravindh Sampathkumar wrote: Nothing has changed recently, and today, I noticed that the QOS limits which were working until now has silently stopped working. A user was able to submit jobs enough to saturate the cluster singlehandedly annoying other users. Can you check
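The reply is cut off above; as a sketch, these are the sorts of things worth checking when QOS limits silently stop being enforced (not necessarily what the original reply went on to ask):

    # Is limit enforcement enabled at all?
    scontrol show config | grep -i AccountingStorageEnforce
    # What limits does the QOS actually carry?
    sacctmgr show qos
    # Is the QOS still attached to the user's association?
    sacctmgr show assoc where user=alice format=Cluster,Account,User,QOS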

Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

2019-06-06 Thread Christopher Samuel
On 6/6/19 10:21 AM, Levi Morrison wrote: This means all OpenMPI programs that end up calling `srun` on Slurm 19.05 will fail. Sounds like a good reason to file a bug. We're not on 19.05 yet so we're not affected (yet) but this may cause us some pain when we get to that point (though at leas

Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

2019-06-06 Thread Christopher Samuel
On 6/6/19 12:01 PM, Kilian Cavalotti wrote: Levi did already. Aha, race condition between searching bugzilla and writing the email. ;-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Christopher Samuel
On 6/18/19 11:29 PM, nathan norton wrote: Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on? Not quite, they exist internally but are not exposed until in use: https://slurm.schedmd.com/elastic_computing.html

Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-21 Thread Christopher Samuel
On 6/13/19 5:27 PM, Kilian Cavalotti wrote: I would take a look at the various *KmemSpace options in cgroups.conf, they can certainly help with this. Specifically I think you'll want: ConstrainKmemSpace=no to fix this. This happens for NFS and Lustre based systems, I don't think it's a pro

Re: [slurm-users] Hide Filesystem From Slurm

2019-07-11 Thread Christopher Samuel
On 7/11/19 8:19 AM, Douglas Duckworth wrote: I am wondering if it's possible to hide a file system, that's world writable on compute node, logically within Slurm.  That way any job a user runs cannot possibly access this file system. Essentially we define $TMPDIR as /scratch, which Slurm clea

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-15 Thread Christopher Samuel
On 7/12/19 6:21 AM, Juergen Salk wrote: I suppose this is nevertheless the expected behavior and just the way it is when using pam_slurm_adopt to restrict access to the compute nodes? Is that right? Or did I miss something obvious? Could it be a RHEL7 specific issue? It looks like it's workin

Re: [slurm-users] Invalid qos specification

2019-07-15 Thread Christopher Samuel
On 7/15/19 11:22 AM, Prentice Bisbal wrote: $ salloc -p general -q debug  -t 00:30:00 salloc: error: Job submit/allocate failed: Invalid qos specification what does: scontrol show part general say? Also, does the user you're testing as have access to that QOS? All the best, Chris -- Chri

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-17 Thread Christopher Samuel
On 7/17/19 4:05 AM, Andy Georges wrote: Can you show what your /etc/pam.d/sshd looks like? For us it's actually here: --- # cat /etc/pam.d/common-account #%PAM-1.0 # # This file is autogenerated by pam-config. All changes # will be o

Re: [slurm-users] Using the manager as compute Node

2019-08-05 Thread Christopher Samuel
On 8/5/19 8:00 AM, wodel youchi wrote: Do I have to declare it, for example with 10 CPUs and 32Gb of RAM to save the rest for the management, or will slurmctld take that in hand? You will need both to declare it and also use cgroups to enforce it so that processes can't overrun that limit.
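A sketch of how that combination might look (the node name and sizes are placeholders):

    # slurm.conf: advertise only the resources jobs may use on the head node.
    NodeName=head01 CPUs=10 RealMemory=32768 State=UNKNOWN
    TaskPlugin=task/cgroup
    # cgroup.conf: have Slurm enforce those limits with cgroups.
    ConstrainCores=yes
    ConstrainRAMSpace=yes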

Re: [slurm-users] AllocNodes on partition no longer working

2019-08-14 Thread Christopher Samuel
On 8/14/19 10:46 AM, Sajdak, Doris wrote: We upgraded from version 18.08.4 to 19.05.1-2 today and are suddenly getting a permission denied error on partitions where we have AllocNodes set.  If we remove the AllocNodes constraint, the job submits successfully but then users can submit from anyw

Re: [slurm-users] AllocNodes on partition no longer working

2019-08-15 Thread Christopher Samuel
On 8/15/19 7:18 AM, Sajdak, Doris wrote: Thanks Chris! That worked. We'd tried IP address but not FQDN. Great to hear! -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Samuel
On 8/15/19 11:02 AM, Mark Hahn wrote: it's in NEWS, if that counts.  also, I note that at least in this commit, --chdir is added but --workdir is not removed from option parsing. It went away here: commit 9118a41e13c2dfb347c19b607bcce91dae70f8c6 Author: Tim Wickberg Date: Tue Mar 12 23:20:

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-09-05 Thread Christopher Samuel
On 8/13/19 10:44 PM, Barbara Krašovec wrote: We still have the gres configuration, users have their workload scripted and some still use sbatch with gres. Both options work. I missed this before Barbara, sorry - that's really good to know that the options aren't mutually exclusive, thank you!

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-09-05 Thread Christopher Samuel
On 9/5/19 3:49 PM, Bill Broadley wrote: I have a user with a particularly flexible code that would like to run a single MPI job across multiple nodes, some with 8 GPUs each, some with 2 GPUs. Perhaps they could just specify a number of tasks with cpus per task, mem per task and GPUs per task

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-10 Thread Christopher Samuel
On 9/4/19 9:40 AM, Sam Gallop (NBI) wrote: I did play around with XFS quotas on our large systems (SGI UV300, HPE MC990-X and Superdome Flex) but I couldn't get it working how I wanted (or how I thought it should work). I'll re-visit it knowing that other people have got XFS quotas working.

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Christopher Samuel
On 9/15/19 4:17 PM, Brian Andrus wrote: Are steps required to capture Max RSS? No, you should see a MaxRSS reported for the batch step, for instance: $ sacct -j $JOBID -o jobid,jobname,maxrss All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

[slurm-users] How to trigger kernel stacktraces for stuck processes from unkillable steps

2019-09-18 Thread Christopher Samuel
Hi all, At the Slurm User Group I mentioned about how to tell the kernel to dump information about stuck processes from your unkillable step script to the kernel log buffer (seen via dmesg and hopefully syslog'd somewhere useful for you). echo w > /proc/sysrq-trigger That's it.. ;-) You pr
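A minimal sketch of a script built around that one-liner (the log path is a placeholder; hooking it up via UnkillableStepProgram in slurm.conf is a site decision):

    #!/bin/bash
    # Ask the kernel to log stack traces of blocked (D-state) tasks,
    # then keep a copy of the tail of dmesg for later inspection.
    echo w > /proc/sysrq-trigger
    sleep 2
    dmesg -T | tail -n 300 > "/tmp/unkillable-$(date +%s).log"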

Re: [slurm-users] How to share GPU resources? (MPS or another way?)

2019-10-09 Thread Christopher Samuel
On 10/8/19 12:30 PM, Goetz, Patrick G wrote: It looks like GPU resources can only be shared by processes run by the same user? This is touched on in this bug https://bugs.schedmd.com/show_bug.cgi?id=7834 where it appears at one point MPS appeared to work for multiple users. It may be that

Re: [slurm-users] Removing user from slurm configuration

2019-10-11 Thread Christopher Samuel
On 10/10/19 8:53 AM, Marcus Wagner wrote: if you REALLY want to get rid of that user, you might need to manipulate the SQL Database. Yeah, I really don't think that would be a safe thing to do. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] understanding resource reservations

2019-10-21 Thread Christopher Samuel
On 10/21/19 3:05 PM, c b wrote: 1) It looks like there's a way to create a daily recurring reservation by specifying "flags=daily" .  How would I make a regular reservation for weekdays only? flags=WEEKDAY Repeat the reservation at the same time on every weekday (Monday, Tuesday, Wedne
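For example (the reservation name, time, users and node list are placeholders):

    # Sketch: a recurring weekday-only reservation.
    scontrol create reservation ReservationName=weekday_maint \
        StartTime=2019-10-28T08:00:00 Duration=02:00:00 \
        Users=root Nodes=node[01-04] Flags=WEEKDAY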

Re: [slurm-users] can't get fairshare to be calculated per partition

2019-10-29 Thread Christopher Samuel
On 10/29/19 12:42 PM, Igor Feghali wrote: fairshare is been calculated for the entire cluster and not per partition. That's correct - jobs can request multiple partitions (and will run in the first one available to service it). All the best, Chris -- Chris Samuel : http://www.csamuel.or

Re: [slurm-users] oom-kill events for no good reason

2019-11-07 Thread Christopher Samuel
On 11/7/19 8:36 AM, David Baker wrote: We are dealing with some weird issue on our shared nodes where job appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer lo

Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Christopher Samuel
On 11/13/19 10:42 AM, Ole Holm Nielsen wrote: Your order of upgrading is *disrecommended*, see for example page 6 in the presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" in the page https://slurm.schedmd.com/publications.html Also the documentation for upgrading here: https://

Re: [slurm-users] Array jobs vs. many jobs

2019-11-22 Thread Christopher Samuel
Hi Ryan, On 11/22/19 12:18 PM, Ryan Novosielski wrote: Quick question that I'm not sure how to find the answer to otherwise: do array jobs have less impact on the scheduler in any way than a whole long list of jobs run the more traditional way? Less startup overhead, anything like that? Slu

Re: [slurm-users] sbatch sending the working directory from the controller to the node

2020-01-22 Thread Christopher Samuel
On 1/21/20 11:27 AM, Dean Schulze wrote: The sbatch docs say nothing about why the node gets the pwd from the controller.  Why would slurm send a directory to a node that may not exist on the node and expect it to use it? That's a pretty standard expectation from a cluster, that the filesyste

Re: [slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-07 Thread Christopher Samuel
Hi Dean, On 2/7/20 8:03 AM, dean.w.schu...@gmail.com wrote: I just checked the .deb package that I build from source and there is nothing in it that has nv or cuda in its name. Are you sure that slurm distributes nvidia binaries? SchedMD only distributes sources, it's up to distros how they

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Christopher Samuel
On 2/25/20 11:41 AM, Dean Schulze wrote: I'm very interested in the "configless" setup for slurm.  Is the setup for configless documented somewhere? Looks like the website has already been updated for the 20.02 documentation, and it looks like it's here: https://slurm.schedmd.com/configless

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Christopher Samuel
On 2/27/20 11:23 AM, Robert Kudyba wrote: OK so does SLURM support MPS and if so what version? Would we need to enable cons_tres and use, e.g., --mem-per-gpu? Slurm 19.05 (and later) supports MPS - here's the docs from the most recent release of 19.05: https://slurm.schedmd.com/archive/slur

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-29 Thread Christopher Samuel
On 2/28/20 8:56 PM, Pär Lundö wrote: I thought that I could run the srun-command with X11-forwarding called from an sbatch-jobarray-script and get the X11-forwarding to my display. No, I believe X11 forwarding can only work when you run "srun --x11" directly on a login node, not from inside a

Re: [slurm-users] Block interactive shell sessions

2020-03-05 Thread Christopher Samuel
On 3/5/20 9:22 AM, Luis Huang wrote: We would like to block certain nodes from accepting interactive jobs. Is this possible on slurm? My suggestion would be to make a partition for interactive jobs that only contains the nodes that you want to run them and then use the submit filter to direc

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-12 Thread Christopher Samuel
On 3/12/20 9:37 PM, Kirill 'kkm' Katsnelson wrote: Aaah, that's a cool find! I never really looked inside my nodes for more than a year since I debugged all my stuff so it "just works". They are conjured out of nothing and dissolve back into nothing after 10 minutes of inactivity. But good to

Re: [slurm-users] Accounting Information from slurmdbd does not reach slurmctld

2020-03-19 Thread Christopher Samuel
On 3/19/20 4:05 AM, Pascal Klink wrote: However, no real answer was given as to why this happened. So we thought that maybe this time someone may have an idea. To me it sounds like either your slurmctld is not correctly registering with slurmdbd, or if it has then slurmdbd cannot connect ba

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Christopher Samuel
On 4/7/20 2:48 PM, Robert Kudyba wrote: How can I get this to work by loading the correct Bright module? You can't - you will need to recompile Slurm. The error says: Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect nvml functionality, but we weren't able to f

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Christopher Samuel
Hi Robert, On 4/8/20 7:08 AM, Robert Kudyba wrote: and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration That's the key phrase - when whoever compiled Slurm ran ./configure *before* compilation it was on a system without the nvidia librari

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Christopher Samuel
On 4/8/20 12:17 PM, Robert Kudyba wrote: As I wrote we use Bright Cluster on CentOS 7.7. So we just follow their instructions  to use yum install slurm20, here they show Slurm 19 but it's the same for 20 In th

Re: [slurm-users] Munge decode failing on new node

2020-04-22 Thread Christopher Samuel
On 4/22/20 12:56 PM, dean.w.schu...@gmail.com wrote: There is a third user account on all machines in the cluster that is the user account for using the cluster. That account has uid 1000 on all four worker nodes, but on the controller it is 1001. So that is probably why the question marks.

Re: [slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-07 Thread Christopher Samuel
On 5/7/20 6:08 AM, Riebs, Andy wrote: Alternatively, you could switch to MariaDB; I've been using that for years. Debian switched to only having MariaDB in 2017 with the release of Debian 9 (Stretch); as a derivative distro, I'm surprised that Ubuntu still packages MySQL. I'd second Andy's

Re: [slurm-users] additional jobs killed by scancel.

2020-05-13 Thread Christopher Samuel
On 5/11/20 9:52 am, Alastair Neil wrote: [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9 This caught my eye, Googling for it found a single instance, from 2019 on the list again about jobs on a node mysteriously dying. The resolution was (cou

Re: [slurm-users] Are SLURM_JOB_USER and SLURM_JOB_UID always constant and available

2020-05-20 Thread Christopher Samuel
On 5/20/20 7:23 pm, Kevin Buckley wrote: Are they set as part of the job payload creation, and so would ignore and node local lookup, or set as the job gets allocated to the various nodes it will run on? Looking at git, it's a bit of both: src/slurmd/slurmd/req.c: setenvf(&env, "SLUR

Re: [slurm-users] Nodes do not return to service after scontrol reboot

2020-06-16 Thread Christopher Samuel
On 6/16/20 8:16 am, David Baker wrote: We are running Slurm v19.05.5 and I am experimenting with the *scontrol reboot * command. I find that compute nodes reboot, but they are not returned to service. Rather they remain down following the reboot.. How are you using "scontrol reboot" ? We do:
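The exact command is cut off above; one form that brings nodes back into service after the reboot (a sketch, the node list and reason are placeholders) is:

    # Reboot each node once it is idle and resume it when slurmd re-registers.
    scontrol reboot ASAP nextstate=RESUME reason="kernel update" node[001-010]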

Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Christopher Samuel
On 7/7/20 5:57 pm, Jason Simms wrote: Failed to look up user weissp: No such process That looks like the user isn't known to the node. What do these say: id weissp getent passwd weissp Which version of Slurm is this? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Ber

Re: [slurm-users] Restart Job after sudden reboot of the node

2020-07-24 Thread Christopher Samuel
On 7/24/20 12:28 pm, Saikat Roy wrote: If SLURM restarts automatically, is there any way to stop it? If you would rather Slurm not start scheduling jobs when it is restarted then you can set your partitions to have `State=DOWN` in slurm.conf. That way should the node running slurmctld reboo

Re: [slurm-users] cgroup limits not created for jobs

2020-07-26 Thread Christopher Samuel
On 7/26/20 12:21 pm, Paul Raines wrote: Thank you so much.  This also explains my GPU CUDA_VISIBLE_DEVICES missing problem in my previous post. I've missed that, but yes, that would do it. As a new SLURM admin, I am a bit surprised at this default behavior. Seems like a way for users to game

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Christopher Samuel
On 8/6/20 10:13 am, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into

Re: [slurm-users] Current status of checkpointing

2020-08-14 Thread Christopher Samuel
On 8/14/20 6:17 am, Stefan Staeglich wrote: what's the current status of the checkpointing support in SLURM? There isn't any these days, there used to be support for BLCR but that's been dropped as BLCR is no more. I know from talking with SchedMD they are of the opinion that any current c

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: debug:  common_gres_set_env: unable to set env vars, no device files configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs? All the best, Chris -- Chris Samuel : http:/

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
On 10/8/20 3:48 pm, Sajesh Singh wrote: Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is in gres.conf and looks lik
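The example is cut off above; gres.conf entries of this general shape (device paths, GPU type and core ranges are placeholders) are what is being described:

    # One line per GPU device, with CPU-core affinity for each card.
    Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-11
    Name=gpu Type=v100 File=/dev/nvidia1 Cores=12-23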

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 4:18 pm, Sajesh Singh wrote: Thank you for the tip. That works as expected. No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware! Best of luck, Chris -- Chris Samuel : http://www.csamuel

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-19 Thread Christopher Samuel
On 10/19/20 7:15 pm, Kevin Buckley wrote: [...] Just out of interest though, when you built yours on CLE7.0 UP01, what provided the munge: the vanilla SLES munge, or a Cray munge? It's cray-munge for CLE7 UP01. Thanks for the explanation of what you've been running through! I forgot I do ha

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Christopher Samuel
On 10/20/20 12:49 am, Kevin Buckley wrote: only have, as listed before, Munge 0.5.13. I guess the question is (going back to your initial post): > error: Failed build dependencies: > munge-libs is needed by slurm-20.02.5-1.x86_64 Had you installed libmunge2 before trying this build?

Re: [slurm-users] [External] Limit usage outside reservation

2020-10-22 Thread Christopher Samuel
On 10/22/20 12:20 pm, Burian, John wrote: This doesn't help you now, but Slurm 20.11 is expected to have "magnetic reservations," which are reservations that will adopt jobs that don't specify a reservation but otherwise meet the restrictions of the reservation: Magnetic reservations are in
