[slurm-dev] Re: Thoughts on GrpCPURunMins as primary constraint?

2017-07-24 Thread Ryan Cox
artially practical (to ensure users explicitly requesting slow nodes instead of just dumping them on ancient Opterons). Also, each user gets their own Account, so the QoS Grp limits apply to each human separately. Accounts would also have absolute core limits. Thank you for your thoughts! Corey

[slurm-dev] Re: Job Submit Lua Plugin

2017-06-27 Thread Ryan Cox
eractively) SLURM version: 17.02.5, compiled from source (after installing Lua) using ./configure --prefix=/usr --sysconfdir=/etc/slurm Any guidance to get me up and running would be greatly appreciated! Thanks, Nathan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Ryan Cox
> > Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to some big number? That way the job memory limit will be the cgroup soft limit, and the cgroup hard limit, which is when the kernel will OOM-kill the job, would be "job_memory_limit * AllowedRamSpace", that is, some large value? -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
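The arithmetic behind that suggestion can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes AllowedRamSpace is given as a percentage of the job's allocated memory, which the message above does not state.

```python
def cgroup_memory_limits(job_memory_mb, allowed_ram_space_pct):
    """Return (soft_limit_mb, hard_limit_mb) for a job.

    Assumption: allowed_ram_space_pct is a percentage of the job's
    requested memory (100 means "exactly the request").
    """
    if job_memory_mb < 0 or allowed_ram_space_pct < 0:
        raise ValueError("memory and percentage must be non-negative")
    # The job's own request acts as the soft limit.
    soft_limit_mb = job_memory_mb
    # The hard limit, where the kernel would OOM-kill, scales with the percentage.
    hard_limit_mb = job_memory_mb * allowed_ram_space_pct / 100.0
    return soft_limit_mb, hard_limit_mb


if __name__ == "__main__":
    # A 4000 MB job with a deliberately large allowance.
    print(cgroup_memory_limits(4000, 1000))
```

With a large percentage the hard limit moves far above the request, so under this reading only the soft limit would normally come into play.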

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
I'm sure someone has already blazed this trail before, but this is how I am going about it. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
y and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox
22.2 20.4 17 16.9799 * denotes the node where the batch script executes (node 0) CPU usage is cumulative since the start of the job Ryan On 09/19/2016 11:13 AM, Ryan Cox wrote: We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox
We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node so it's not very scalable but it works well enough for us. Ryan On 09/18/2016 06:42 PM, Igor Yakushin wrote: ho

[slurm-dev] Re: squeue and nodelist format

2016-05-06 Thread Ryan Cox
searched the documentation and I just can’t seem to find any switch to enable that. Help me Obiwan Kenobi, you’re my only hope! -- Nick Eggleston Missouri S&T IT Research Support Services -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox
ces The George Washington University 725 21st Street Washington, DC 20052 Suite 211, Corcoran Hall == On Fri, Apr 15, 2016 at 1:07 PM, Ryan Cox <ryan_...@byu.edu> wrote: Did you try this: --reservation=root_13 O

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox
Did you try this: --reservation=root_13 On 04/15/2016 08:10 AM, Glen MacLachlan wrote: scontrol update not allowing jobs Dear all, Wrapping up a maintenance period and I want to run some test jobs before I release the reservation and allow regular user jobs to start running. I've modified th

[slurm-dev] Re: AssocGrp*Limits being considered for scheduling

2016-02-23 Thread Ryan Cox
Coincidentally, I asked about that yesterday in a bug report: http://bugs.schedmd.com/show_bug.cgi?id=2465. The short answer is to use SchedulerParameters=assoc_limit_continue that was introduced in 15.08.8. It only works if the Reason for the job is something like Assoc*Limit. Ryan On 02

[slurm-dev] Re: distribution for array jobs

2016-01-28 Thread Ryan Cox
SelectType = select/cons_res SelectTypeParameters = CR_CORE_MEMORY What am I missing to get more than one job to run on a node? Thanks in advance, Brian Andrus -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Ryan Cox
, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Batch job submission failed: Invalid account or account/partition combination specified

2015-09-08 Thread Ryan Cox
We have seen similar issues on 14.11.8 but haven't bothered to diagnose or report it. I think I've seen it twice so far out of dozens of new users. Ryan On 09/07/2015 09:16 AM, Loris Bennett wrote: Hi, This problem occurs with 14.11.8. A user I set up today got the following error when su

[slurm-dev] Re: Changing /dev file permissions for particular user

2015-06-24 Thread Ryan Cox
Be sure to test it first before trying anything else: https://stackoverflow.com/questions/18661976/reading-dev-cpu-msr-from-userspace-operation-not-permitted. We ran into this issue once when we had a "trusted" person and we couldn't easily grant him access to the MSRs. We couldn't find a goo

[slurm-dev] Re: concurrent job limit

2015-06-11 Thread Ryan Cox
horized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -- Ryan Cox Operations Director Fulton Supercom

[slurm-dev] Re: cgroup setup and cpuset issues

2015-06-10 Thread Ryan Cox
dvanced for your assistance. Jackie Scoggins -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
= Trey Dockendorf Systems Analyst I Texas A&M University Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu On

[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
demy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: GPU node allocation policy

2015-04-07 Thread Ryan Cox
iomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novos...@rutgers.edu- 973/972.0922 (2x0922) || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark `' On Apr 6, 2015, at 20:17, Ryan Cox wrote: Chris, Just have GPU users request the numbers of CPU cores that t

[slurm-dev] Re: GPU node allocation policy

2015-04-06 Thread Ryan Cox
Chris, Just have GPU users request the numbers of CPU cores that they need and don't lie to Slurm about the number of cores. If a GPU user needs 4 cores and 4 GPUs, have them request that. That leaves 20 cores for others to use. Ryan On 04/06/2015 03:43 PM, Christopher B Coffey wrote: H

[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Ryan Cox
On 01/21/2015 09:23 AM, Bill Wichser wrote: A user underneath gets the expected 0.009091 normalized shares since there are a lot of fairshare=1 users there. The user3 gets basically 25x this value as the fairshare for user3=25 Yet the normalized shares is actually MORE than the normalized

[slurm-dev] Re: GresTypes typo in docs

2015-01-06 Thread Ryan Cox
ht out by a typo on http://slurm.schedmd.com/gres.html where the example has GresType=gpu,bandwith rather than GresTypes=... Could you please fix the doc! BTW. Slurm was quite ungracious about having that bad entry in slurm.conf Regards, Gareth -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox
if you just want to see what was used, you can get the raw usage using sacct. For example, for a given job, you can do something like: > > sacct -X -a -j 1182128 --format Jobid,jobname,partition,account,alloccpus,state,exitcode,cputimeraw > > - > Gary Skouson > > >
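For scripting around the sacct call quoted above, a small Python sketch might look like this. The helper names are hypothetical, the flags and field list are copied from the message, and the core-hours conversion assumes cputimeraw is reported in seconds.

```python
import shlex

# Field list copied from the sacct example in the message above.
FIELDS = [
    "Jobid", "jobname", "partition", "account",
    "alloccpus", "state", "exitcode", "cputimeraw",
]


def sacct_command(job_id):
    """Build the argument list for the sacct call shown in the thread."""
    return ["sacct", "-X", "-a", "-j", str(job_id), "--format", ",".join(FIELDS)]


def core_hours(cputimeraw_seconds):
    """Convert a cputimeraw value to core-hours (assumes the value is in seconds)."""
    return cputimeraw_seconds / 3600.0


if __name__ == "__main__":
    # Print the command line instead of running it, so this works without Slurm.
    print(shlex.join(sacct_command(1182128)))
```

Passing the returned list to `subprocess.run` on a host with Slurm installed would execute the same query.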

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox
ovember 25, 2014 9:51 AM *To:* slurm-dev *Subject:* [slurm-dev] Re: [ sshare ] RAW Usage Thanks Ryan, Is this value stored anywhere in the SLURM accounting DB? I could not find any value for the JOB that corresponds to this RAW usage. Roshan -----

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Ryan Cox
Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens. Ryan On 11/25/2014 10:34 AM, Roshan Mathew wrote: Hello SLURM users, http://s

[slurm-dev] Re: How many accounts can SLURM support?

2014-11-19 Thread Ryan Cox
Dave, I have done testing on 5-6 year old hardware with 100,000 users randomly distributed in 10,000 accounts with semi-random depths with most being between 1-4 levels from root but some much deeper than that, plus 100,000 jobs pending. slurmctld startup time was really long but, after gett

[slurm-dev] Re: Non static partition definition

2014-10-30 Thread Ryan Cox
George, Wouldn't a QOS with GrpNodes=10 accomplish that? Ryan On 10/30/2014 11:47 AM, Brown George Andrew wrote: Hi, I would like to have a partition of N nodes without statically defining which nodes should belong to a partition and I'm trying to work out the best way to achieve this. Cu

[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions

2014-10-27 Thread Ryan Cox
Trey, I'm not sure why your jobs aren't starting. Someone else will have to answer that question. You can model an organizational hierarchy a lot better in 14.11 due to changes in Fairshare=parent for accounts. If you only want fairshare to matter at the research group and user levels but

[slurm-dev] RE: EXTERNAL: Re: question on multifactor priority plugin - fairshare basics

2014-10-16 Thread Ryan Cox
oping/wishing the values would be between 0.0 and 1.0, but I can work with 0.5 as the max value. It just means that I need to double the PriorityWeightFairshare factor in order to achieve the intended relative weighting between Fairshare, QOS, Partitions, JobSize, Age. Ed *From:*Ryan Cox [mailto:

[slurm-dev] Re: question on multifactor priority plugin - fairshare basics

2014-10-14 Thread Ryan Cox
I assume you are using the default fairshare algorithm since you didn't specify otherwise. F=2**(-U/S) where U is Effectv Usage (often displayed in documentation as UE) and S is Norm Shares. See http://slurm.schedmd.com/priority_multifactor.html under the heading "The SLURM Fair-Share Formula
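As a rough illustration of the formula F=2**(-U/S) quoted above, here is a minimal Python sketch. The function name and the sample values are made up for the example; only the formula itself comes from the message.

```python
def fairshare_factor(effective_usage, norm_shares):
    """Compute F = 2**(-U/S).

    effective_usage: U, the association's effective usage (0.0 to 1.0)
    norm_shares:     S, the association's normalized shares (must be > 0)
    """
    if norm_shares <= 0:
        raise ValueError("norm_shares must be positive")
    return 2.0 ** (-effective_usage / norm_shares)


if __name__ == "__main__":
    # Sample values only: an association holding 25% of the shares.
    for usage in (0.0, 0.25, 0.5):
        print(usage, fairshare_factor(usage, 0.25))
```

Under this formula the factor is 1.0 with no usage, 0.5 when usage exactly matches the normalized shares, and it halves again each time usage grows by another S.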

[slurm-dev] Re: Authentication and invoking slurm commands from web app

2014-10-02 Thread Ryan Cox
please notify us by e-mail or by telephone (+ 34 690207492). Any reproduction of this e-mail by whatsoever means and any transmission or dissemination thereof to other persons is prohibited. It should be deleted immediately from your system. Idiria Sociedad Limitada reserves the right to take legal action against any persons unlawfully gaining access to the content of any external message it has emitted. For additional information, please visit our website http://www.idiria.com -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Submitting to multiple partitions with job_submit plugin (Was: Implementing fair-share policy using BLCR)

2014-09-29 Thread Ryan Cox
On 09/23/2014 11:27 AM, Trey Dockendorf wrote: Has anyone used the Lua job_submit plugin and also allowed multiple partitions? I'm not even sure what the partition value would be in the Lua code when a job is submitted with "--partition=general,background", for example. We do. We use the a

[slurm-dev] Fair Tree Q&A (previously Level-Based)

2014-09-26 Thread Ryan Cox
It was great to see so many of you at Slurm User Group Meeting. We received several questions after our presentation and wanted to clarify some things. One of our "possible concerns" was about a "tiny user in a very active account". To clarify, this is the scenario we were mentioning. We

[slurm-dev] Re: Dynamic partitions on Linux cluster

2014-08-14 Thread Ryan Cox
So is there a way to achieve this using the config file? Do I have to use accounting to enforce the limits? Or is there another way that I don't see? Best regards, Uwe Sauter -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Customized error messages from job_submit.lua?

2014-08-07 Thread Ryan Cox
slurm.user_msg("Test message") return slurm.ERROR On 08/07/2014 04:40 AM, Bjørn-Helge Mevik wrote: I read in the NEWS for 14.03.0pre1: -- Add mechanism for job_submit plugin to generate error message for srun, salloc or sbatch to stderr. New argument added to job_submit function in

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox
All, There has been more conversation on http://bugs.schedmd.com/show_bug.cgi?id=858. It might be good to post future comments there so we have just one central location for everything. No worries if you'd rather reply on the list. Once a solution is ready I'll post something to the list

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox
Thanks. I can certainly call it that. My understanding is that this would be a slightly different implementation from Moab/Maui, but I don't know those as well so I could be wrong. Either way, the concept is similar enough that a more recognizable term might be good. Does anyone else have

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox
ore flexible than DRF in that it allows arbitrary charge rates to be specified, I'm not sure it makes sense to specify rates different from the DRF ones? Or if one does specify different rates, it might end up breaking some of the fairness properties that are described in the DRF paper

[slurm-dev] Re: All cores being allocated / -n ignored

2014-07-30 Thread Ryan Cox
: #SBATCH -J NAG_int_tip3p_rep2 #SBATCH -o NAG_int_tip3p_rep2.out #SBATCH -e NAG_int_tip3p_rep2.err #SBATCH -n 2 #SBATCH -p debug #SBATCH -D /home/gordon/cpgh89/autodock/NAG_DNAP #SBATCH -w riddley Can anyone explain what I'm doing in this setup? -- max(∫(εὐδαιμονία)dt) -- Ryan Cox Opera

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-29 Thread Ryan Cox
o take. The patch currently implements charging for CPUs, memory (GB), and nodes. Note: I saw a similar idea in a bug report from the University of Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858. Ryan On 07/25/2014 10:31 AM, Ryan Cox wrote: Bill and Don, We have wondered about this

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Ryan Cox
sed and so the process is never ending. Another solution is to simply trust the users and just keep reminding them about allocations. They are usually a smart bunch and are quite creative when it comes to getting jobs to run! So maybe I am concerned over nothing at all and things will just

[slurm-dev] Re: fairshare

2014-07-15 Thread Ryan Cox
e value is correct and here is why. Or do I just need to figure out a database query to cull this information? Thanks, Bill -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: installing slurm on CentOS 5.10

2014-06-24 Thread Ryan Cox
ored in an electronic records management system. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] LEVEL_BASED prioritization method

2014-06-20 Thread Ryan Cox
apply to our use case), see http://tech.ryancox.net/2014/06/problems-with-slurm-prioritization.html. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Fairshare=parent on an account: What should it do?

2014-06-10 Thread Ryan Cox
e nice for the grad student to have administrative control over the subaccount since he actually knows the students but not have it affect priority calculations. Ryan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Requeue and resubmit after networking issue

2014-05-19 Thread Ryan Cox
/ http://twitter.com/vlsci -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: How to spread jobs among nodes?

2014-05-08 Thread Ryan Cox
> 'sched/builtin', 'SelectTypeParameters' => 'CR_Core_Memory', 'SelectType'=> 'select/cons_res', -- Perfection is just a word I use occasionally with mustard. --Atom Powers-- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Need Help Understanding Cgroup Swapiness

2014-04-21 Thread Ryan Cox
group feature to start swapping out the exceeding 50 MB or so... they would actually fit in the swap area and the job should not be killed... What am I missing here? Should the code itself be aware of the given "mem.limit=9000MB"? Thanks for any explanation. MG -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: SLRUM as a load balancer for interactive use

2014-03-25 Thread Ryan Cox
opment Manager Computing Platforms CSC - IT Center for Science Ltd. E-Mail: olli-pekka.le...@csc.fi Tel: +358 50 381 8604 skype: oplehto // twitter: ople -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University http://tech.ryancox.net

[slurm-dev] Re: Job being canceled due to time limits

2013-09-05 Thread Ryan Cox
limit is being imposed about 5 minutes into the job. Thanks -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: job steps not properly identified for jobs using step_batch cgroups

2013-08-12 Thread Ryan Cox
r research computing center university of chicago 773.702.1104 -- andy wettstein hpc system administrator research computing center university of chicago 773.702.1104 -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: cgroups usage

2013-08-06 Thread Ryan Cox
mance. Is this amount realistic? Is there a more efficient method to control memory usage on nodes which are shared? Thank you for any advice, Kevin -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Ryan Cox
vements. Thanks. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Listing jobs running on an arbitrary host?

2013-06-24 Thread Ryan Cox
' to expand each > NodeList. > > This gets... suboptimal at installations with large numbers of jobs in > flight. Is there a better way? > > john -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox
> > On 06/19/2013 12:15 PM, Ryan Cox wrote: >> Paul, >> >> We were discussing this yesterday due to a user not limiting the amount >> of jobs hammering our storage. A QOS with a GrpJobs limit sounds like >> the best approach for both us and you. >> >> R

[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox
first before > putting a nail in it. From my look at the documentation I don't see > anyway to do this other than what I stated above. > > -Paul Edmon- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Creating a submission node

2013-06-05 Thread Ryan Cox
e: > Hi, > I am configuring a cluster with computing nodes and two administration > nodes (with slurmctld and slurmdbd). But I want users to use another > server for job submission. How can I do that? > It may be easy to do, but I can't find how in the documentation. >

[slurm-dev] Re: untracked processes

2013-02-21 Thread Ryan Cox
cess IDs(?) In any event, I'm guessing I'm not the first >> person to run into this. Is there a recommended solution to >> configure SLURM to track codes like this? >> >> Thanks, >> ~Mike C. >> >> -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University