Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Mahmood Naderan
Please see the latest update:

# for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
RealMemory=64259 AllocMem=1024 FreeMem=57163 Sockets=32 Boards=1
RealMemory=120705 AllocMem=1024 FreeMem=97287 Sockets=32 Boards=1
RealMem
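The loop above can be condensed with sinfo's long-format fields (a sketch; the field names are from sinfo's --Format option, and the node name "hpc" is taken from this thread):

```shell
# One line per node: name, total memory (MB), allocated memory, free memory
sinfo -N -h -O NodeList,Memory,AllocMem,FreeMem

# Or pull the same fields for a single node out of scontrol:
scontrol show node hpc | grep -oE '(RealMemory|AllocMem|FreeMem)=[0-9]+'
```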

Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Sean Crosby
Hi Mahmood, Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That means 6 CPUs are being used on node hpc. Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). In total, if it was running, that would require 11 CPUs on node hpc. But hpc only has 1
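The arithmetic here is worth making explicit (a sketch using the numbers from this thread; hpc's 10-core count comes from the follow-up message):

```shell
running=6    # CPUs the running job already holds on node hpc
queued=5     # CPUs the queued job would additionally need on hpc
cores=10     # physical cores available on hpc
needed=$((running + queued))
echo "needed=$needed of $cores"   # 11 of 10, so the queued job must wait
```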

Re: [slurm-users] SLURM_TMPDIR

2019-12-17 Thread Angelines
Hi Tina, The problem was that slurm was able to create the user directory, but later it wasn't able to create the job_id directory... In the prolog script I added a chown command and it worked! In the epilog script slurm deletes the job_id directory, so it works fine for me. Thanks!
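A minimal sketch of the fix described above, assuming a per-user, per-job scratch layout under /scratch (the base path and the exact layout are assumptions; adapt them to your SLURM_TMPDIR setup):

```shell
#!/bin/bash
# Prolog sketch (runs as root on the compute node before the job starts).
BASE=/scratch                               # assumed scratch base path
DIR="$BASE/$SLURM_JOB_USER/$SLURM_JOB_ID"
mkdir -p "$DIR"
# Without this chown, the user's processes cannot create anything inside
# the root-owned user directory, which was the failure described above.
chown -R "$SLURM_JOB_USER" "$BASE/$SLURM_JOB_USER"
```

A matching epilog would then remove "$BASE/$SLURM_JOB_USER/$SLURM_JOB_ID" after the job ends, as the poster describes.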

Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Marcus Wagner
Dear Mahmood, I'm not aware of any nodes that have 32, or even 10, sockets. Are you sure you want to use the cluster like that? Best, Marcus On 12/17/19 10:03 AM, Mahmood Naderan wrote: Please see the latest update # for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; do

Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Mahmood Naderan
>Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That means 6 CPUs are being used on node hpc.
>Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). In total, if it was running, that would require 11 CPUs on node hpc. But hpc only has 10 cores, so it

Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Sean Crosby
What services did you restart after changing the slurm.conf? Did you do an scontrol reconfigure? Do you have any reservations? scontrol show res Sean On Tue, 17 Dec. 2019, 10:35 pm Mahmood Naderan, <mahmood...@gmail.com> wrote: >Your running job is requesting 6 CPUs per node (4 nodes, 6
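For reference, the two checks suggested above are standard scontrol subcommands (no assumptions beyond a working slurmctld):

```shell
# Push a changed slurm.conf to the running daemons without restarting them:
scontrol reconfigure

# List reservations, which can silently withhold nodes from regular jobs:
scontrol show reservations
```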

[slurm-users] Two 19.05.4 build questions

2019-12-17 Thread Wiegand, Paul
Greetings, We are upgrading from 18.x to 19.05.4, but the build process appears a bit different now for us. 1) There doesn't appear to be a 19.x OSU mvapich2 patch as there was for previous Slurm releases. Should we use the previous patch, or not patch at all? 2) The acct_gather_profile_hdf5 plugin appears

Re: [slurm-users] slurmd.service fails to register

2019-12-17 Thread William Brown
These are the tests that we use. The following steps can be performed to verify that the software has been properly installed and configured. These should be done as a non-privileged user:
• Generate a credential on stdout: $ munge -n
• Check if a credential can be loca
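The truncated checklist above follows the standard MUNGE verification sequence; a sketch of the full set of steps (the remote host name is a placeholder):

```shell
munge -n                          # generate a credential on stdout
munge -n | unmunge                # check that a credential can be decoded locally
munge -n | ssh somehost unmunge   # check that it decodes on a remote node
remunge                           # run a quick encode/decode benchmark
```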

Re: [slurm-users] Question about memory allocation

2019-12-17 Thread Mahmood Naderan
>Did you do an scontrol reconfigure? Thank you. That solved the issue. Regards, Mahmood

[slurm-users] Limit output file size with lua script

2019-12-17 Thread sysadmin.caos
Hi, I would like to know if it is possible to limit the size of the output file generated by a job using a lua script. I have looked at the "job_descriptor" structure in slurm.h but I have not seen anything that would limit that. ...I need this because a user submitted a job that generated a 500 GB out
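If job_descriptor really exposes no such field, one workaround sketch (an assumption, not a vetted recipe) is a file-size ulimit inside the batch script: any single file the job's processes write, including the redirected stdout, is then capped, and a process exceeding the cap receives SIGXFSZ.

```shell
#!/bin/bash
#SBATCH --output=job.out
# Cap the size of any file this job's processes write to 1 GiB.
# ulimit -f counts 512-byte blocks: 2097152 * 512 B = 1 GiB.
ulimit -f 2097152
./my_app    # hypothetical application; replace with the real command
```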

[slurm-users] Issues with HA config and AllocNodes

2019-12-17 Thread Dave Sizer
Hello friends, We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and one backup controller. However, we are observing that when the backup takes over, for some reason AllocNodes gets set to “none” on all of our partitions. We can remedy this by manually setting A
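The truncated remedy presumably restores AllocNodes by hand; a sketch of what that might look like (the partition name is a placeholder, and the scontrol update route is an assumption):

```shell
# For each affected partition, reopen it to allocations from all nodes:
scontrol update PartitionName=batch AllocNodes=ALL
```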

Re: [slurm-users] Issues with HA config and AllocNodes

2019-12-17 Thread Dave Sizer
Thanks for the response. I have confirmed that the slurm.conf files are the same and that StateSaveDir is working; we see logs like the following on the backup controller:
Recovered state of 9 partitions
Recovered JobId=124 Assoc=6
Recovered JobId=125 Assoc=6
Recovered JobId=126 Assoc=6
Recovered

[slurm-users] Distinguishing Job Accounting vs Job Completion data?

2019-12-17 Thread E.M. Dragowsky
Greetings -- From the Accounting and Resource Limits documentation, there is the suggestion to make use of both 'Job Accounting' and 'Job Completion' data. There is also the following statement in the Slurm JobComp Configuration: If you are running with the accounting storage plugin, use o
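As a concrete point of comparison, accounting-plugin data is queried with sacct, while job-completion records go to whatever destination the JobComp plugin is configured with (the job ID and log path below are placeholders):

```shell
# Accounting (slurmdbd / accounting_storage plugin) view of a finished job:
sacct -j 124 --format=JobID,JobName,State,Elapsed,MaxRSS

# Job completion (JobComp plugin) records land in the configured
# destination instead, e.g. with jobcomp/filetxt a flat file:
grep 'JobId=124' /var/log/slurm/jobcomp.log
```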