Re: [slurm-users] scancel problem
Try running with "srun", not "mpirun" Hello everybody, i submit a job with sbatch command (sbatch myprog.sh). My prog.sh is = #!/bin/bash #SBATCH --partition=part2 #SBATCH --ntasks=20 #SBATCH --nodelist= #SBATCH --cpus-per-task=1 #SBATCH --mem= # Memory per node specification is in MB. It is optional. # The default limit is 3000MB per core. #SBATCH --job-name="test" #SBATCH --output=test.output #SBATCH --mail-user=t...@out.gr #SBATCH --mail-type=ALL mpirun -c 20 /home/me/projects/EXP00/opa = The submmited id is 5402. When i cancel the job by the command "scancel 5402" i notice that the job is deleted from the squeue ( the job is not shown in squeue) but making an htop at the node where it was running i see that it continues to be running Moreover, another user submiited his job, which was allocated at the same node The node has 20 cores... What is happenning here? Slurm Version slurm 16.05.9
[slurm-users] Question about networks and connectivity
Hello,

Really, I don't know if my question is for this mailing list... but I will explain my problem and then you can answer whatever you think ;)

I manage a SLURM cluster with 3 networks:

- a gigabit network used for NFS shares (192.168.11.X). On this network my nodes are "node01, node02..." in /etc/hosts.
- a gigabit network used by SLURM (all my nodes are added to the SLURM cluster using this network and the hostnames assigned via /etc/hosts to this second network) (192.168.12.X). On this network my nodes are "clus01, clus02..." in /etc/hosts.
- an Infiniband network (192.168.13.X). On this network my nodes are "infi01, infi02..." in /etc/hosts.

When I submit an MPI job, the SLURM scheduler offers me "n" nodes called, for example, clus01 and clus02, and there my application runs perfectly, using the second network for SLURM connectivity and the first network for NFS (and NIS) shares. By default, as SLURM connectivity is on the second network, my nodelist contains nodes called "clus0x".

However, now I have a "new" problem. I want to use the third network (Infiniband), but as SLURM offers me "clus0x" (second network), my MPI application runs OK but uses the second network. This problem also occurs, for example, with the NAMD (charmrun) application.

So, my questions are: is this SLURM configuration correct for using both networks? If the answer is "no", how do I configure SLURM for my purpose? And if the answer is "yes", how can I ensure the connections in my SLURM job go over Infiniband?

Thanks a lot!!
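A general observation that may help, not specific to this setup: the hostnames Slurm hands out and the fabric MPI uses for its traffic are largely independent, because the MPI library selects its transport at run time. With Open MPI, for example, TCP traffic can be pinned to the Infiniband subnet (an illustrative command, assuming Open MPI):

    mpirun --mca btl_tcp_if_include 192.168.13.0/24 ./my_mpi_app

Here ./my_mpi_app is a placeholder. When the verbs stack is installed, Open MPI will usually prefer native Infiniband over TCP on its own; for charmrun/NAMD the equivalent knob is application-specific.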
[slurm-users] Limit output file size with lua script
Hi, I would like to know if it is possible to limit the size of the output file generated by a job using a lua script. I have looked at the "job_descriptor" structure in slurm.h but I have not seen anything for limiting that. I need this because a user submitted a job that generated a 500 GB output file... of no value... Thanks.
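Not a lua answer, just a possible stopgap, untested: job_descriptor carries no output-size field, but file size can be capped at the shell level inside the batch script before the payload runs, e.g.:

    ulimit -f 10485760

In bash the units are 1024-byte blocks, so that is roughly 10 GB; a process exceeding the limit receives SIGXFSZ instead of filling the disk.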
[slurm-users] Job not cancelled after "TimeLimit" exceeded
Hi, my SLURM cluster has a partition configured with a "TimeLimit" of 8 hours. Now, a job has been running for 9h30m and has not been cancelled. During these nine and a half hours, a script has executed "scontrol update partition=mypartition state=down" to disable this partition (it is an educational cluster and student classes start at 8:00). Why hasn't my job been cancelled? There is no log entry at the SLURM controller that explains this behaviour. Thanks.
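Two guesses rather than a confirmed diagnosis: marking a partition "down" prevents new jobs from starting but does not, by itself, kill jobs that are already running; and if slurm.conf sets OverTimeLimit, jobs may run that many minutes past their limit (OverTimeLimit=UNLIMITED disables time-limit enforcement entirely). Checking the live value, e.g. with:

    scontrol show config | grep OverTimeLimit

seems a reasonable first step.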
[slurm-users] Limit size submission script
Hello, a researcher using a SLURM cluster (version 17.02.7) has created a submit script whose size is 8 MB (yeah!!). I have read that SLURM has a size limit of 4 MB... Can this limit be changed? Thanks.
Re: [slurm-users] Limit size submission script
Thanks, I didn't know about that parameter!!!

On 09/11/2017 at 14:45, Brian W. Johanson wrote:

man slurm.conf:

    SchedulerParameters
        The interpretation of this parameter varies by SchedulerType.
        Multiple options may be comma separated.

        max_script_size=#
            Specify the maximum size of a batch script, in bytes. The
            default value is 4 megabytes. Larger values may adversely
            impact system performance.

On 11/09/2017 03:56 AM, sysadmin.caos wrote:

Hello, a researcher using a SLURM cluster (version 17.02.7) has created a submit script whose size is 8 MB (yeah!!). I have read that SLURM has a size limit of 4 MB... Can this limit be changed? Thanks.
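So, per the man page excerpt above, raising the limit to e.g. 16 MB would be a single slurm.conf line like the following (the value is just an illustration):

    SchedulerParameters=max_script_size=16777216

followed by restarting or reconfiguring slurmctld.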
Re: [slurm-users] Limit size submission script
The researcher uses python to generate the submission script and then submits that new script... I don't know more... but before Brian wrote about "max_script_size", I had already rewritten the 8 MB script (produced by a python run) as a simple 250 KB script (with 2 "for" loops ;) ).

On 09/11/2017 at 15:22, Loris Bennett wrote:

Hi, I'd be interested to know in what circumstances a multiple-MB-sized batch script would be sensible and/or necessary. Cheers, Loris

"Brian W. Johanson" writes:

man slurm.conf:

    SchedulerParameters
        The interpretation of this parameter varies by SchedulerType.
        Multiple options may be comma separated.

        max_script_size=#
            Specify the maximum size of a batch script, in bytes. The
            default value is 4 megabytes. Larger values may adversely
            impact system performance.

On 11/09/2017 03:56 AM, sysadmin.caos wrote:

Hello, a researcher using a SLURM cluster (version 17.02.7) has created a submit script whose size is 8 MB (yeah!!). I have read that SLURM has a size limit of 4 MB... Can this limit be changed? Thanks.
[slurm-users] srun not allowed in a partition
Hello, I would like to configure SLURM with two partitions: one called "batch.q" only for batch jobs, and one called "interactive.q" only for interactive jobs. What I want to get is a batch partition that doesn't allow "srun" commands from the command line, and an interactive partition only for "srun" commands. Is this possible in SLURM? Thanks.
Re: [slurm-users] srun not allowed in a partition
I'm trying to compile SLURM-17.02.7 with "lua" support by executing "./configure && make && make contribs && make install", but make does nothing in src/plugins/job_submit/lua and I don't know why... How do I have to compile that plugin? The rest of the plugins compile with no problems (defaults, all_partitions, partition,...). Thanks
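One common cause, though it cannot be confirmed from this message alone: configure builds the job_submit/lua plugin only if it detects the lua development files. Checking how lua was detected, e.g.:

    grep -i lua config.log

and installing the lua devel package before re-running ./configure is a reasonable first step.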
[slurm-users] job_submit.lua script
Hello, I'm writing my own "job_submit.lua" to control in which partition a user can run "srun" and how many CPUs and nodes are allowed. I want to allow "srun" only in partition "interactive", with only one core and one node. I have written this script but I'm getting errors:

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local partition = "interactive"
        if ((job_desc.script == nil or job_desc.script == '') and job_desc.partition ~= partition) then
            slurm.log_info("slurm_job_submit: interactive job submitted by user_id:%d to partition:%s rejected", job_desc.user_id, job_desc.partition)
            return slurm.FAILURE
        end
        local max = 1
        if (job_desc.cpus_per_task > max or job_desc.max_cpus > max or job_desc.max_nodes > max) then
            slurm.log_user("slurm_job_submit: parameter error %s %s %u", job_desc.cpus_per_task, job_desc.max_cpus, job_desc.max_nodes)
            return slurm.FAILURE
        end
        return slurm.SUCCESS
    end

It seems the fields "cpus_per_task" and "max_cpus" are not being printed correctly ("max_nodes" is all right), because after running "srun" like this:

    srun -p interactive.q -N 1 -n 1 -w aolin15 --pty bash

...I get this error:

    srun: error: slurm_job_submit: parameter error 65534 4294967294 1

The field "cpus_per_task" is uint16_t, while "max_cpus" and "max_nodes" are uint32_t... but I have tried with "%u" and it didn't work... Could anybody help me? Thanks.
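The numbers themselves point at the explanation, so this looks like unset fields rather than a printing problem: in slurm.h, fields the user did not set hold sentinel values (NO_VAL16 = 0xfffe = 65534 for uint16_t, NO_VAL = 0xfffffffe = 4294967294 for uint32_t), and an srun without --cpus-per-task leaves cpus_per_task at exactly 65534, with max_cpus likewise "not set". A minimal sketch that filters those sentinels out before comparing, under that assumption:

    -- sentinel values from slurm.h; the lua plugin does not export them,
    -- so they are redefined here for readability
    local NO_VAL16 = 65534        -- 0xfffe, "not set" for uint16_t fields
    local NO_VAL32 = 4294967294   -- 0xfffffffe, "not set" for uint32_t fields

    -- true only when the user explicitly set the field above the limit
    local function over_limit(value, no_val, max)
        return value ~= nil and value ~= no_val and value > max
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local max = 1
        if over_limit(job_desc.cpus_per_task, NO_VAL16, max)
                or over_limit(job_desc.max_cpus, NO_VAL32, max)
                or over_limit(job_desc.max_nodes, NO_VAL32, max) then
            slurm.log_user("slurm_job_submit: at most " .. max ..
                           " node and " .. max .. " CPU allowed")
            return slurm.FAILURE
        end
        return slurm.SUCCESS
    end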
Re: [slurm-users] job_submit.lua script
My purpose with the job_submit.lua script is to limit "srun" to no more than one node and one CPU; in other words, to "srun -N 1 -n 1". For this reason, my script compares those values in an "if":

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local partition = "interactive"
        if ((job_desc.script == nil or job_desc.script == '') and job_desc.partition ~= partition) then
            slurm.log_info("slurm_job_submit: interactive job submitted by user_id:%d to partition:%s rejected", job_desc.user_id, job_desc.partition)
            return slurm.FAILURE
        end
        local max = 1
        if (job_desc.cpus_per_task > max or job_desc.max_cpus > max or job_desc.max_nodes > max) then
            slurm.log_user("slurm_job_submit: parameter error %s %s %u", job_desc.cpus_per_task, job_desc.max_cpus, job_desc.max_nodes)
            return slurm.FAILURE
        end
        return slurm.SUCCESS
    end

I understand that:

    -N 1 --> job_desc.max_nodes = 1
    -n 1 --> job_desc.max_cpus = 1

Am I wrong? Thanks.
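My reading of slurm.h, worth verifying against this exact Slurm version: "-N 1" does set max_nodes, but "-n 1" sets num_tasks rather than max_cpus, which stays at its "not set" sentinel (4294967294) unless something fills it in explicitly. A sketch of a task check closer to that intent, under this assumption:

    -- a sketch, assuming srun -n populates job_desc.num_tasks,
    -- not job_desc.max_cpus
    local NO_VAL32 = 4294967294   -- "not set" sentinel for uint32_t fields

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.num_tasks ~= nil and job_desc.num_tasks ~= NO_VAL32
                and job_desc.num_tasks > 1 then
            slurm.log_user("slurm_job_submit: only one task allowed")
            return slurm.FAILURE
        end
        return slurm.SUCCESS
    end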
[slurm-users] Limit job_submit.lua script for only srun
Hello, I have written my own job_submit.lua script to limit "srun" executions to one processor, one task and one node. If I test it with "srun", all works fine. However, if I now try to run an sbatch job with "-N 12" or "-n 2", job_submit.lua is also applied and my job is rejected because I'm requesting more than one task and more than one node. So, is it possible for the lua script to act only when the user runs "srun" and not "sbatch"? I have been reading "typedef struct job_descriptor" in the slurm/slurm.h file, but there is no field that keeps the command the user ran on the command line. Thanks.
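One approach reuses the test already present earlier in this thread: batch jobs carry their script in job_desc.script, while interactive srun submissions normally arrive with it nil or empty. Applying the limits only when no script is present would look roughly like this (a sketch under that assumption, not a verified recipe):

    local NO_VAL32 = 4294967294   -- "not set" sentinel for uint32_t fields

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- assumption, as used earlier in this thread: sbatch jobs always
        -- carry a batch script, interactive srun submissions do not
        if job_desc.script ~= nil and job_desc.script ~= '' then
            return slurm.SUCCESS   -- sbatch job: leave it alone
        end
        -- interactive job: enforce one node and one task
        if (job_desc.max_nodes ~= nil and job_desc.max_nodes ~= NO_VAL32
                and job_desc.max_nodes > 1)
           or (job_desc.num_tasks ~= nil and job_desc.num_tasks ~= NO_VAL32
                and job_desc.num_tasks > 1) then
            slurm.log_user("slurm_job_submit: interactive jobs are limited to 1 node and 1 task")
            return slurm.FAILURE
        end
        return slurm.SUCCESS
    end

    -- the plugin also expects a modify hook; a pass-through stub:
    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end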
[slurm-users] sacct does not show user
Hello, when I run "sacct", the output is this:

    JobID    JobName     Partition  Account  AllocCPUS  State      ExitCode
    -------- ----------- ---------- -------- ---------- ---------- --------
    [...]
    2810     bas         nodo.q     (null)   0          FAILED     2:0
    2811     bash        nodo.q     (null)   0          COMPLETED  0:0
    2812     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2813     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2814     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2815     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2816     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2817     bash        nodo.q     (null)   0          COMPLETED  0:0
    2807     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2808     ModelFLAM+  nodo.q     (null)   0          CANCELLED  0:0
    2818     ModelRepa+  nodo.q     (null)   0          CANCELLED  0:0
    2819     bash        nodo.q     (null)   0          COMPLETED  0:0
    2820     ModelRepa+  nodo.q     (null)   0          PENDING    0:0
    [...]

It seems the "Account" column always shows "(null)". Is this normal, or does my SLURM have a wrong configuration? Thanks.
Re: [slurm-users] sacct does not show user
I'm using AccountingStorageType=accounting_storage/filetxt because I'm running some tests. With "filetxt", can I get the "account" (username) with sacct?
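As far as I know (worth checking against the accounting documentation), the filetxt storage plugin records only basic per-job data and knows nothing about accounts: account/user associations are a feature of the slurmdbd + MySQL setup, where they are created with sacctmgr, for example:

    sacctmgr add account research
    sacctmgr add user alice account=research

The names above are only illustrations. With filetxt, an Account column full of "(null)" is therefore expected behaviour rather than a misconfiguration.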
[slurm-users] Accounting not recording jobs
Hello, after configuring SLURM-17.11.5 with accounting/mysql, it seems the database is not recording any job. If I run "sacct -", I get this output:

    sacct: Jobs eligible from Tue May 08 00:00:00 2018 - Now
    sacct: debug: Options selected: opt_completion=0 opt_dup=0 opt_field_list=(null) opt_help=0 opt_allocs=0
    sacct: debug3: Trying to load plugin /soft/slurm-17.11.5/lib/slurm/accounting_storage_slurmdbd.so
    sacct: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
    sacct: debug3: Success.
    sacct: debug4: Accounting storage SLURMDBD plugin loaded
    sacct: debug3: Trying to load plugin /soft/slurm-17.11.5/lib/slurm/auth_munge.so
    sacct: debug: Munge authentication plugin loaded
    sacct: debug3: Success.
    sacct: debug: slurmdbd: Sent PersistInit msg
    sacct: debug2: Clusters requested: q50004
    sacct: debug2: Userids requested: all
    JobID    JobName  Partition  Account  AllocCPUS  State  ExitCode
    -------- -------- ---------- -------- ---------- ------ --------
    sacct: debug: slurmdbd: Sent fini msg

However, I suppose the job I submitted (and which finished) a minute ago should appear. Why does it not appear?

My slurmdbd.conf is:

    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveSteps=no
    ArchiveSuspend=no
    AuthInfo=/var/run/munge/munge.socket.2
    AuthType=auth/munge
    DbdHost=localhost
    DebugLevel=4
    PurgeEventAfter=12month
    PurgeJobAfter=12month
    PurgeResvAfter=12month
    PurgeStepAfter=12month
    PurgeSuspendAfter=12month
    LogFile=/var/log/slurmdbd.log
    PidFile=/var/tmp/slurm/slurmdbd.pid
    SlurmUser=slurm
    StorageHost=localhost
    StorageLoc=slurmdb
    StoragePass=slurm
    StorageType=accounting_storage/mysql
    StorageUser=slurm

And my slurm.conf is:

    [...]
    # LOGGING AND ACCOUNTING
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageLoc=/var/log/slurm/accounting
    JobCompType=jobcomp/filetxt
    JobCompLoc=/var/log/slurm/job_completions
    ClusterName=Q50004
    JobAcctGatherType=jobacct_gather/linux
    SlurmctldDebug=4
    SlurmctldLogFile=/var/log/slurmdctl.log
    SlurmdDebug=4
    SlurmdLogFile=/var/log/slurmd.log
    [...]

My last job appears in the file /var/log/slurm/accounting... but I don't understand why the job appears there when I have configured accounting with "AccountingStorageType=accounting_storage/slurmdbd".

Thanks.
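Two suggestions, offered as guesses rather than a confirmed diagnosis. First, slurmdbd only records jobs for clusters that have been registered in its database; if that step was skipped, registering the cluster once and restarting slurmctld may be all that is missing:

    sacctmgr add cluster q50004

Second, AccountingStorageLoc is a setting for the filetxt plugin, not for slurmdbd, so a job showing up in /var/log/slurm/accounting hints that an older filetxt configuration is still in effect somewhere, for example because slurmctld was not restarted after the storage type was changed. With accounting_storage/slurmdbd that AccountingStorageLoc line can simply be removed.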