Re: [slurm-users] scancel problem

2018-09-21 Thread sysadmin.caos

Try running with "srun", not "mpirun"
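
For example (just a sketch, assuming the MPI library was built with
Slurm/PMI support), the launch line in myprog.sh would become:

=
# Replacing the mpirun line: srun-launched tasks are tracked by slurmd,
# so scancel can signal and clean them up.
srun -n 20 /home/me/projects/EXP00/opa
=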


Hello everybody,
I submit a job with the sbatch command (sbatch myprog.sh). My myprog.sh is
=
#!/bin/bash
#SBATCH --partition=part2
#SBATCH --ntasks=20
#SBATCH --nodelist=
#SBATCH --cpus-per-task=1
#SBATCH --mem=
# Memory per node specification is in MB. It is optional.
# The default limit is 3000MB per core.
#SBATCH --job-name="test"
#SBATCH --output=test.output
#SBATCH --mail-user=t...@out.gr
#SBATCH --mail-type=ALL

mpirun -c 20 /home/me/projects/EXP00/opa
=

The submitted job ID is 5402. When I cancel the job with the command "scancel
5402", I notice that the job is removed from squeue (the job is no longer
shown in squeue), but running htop on the node where it was running I see
that it is still running.
Moreover, another user submitted his job, which was allocated to the same
node... The node has 20 cores...
What is happening here?

Slurm version: 16.05.9





[slurm-users] Question about networks and connectivity

2019-12-05 Thread sysadmin.caos

  
Hello,

Really, I don't know if my question belongs on this mailing list... but
I will explain my problem and then you can answer whatever
you think ;)

I manage a SLURM cluster composed of 3 networks:

  a gigabit network used for NFS shares (192.168.11.X). In this
network, my nodes are "node01, node02..." in /etc/hosts.

  a gigabit network used by SLURM (all my nodes are added to the
SLURM cluster using this network and the hostname assigned via
/etc/hosts to this second network) (192.168.12.X). In this
network, my nodes are "clus01, clus02..." in /etc/hosts.

  an InfiniBand network (192.168.13.X). In this network, my nodes
are "infi01, infi02..." in /etc/hosts.

When I submit an MPI job, the SLURM scheduler offers me "n" nodes
called, for example, clus01 and clus02, and there my application
runs perfectly, using the second network for SLURM connectivity and
the first network for NFS (and NIS) shares. By default, as SLURM
connectivity is on the second network, my nodelist contains nodes
called "clus0x".
However, now I have a "new" problem. I want to use the third
network (InfiniBand), but as SLURM offers me "clus0x" (second
network), my MPI application runs OK but over the second network. This
problem also occurs, for example, with the NAMD (charmrun) application.


So, my questions are:

  Is this SLURM configuration correct for using both networks?

  If the answer is "no", how do I configure SLURM for my purpose?
  But if the answer is "yes", how can I make sure the connections in my
SLURM job go over InfiniBand?
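
I suppose that with OpenMPI something like this could force the third
network (just a sketch; the flag and the use of IPoIB here are my
assumptions, not something I have confirmed for my setup):

  # route MPI traffic over the 192.168.13.X (InfiniBand/IPoIB) subnet
  mpirun --mca btl_tcp_if_include 192.168.13.0/24 ./my_mpi_app
  # or check whether OpenMPI's native InfiniBand transport is available
  ompi_info | grep openib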
  

Thanks a lot!!

  




[slurm-users] Limit output file size with lua script

2019-12-17 Thread sysadmin.caos

Hi,

I would like to know if it is possible to limit the size of the output
file generated by a job using a lua script. I have looked at the
"job_descriptor" structure in slurm.h but I have not seen any field for
limiting that. ...I need this because a user submitted a job that
generated a 500 GB output file... of no value...
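
The only workaround I can think of (a sketch, not a job_submit feature:
it caps any file written by the job's shell and its children) is to set a
ulimit in the batch script or in a prolog:

  # bash: -f is in 1024-byte blocks, so this caps files at ~1 GB
  ulimit -f 1048576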


Thanks.



[slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread sysadmin.caos

Hi,

my SLURM cluster has a partition configured with a "TimeLimit" of 8
hours. Now a job has been running for 9h30m and it has not been cancelled.
During these 9 and a half hours, a script has executed "scontrol
update partition=mypartition state=down" to disable this partition
(it is an educational cluster and student classes start at 8:00).


Why hasn't my job been cancelled? There is no log entry at the SLURM
controller that explains this behaviour.
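
The only check I can think of (a sketch; I'm assuming the 8-hour limit is
the partition's MaxTime, and <jobid> is a placeholder):

  scontrol show job <jobid> | grep -o 'TimeLimit=[^ ]*'
  scontrol show partition mypartition | grep -o 'MaxTime=[^ ]*'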


Thanks.



[slurm-users] Limit size submission script

2017-11-09 Thread sysadmin.caos

Hello,

A researcher using a SLURM cluster (version 17.02.7) has created
a submission script whose size is 8 MB (yeah!!). I have read that SLURM
has a size limit of 4 MB... Can this limit be changed?


Thanks.



Re: [slurm-users] Limit size submission script

2017-11-09 Thread sysadmin.caos

Thanks, I didn't know about that parameter!!!

On 09/11/2017 at 14:45, Brian W. Johanson wrote:

man slurm.conf

   SchedulerParameters
          The interpretation of this parameter varies by
          SchedulerType.  Multiple options may be comma separated.

          max_script_size=#
                 Specify the maximum size of a batch script, in
                 bytes.  The default value is 4 megabytes.  Larger
                 values may adversely impact system performance.
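
For example, to raise the limit to 10 MB (a sketch; the value is in
bytes), set in slurm.conf and reconfigure/restart slurmctld:

   SchedulerParameters=max_script_size=10485760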




On 11/09/2017 03:56 AM, sysadmin.caos wrote:

Hello,

A researcher using a SLURM cluster (version 17.02.7) has
created a submission script whose size is 8 MB (yeah!!). I have read
that SLURM has a size limit of 4 MB... Can this limit be changed?


Thanks.







Re: [slurm-users] Limit size submission script

2017-11-09 Thread sysadmin.caos
The researcher uses Python to generate the submission script and then
submits that new script... I don't know more... but before Brian wrote
about "max_script_size", I had already rewritten the 8 MB script (from a
Python run) as a simple 250 KB script (with 2 "for" loops ;) )


On 09/11/2017 at 15:22, Loris Bennett wrote:

Hi,

I'd be interested to know in what circumstances a multiple-MB-sized
batch script would be sensible and/or necessary.

Cheers,

Loris

"Brian W. Johanson"  writes:


man slurm.conf

    SchedulerParameters
   The interpretation of this parameter varies by SchedulerType.
Multiple options may be comma separated.

   max_script_size=#
  Specify the maximum size of a batch script, in bytes.  The
default value is 4 megabytes.  Larger values may adversely impact system
performance.



On 11/09/2017 03:56 AM, sysadmin.caos wrote:

Hello,

A researcher using a SLURM cluster (version 17.02.7) has created a
submission script whose size is 8 MB (yeah!!). I have read that SLURM has
a size limit of 4 MB... Can this limit be changed?

Thanks.





[slurm-users] srun not allowed in a partition

2018-03-21 Thread sysadmin.caos

  
  
Hello,

I would like to configure SLURM with two partitions:

  one called "batch.q" only for batchs jobs
  one called "interactive.q" only for batch jobs

What I want to get is a batch partition that doesn't allow "srun"
commands from the command line and a interactive partition only for
"srun" commands.

Is it possible in SLURM?

Thanks.
  




Re: [slurm-users] srun not allowed in a partition

2018-03-21 Thread sysadmin.caos
I'm trying to compile SLURM 17.02.7 with "lua" support by executing
"./configure && make && make contribs && make install", but make does
nothing in src/plugins/job_submit/lua and I don't know why...


How do I compile that plugin? The rest of the job_submit plugins compile
with no problems (defaults, all_partitions, partition, ...)
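
My only guess so far (an assumption on my part): configure only enables
that plugin when it detects the lua development headers, so something like
this may show what happened (package names depend on the distribution):

  grep -i lua config.log | head
  # RHEL/CentOS:   yum install lua-devel
  # Debian/Ubuntu: apt-get install liblua5.1-0-dev
  ./configure && make -C src/plugins/job_submit/lua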


Thanks



[slurm-users] job_submit.lua script

2018-04-11 Thread sysadmin.caos

  
  
Hello,

I'm writing my own "job_submit.lua" to control in which partition a
user can run "srun" and how many CPUs and nodes are allowed. I want to
allow "srun" only in the "interactive" partition, with only one core and
one node. I have written this script but I'm getting errors:
function slurm_job_submit(job_desc, part_list, submit_uid)
    local partition = "interactive"
    if ((job_desc.script == nil or job_desc.script == '')
            and job_desc.partition ~= partition) then
        slurm.log_info("slurm_job_submit: interactive job submitted by user_id:%d to partition:%s rejected",
            job_desc.user_id, job_desc.partition)
        return slurm.FAILURE
    end

    local max = 1

    if (job_desc.cpus_per_task > max or job_desc.max_cpus > max
            or job_desc.max_nodes > max) then
        slurm.log_user("slurm_job_submit: parameter error %s %s %u",
            job_desc.cpus_per_task, job_desc.max_cpus, job_desc.max_nodes)
        return slurm.FAILURE
    end

    return slurm.SUCCESS
end

  
It seems the fields "cpus_per_task" and "max_cpus" are not being printed
correctly ("max_nodes" is all right), because after running "srun"
like this:

      srun -p interactive.q -N 1 -n 1 -w aolin15 --pty bash

...I get this error:

      srun: error: slurm_job_submit: parameter error 65534 4294967294 1

The field "cpus_per_task" is uint16_t, while "max_cpus" and
"max_nodes" are uint32_t... but I have tried with "%u" and it didn't
work...
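
(A guess on my part: 65534 is 0xfffe and 4294967294 is 0xfffffffe, which
look like Slurm's NO_VAL16/NO_VAL "unset" sentinels, so maybe those fields
were simply never set by that srun line. A sketch that treats them as
defaults:)

      local NO_VAL16 = 65534        -- 0xfffe, "unset" uint16_t
      local NO_VAL   = 4294967294   -- 0xfffffffe, "unset" uint32_t
      local cpt = job_desc.cpus_per_task
      if cpt == NO_VAL16 then cpt = 1 end   -- not given on the command line
      local mc = job_desc.max_cpus
      if mc == NO_VAL then mc = 1 end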

  
Could anybody help me?

Thanks.


  




Re: [slurm-users] job_submit.lua script

2018-04-12 Thread sysadmin.caos

  
  
My purpose with the job_submit.lua script is to limit "srun" to no more
than one node and one CPU; in other words, "srun -N 1 -n 1". For this
reason, in my future script I use an "if" to compare those values:
function slurm_job_submit(job_desc, part_list, submit_uid)
    local partition = "interactive"
    if ((job_desc.script == nil or job_desc.script == '')
            and job_desc.partition ~= partition) then
        slurm.log_info("slurm_job_submit: interactive job submitted by user_id:%d to partition:%s rejected",
            job_desc.user_id, job_desc.partition)
        return slurm.FAILURE
    end

    local max = 1

    if (job_desc.cpus_per_task > max or job_desc.max_cpus > max
            or job_desc.max_nodes > max) then
        slurm.log_user("slurm_job_submit: parameter error %s %s %u",
            job_desc.cpus_per_task, job_desc.max_cpus, job_desc.max_nodes)
        return slurm.FAILURE
    end

    return slurm.SUCCESS
end

I understand that:

  -N 1 --> job_desc.max_nodes=1
  -n 1 --> job_desc.max_cpus=1

Am I wrong?
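
Or should I compare the task count instead? (A sketch; I'm assuming "-n"
shows up in job_submit.lua as job_desc.num_tasks rather than max_cpus:)

      if job_desc.num_tasks ~= nil and job_desc.num_tasks > 1 then
          return slurm.FAILURE   -- more than one task requested
      end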
Thanks.

  




[slurm-users] Limit job_submit.lua script for only srun

2018-04-25 Thread sysadmin.caos

Hello,

I have written my own job_submit.lua script to limit "srun"
executions to one processor, one task and one node. If I test it with
"srun", everything works fine. However, if I now try to run an sbatch job
with "-N 12" or "-n 2", job_submit.lua is also invoked and my job is
rejected because I'm requesting more than one task and more than one
node. So, is it possible for the lua script to act only when the user
runs "srun" and not "sbatch"? I have been reading "typedef struct
job_descriptor" in the slurm/slurm.h file but there is no field that
records the command the user ran on the command line.
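
The only idea I have (an assumption, reusing the check from my earlier
script): batch jobs carry the script text in job_desc.script, while srun
jobs do not, so the limits could be skipped when a script is present:

      function slurm_job_submit(job_desc, part_list, submit_uid)
          if (job_desc.script ~= nil and job_desc.script ~= '') then
              return slurm.SUCCESS   -- sbatch: script present, skip the limits
          end
          -- ...one-node/one-task checks for interactive (srun) jobs here...
          return slurm.SUCCESS
      end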


Thanks.



[slurm-users] sacct does not show user

2018-04-26 Thread sysadmin.caos

Hello,

when I run "sacct", the output is this:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[...]
2810               bash     nodo.q     (null)          0     FAILED      2:0
2811               bash     nodo.q     (null)          0  COMPLETED      0:0
2812         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2813         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2814         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2815         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2816         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2817               bash     nodo.q     (null)          0  COMPLETED      0:0
2807         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2808         ModelFLAM+     nodo.q     (null)          0  CANCELLED      0:0
2818         ModelRepa+     nodo.q     (null)          0  CANCELLED      0:0
2819               bash     nodo.q     (null)          0  COMPLETED      0:0
2820         ModelRepa+     nodo.q     (null)          0    PENDING      0:0
[...]

It seems "Account" column always shows "(null)" value. Is it normal or 
my SLURM has a wrong configuration?
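
For reference, I suppose the user can be requested explicitly (a sketch;
assuming accounting is recording it at all):

  sacct --format=JobID,User,Account,JobName,Partition,State,ExitCode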


Thanks.



Re: [slurm-users] sacct does not show user

2018-04-27 Thread sysadmin.caos
I'm using AccountingStorageType=accounting_storage/filetxt because I'm
running some tests. With "filetxt", can I get the "account" (username)
with sacct?




[slurm-users] Accounting not recording jobs

2018-05-08 Thread sysadmin.caos

  
  
Hello,

after configuring SLURM 17.11.5 with accounting/mysql, it seems the
database is not recording any job. If I run "sacct -", I get
this output:
sacct: Jobs eligible from Tue May 08 00:00:00 2018 - Now
sacct: debug:  Options selected:
    opt_completion=0
    opt_dup=0
    opt_field_list=(null)
    opt_help=0
    opt_allocs=0
sacct: debug3: Trying to load plugin
/soft/slurm-17.11.5/lib/slurm/accounting_storage_slurmdbd.so
sacct: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacct: debug3: Success.
sacct: debug4: Accounting storage SLURMDBD plugin loaded
sacct: debug3: Trying to load plugin
/soft/slurm-17.11.5/lib/slurm/auth_munge.so
sacct: debug:  Munge authentication plugin loaded
sacct: debug3: Success.
sacct: debug:  slurmdbd: Sent PersistInit msg
sacct: debug2: Clusters requested:  q50004
sacct: debug2: Userids requested:   all
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: debug:  slurmdbd: Sent fini msg

However, I suppose the job I submitted (and which finished) a minute ago
should appear there... Why does it not appear?


My slurmdbd.conf is:

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveSteps=no
ArchiveSuspend=no
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=localhost
DebugLevel=4
PurgeEventAfter=12month
PurgeJobAfter=12month
PurgeResvAfter=12month
PurgeStepAfter=12month
PurgeSuspendAfter=12month
LogFile=/var/log/slurmdbd.log
PidFile=/var/tmp/slurm/slurmdbd.pid
SlurmUser=slurm
StorageHost=localhost
StorageLoc=slurmdb
StoragePass=slurm
StorageType=accounting_storage/mysql
StorageUser=slurm


And my slurm.conf is:

[...]
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageLoc=/var/log/slurm/accounting
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions
ClusterName=Q50004
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmdctl.log
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd.log
[...]


The file /var/log/slurm/accounting contains my last job... but I don't
understand why the job appears there when I have configured accounting
with "AccountingStorageType=accounting_storage/slurmdbd".


Thanks.