date:20240819

[slurm-users] Re: Access to --constraint= in Lua cli_filter?

2024-08-19 Thread Ward Poelmans via slurm-users


Hi Kevin,

On 19/08/2024 08:15, Kevin Buckley via slurm-users wrote:

> If I supply a
>
>    --constraint=
>
> option to an sbatch/salloc/srun, does the arg appear inside
> any object that a Lua CLI Filter could access?


Have a look if you can spot them in:
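function slurm_cli_pre_submit(options, pack_offset)
  -- Dump the environment and the parsed CLI options as JSON to the log
  local env_json = slurm.json_env()
  slurm.log_info("ENV: %s", env_json)
  local opt_json = slurm.json_cli_options(options)
  slurm.log_info("OPTIONS: %s", opt_json)
  return slurm.SUCCESS
end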

I thought all options could be accessed in the cli filter.

Ward



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Arko Roy via slurm-users

Thanks Loris and Gareth. Here is the job submission script; if you find any
errors, please let me know.
Since I am not the admin but just a user, I think I don't have access to
the prolog and epilog files.

> If the jobs are independent, why do you want to run them all on the same
> node?

I am running sequential codes: essentially 50 copies of the same code, each
with a different parameter value.
Since I am using the Slurm scheduler, the nodes and cores are allocated
depending on the available resources. So there are instances when 20 of
them go to 20 free cores on one node and the remaining 30 go to 30 free
cores on another node. It turns out that only 1 job out of the 20 and 1 job
out of the 30 complete successfully with exit code 0; the rest are
terminated with exit code 9.
For information, I run sjobexitmod -l jobid to check the exit codes.

--
the submission script is as follows:



#!/bin/bash

# Setting slurm options



# lines starting with "#SBATCH" define your jobs parameters
# requesting the type of node on which to run job
##SBATCH --partition 
#SBATCH --partition=standard

# telling slurm how many instances of this job to spawn (typically 1)

##SBATCH --ntasks 
##SBATCH --ntasks=1
#SBATCH --nodes=1
##SBATCH -N 1
##SBATCH --ntasks-per-node=1



# setting number of CPUs per task (1 for serial jobs)

##SBATCH --cpus-per-task 

##SBATCH --cpus-per-task=1

# setting memory requirements

##SBATCH --mem-per-cpu 
#SBATCH --mem-per-cpu=1G

# propagating max time for job to run

##SBATCH --time 
##SBATCH --time 
##SBATCH --time 
#SBATCH --time 10:0:0
#SBATCH --job-name gstate

#module load compiler/intel/2018_4
module load fftw-3.3.10-intel-2021.6.0-ppbepka
echo "Running on $(hostname)"
echo "We are in $(pwd)"



# run the program

/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Loris Bennett via slurm-users

Dear Arko,

Arko Roy  writes:

> /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &

You should not write 

  &

at the end of the above command.  This will run your program in the
background, which will cause the submit script to terminate, which in
turn will terminate your job.
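
The effect can be sketched outside Slurm. This is a minimal illustration, assuming a hypothetical `run_program` function that stands in for the real executable: a command started with `&` lets the script carry on to its end at once, while `wait` (or simply running the command in the foreground) keeps the script alive until the program has finished.

```shell
#!/bin/bash
# Hypothetical stand-in for the real executable: runs briefly, then exits 0.
run_program() { sleep 1; }

# Started in the background, the script would otherwise fall straight
# through to its end while the program is still running.
run_program &

# 'wait' blocks until the given background child has exited and returns
# its exit status, so the script stays alive until the work is done.
wait $!
status=$?

echo "program exited with status $status"
```

Dropping the `&` altogether has the same effect for a single command and is the simpler choice.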

Regards

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Arko Roy via slurm-users

Dear Loris,

I just tried removing the &, but it didn't work.


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Davide DelVento via slurm-users

Since each instance of the program is independent and you are using one
core for each, it would be better to let Slurm deal with that and schedule
them concurrently as it sees fit. Maybe you simply need to add a
directive that allows jobs to share a node.
Alternatively (if jobs must be exclusive at your site), check what
their recommended way of doing this is. Some sites prefer Dask, others an
MPI-based serial-job consolidation (often called a "command file"), and
others a technique similar to what you are doing. Rather than reinventing
the wheel, I suggest checking what your site recommends in this situation.
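
One way to hand the scheduling of the independent runs to Slurm is a job array. What follows is only a sketch under stated assumptions: the `inputN/a_N.out` layout and the `BASE_DIR` variable are hypothetical (only `input1/a_1.out` appears in the original script), and your site may recommend something different. `--array` and `SLURM_ARRAY_TASK_ID` are standard sbatch features.

```shell
#!/bin/bash
#SBATCH --job-name=gstate
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=10:00:00
#SBATCH --array=1-50        # one array task per parameter value

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 so the
# script can also be tried outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical layout: one input directory and executable per task.
# BASE_DIR is an illustrative variable, not something Slurm provides.
base_dir="${BASE_DIR:-$HOME/ferro-detun}"
program="$base_dir/input${task_id}/a_${task_id}.out"

echo "Task $task_id would run: $program"

if [ -x "$program" ]; then
    # Run in the foreground (no trailing '&').
    "$program"
else
    echo "executable not found: $program" >&2
fi
```

Each array task is scheduled as its own one-core job, so Slurm is free to place them on whichever nodes have free cores.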


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Brian Andrus via slurm-users

IIRC, slurm parses the batch file as options until it hits the first 
non-comment line, which includes blank lines.


You may want to double-check some of the gaps in the option section of 
your batch script.


That said, you say you removed the '&' at the end of the 
command, which would help.


If they are all exiting with exit code 9, you need to look at the code 
for your a.out to see what code 9 means, as that is what is raising that 
error. Slurm sees the exit code and, if it is non-zero, interprets it as a 
failed job.
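
To help tell the cases apart, the batch script could record the program's exit status itself. This is a minimal sketch, with a hypothetical `run_program` stand-in that deliberately kills itself with signal 9. It relies on the shell convention that a status above 128 means the process was terminated by a signal (status minus 128), whereas a small non-zero status comes from the program's own exit call.

```shell
#!/bin/bash
# Hypothetical stand-in for the real program: a child shell that kills
# itself with SIGKILL, to show how a signal death is reported.
run_program() { bash -c 'kill -9 $$'; }

run_program
status=$?

if [ "$status" -gt 128 ]; then
    # Shell convention: 128 + N means "terminated by signal N".
    sig=$((status - 128))
    msg="killed by signal $sig ($(kill -l "$sig"))"
elif [ "$status" -ne 0 ]; then
    msg="exited with error code $status"
else
    msg="completed successfully"
fi

echo "$msg"
```

In a real job the message would end up in the job's output file, next to whatever the program itself printed.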


Brian Andrus


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Bjørn-Helge Mevik via slurm-users

Brian Andrus via slurm-users  writes:

> IIRC, slurm parses the batch file as options until it hits the first
> non-comment line, which includes blank lines.

Blank lines do not stop sbatch from parsing the file.  (But commands
do.)
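
The rule can be illustrated with a rough emulation. This is not sbatch itself, only an awk sketch of the documented behaviour, and the function name `active_directives` is made up for the example: blank lines and comments (including `##SBATCH`) are skipped, and scanning stops at the first command.

```shell
#!/bin/bash
# Rough emulation (not sbatch itself): print the #SBATCH lines that appear
# before the first line that is neither blank nor a comment.
active_directives() {
    awk '
        /^[[:space:]]*$/      { next }         # blank lines are skipped
        /^#SBATCH[[:space:]]/ { print; next }  # a directive that counts
        /^#/                  { next }         # other comments, incl. ##SBATCH
                              { exit }         # first command: stop scanning
    ' "$1"
}

# Small demo script with gaps, a disabled directive and a late directive.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash

#SBATCH --nodes=1
##SBATCH --ntasks=1

#SBATCH --time 10:0:0
echo "first command"
#SBATCH --job-name ignored
EOF

found=$(active_directives "$demo")
rm -f "$demo"
echo "$found"
```

In the demo, the two directives before `echo` are listed, while the one after it is not.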

-- 
B/H



[slurm-users] Re: Access to --constraint= in Lua cli_filter?

2024-08-19 Thread Kevin Buckley via slurm-users


On 2024/08/19 15:11, Ward Poelmans via slurm-users wrote:


> Have a look if you can spot them in:
> function slurm_cli_pre_submit(options, pack_offset)
>   env_json = slurm.json_env()
>   slurm.log_info("ENV: %s", env_json)
>   opt_json = slurm.json_cli_options(options)
>   slurm.log_info("OPTIONS: %s", opt_json)
> end
>
> I thought all options could be accessed in the cli filter.


Cheers Ward, however, I'd already dumped the options array
(OK: it's Lua, so make that: table) and didn't see anything,
hence wondering if constraints might be in their own
object/array/table.

But no matter: something I spotted in the options["args"]
array/table has since given me something reproducible to
"key off", so that I can take a different path through the
filter logic when it is seen, which is what I was hoping
to do by passing a constraint in.

There's usually more than one way to skin a cat: and this
cat is now skinless!

Cheers again,
Kevin
