Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Angel de Vicente
Hi,

Bjørn-Helge Mevik  writes:

> Wouldn't it be simpler to just refuse too long interactive jobs in
> job_submit.lua?

Yes, I guess so. I proposed the idea of having different partitions
because then the constraints are set at the level of the partition, which
is probably easier to manage than modifying the job_submit.lua script,
but you can probably get the same result both ways.

Anyway, the goal was to point to the job_submit.lua mechanism, because
without it, even if you create separate partitions for batch and
interactive jobs, it is not possible (or at least I wouldn't know how)
to enforce a certain policy only on interactive jobs.
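
(For reference, a minimal sketch of what such a two-partition setup could
look like in slurm.conf; the partition names, node lists and time limits
below are only placeholders, not our actual configuration:

   # interactive jobs: short MaxTime enforced by the partition itself
   PartitionName=interactive Nodes=node[01-04] MaxTime=02:00:00 DefaultTime=00:30:00
   # batch jobs: longer MaxTime, default partition
   PartitionName=batch Nodes=node[01-32] MaxTime=7-00:00:00 Default=YES

job_submit.lua then only needs to route each job to the right partition.)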

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




[slurm-users] New future and roadmap for Slurm-web

2023-05-08 Thread Rémi Palancher
Hi Slurm community,

Slurm-web is an open source web interface for the Slurm workload manager:
http://rackslab.github.io/slurm-web/

The project was born in 2015 (*). It was originally funded by EDF [2] (huge
thanks to them!) and reached a nice and unique feature set with the 2.x
versions. Unfortunately, in recent years the software has suffered from
reduced maintenance and investment.

Today, Slurm-web is being adopted by Rackslab [3], a small company focused on
the development of open source solutions for HPC operations, which becomes its
new official maintainer. An ambitious new roadmap with a long-term vision for
the project has been defined, starting with version 3.0 coming later this year.

In addition to the existing Slurm-web feature set, the following new features
are planned:

- Near real-time updates of the dashboard
- Accounting reports and visualisation of past jobs
- Built-in metrics about jobs and scheduling
- Job submission and inspection
- Vastly improved Gantt view
- GPGPU support
- QOS, associations and reservations management
- Native RPM/deb packages and containers for easy deployment on most Linux 
distributions

The software architecture will be reworked with modern, established
technologies; notably, it will be based on the reference slurmrestd REST API.
The source code will remain free, published under GPLv3, in keeping with
Rackslab's commitment to the free software community.
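
For readers unfamiliar with slurmrestd, here is a rough illustration of the
kind of call such a web frontend sits on top of (the host name, API version
segment and JWT authentication below are only assumptions; they depend on the
local Slurm release and configuration):

   # obtain a JWT (requires AuthAltTypes=auth/jwt) and query the jobs endpoint
   export $(scontrol token)
   curl -s -H "X-SLURM-USER-NAME: $USER" \
           -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
           http://slurmrestd.example.org:6820/slurm/v0.0.39/jobs | jq '.jobs | length'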

Our goal is clearly to build the reference open source web interface for all
users of Slurm-based HPC clusters.

More details about the roadmap have been published in the project discussions
on GitHub: https://github.com/rackslab/slurm-web/discussions/235

You are more than welcome to discuss it there, ask questions and leave
comments!

Best regards,

(*) The original announcement can still be found in the archives of this 
mailing-list! [1]
[1] https://groups.google.com/g/slurm-users/c/LiD2Pa8r22A/m/fDHWm5GomJsJ
[2] https://www.edf.fr/en
[3] https://rackslab.io
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Ole Holm Nielsen

On 5/8/23 08:39, Bjørn-Helge Mevik wrote:

Angel de Vicente  writes:


But one possible way to achieve something similar is to have a partition only
for interactive jobs and a different partition for batch jobs, and then
enforce that each job uses the right partition. In order to do this, I
think we can use the Lua contrib module (check the job_submit.lua
example).


Wouldn't it be simpler to just refuse too long interactive jobs in
job_submit.lua?


This sounds like a good idea, but how would one identify an interactive 
job in the job_submit.lua script?  A solution was suggested in 
https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu :

   "Interactive jobs have no script and job_desc.script will be empty / not set."

So maybe something like this code snippet?

if job_desc.script == nil then
   -- This is an interactive job (no batch script attached)
   -- check the job's time limit (job_desc.time_limit is expressed in minutes)
   if job_desc.time_limit > 3600 then
      slurm.log_user("NOTICE: Interactive jobs are limited to 3600 minutes")
      -- ESLURM_INVALID_TIME_LIMIT in slurm_errno.h
      return 2051
   end
end
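
For example, assuming the snippet above sits inside slurm_job_submit() and
JobSubmitPlugins=lua is enabled in slurm.conf, the behaviour would roughly be:

   salloc -t 80:00:00           # interactive request over the limit -> rejected
   sbatch -t 80:00:00 job.sh    # batch job with a script -> unaffected by this check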

/Ole



Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> On 5/8/23 08:39, Bjørn-Helge Mevik wrote:
>> Angel de Vicente  writes:
>> 
>>> But one possible way to achieve something similar is to have a partition only
>>> for interactive jobs and a different partition for batch jobs, and then
>>> enforce that each job uses the right partition. In order to do this, I
>>> think we can use the Lua contrib module (check the job_submit.lua
>>> example).
>> Wouldn't it be simpler to just refuse too long interactive jobs in
>> job_submit.lua?
>
> This sounds like a good idea, but how would one identify an
> interactive job in the job_submit.lua script?

Good question. :)  I merely guessed it is possible. :)

> A solution was suggested in
> https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu
> Interactive jobs have no script and job_desc.script will be empty /
> not set.
>
> So maybe something like this code snippet?
>
> if job_desc.script == NIL then

That sounds like it should work, yes.  (But perhaps double-check that jobs
submitted with "sbatch --wrap" or taking the job script from stdin (if
that is still possible) get job_desc.script set.)
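
One quick (untested) way to check would be to log the field from the plugin,
e.g. slurm.log_info("script set: %s", tostring(job_desc.script ~= nil and
job_desc.script ~= '')), and then submit a job each way:

   sbatch --wrap="hostname"                    # wrapped command
   echo -e '#!/bin/bash\nhostname' | sbatch    # script read from stdin
   srun --pty hostname                         # interactive, no batch script

before grepping the slurmctld log.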

-- 
B/H




Re: [slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

2023-05-08 Thread Paul Edmon
I would recommend standing up an instance of Open XDMoD, as it handles most of
this for you in its summary reports.



https://open.xdmod.org/10.0/index.html


-Paul Edmon-


On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote:

Good morning,

We have at least one billed account right now where the associated
researchers are able to submit jobs that run against our normal queue
with fairshare, but not for an academic research purpose. So we'd like
to accurately calculate their CPU hours. We are currently using a
script that queries the database with sacct and sums up the value of
ElapsedRaw * AllocCPUS for all jobs. But this seems limited, because
requeueing will create what the sacct man page calls duplicates. By
default, jobs normally only get requeued when something outside of the
user's control happens, like a NODE_FAIL, or when they are requeued
manually with an scontrol command. Users can also requeue jobs
themselves, although that's not a feature we've seen our researchers use.


However, with the new scrontab feature, whenever the cron entry is executed
more than once, sacct reports the previous jobs as "requeued" and they
are only visible by looking up duplicates. I haven't seen any billed
account use requeueing or scrontab yet, but it's clear to me that this
could become significant once researchers start using scrontab more.
Scrontab has existed since one of the releases from 2020, I believe, but
we enabled it this year and see it as much more powerful than the
traditional Linux crontab.


What would be the best way to more thoroughly calculate ElapsedRaw *
AllocCPUS, accounting for duplicates but optionally ignoring
unintentional requeueing such as that caused by a NODE_FAIL?


Here's the main loop of the simple bash script I have now:

while IFS='|' read -r end elapsed cpus; do
    # if a job crosses the month barrier
    # the entire bill will be put under the 2nd month
    year_month="${end:0:7}"
    if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
        continue
    fi
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)

Our slurmdbd is configured to keep 6 months of data.

It makes sense to loop through the job IDs instead, using sacct's
-D/--duplicates option each time to reveal the hidden duplicates in
the REQUEUED state, but I'm interested to hear whether there are
alternatives or whether I'm missing anything here.
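
For the record, here is a rough, untested sketch of one variant of that idea:
adding -D directly to the same windowed query (rather than looping per job ID)
and skipping runs that ended in NODE_FAIL. The exact State strings that sacct
reports for superseded runs would need to be verified against real data:

while IFS='|' read -r state end elapsed cpus; do
    # skip superseded runs that ended in NODE_FAIL (unintentional requeues)
    [[ "$state" == NODE_FAIL* ]] && continue
    [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
    year_month="${end:0:7}"
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + elapsed * cpus ))
done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -D -o State,End,ElapsedRaw,AllocCPUS -X -P --noheader)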


Thanks,

Joseph

--
Joseph F. Guzman - ITS (Advanced Research Computing)

Northern Arizona University

joseph.f.guz...@nau.edu


Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Angel de Vicente
Hello,

Bjørn-Helge Mevik  writes:

>> A solution was suggested in
>> https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu
>> Interactive jobs have no script and job_desc.script will be empty /
>> not set.
>>
>> So maybe something like this code snippet?
>>
>> if job_desc.script == NIL then

In my case (merely a variation on some older post here on
slurm-users), I'm using the following to make sure jobs go to the right
queue (either 'batch' or 'interactive'), and it seems to work just
fine:


if (job_desc.script == nil or job_desc.script == '') then
   if (job_desc.partition ~= interactive_partition) then
      job_desc.partition = interactive_partition
      slurm.log_user("%s: normal job seems to be interactive, moved to %s partition.",
                     log_prefix, job_desc.partition)
   end
else
   if (job_desc.partition == interactive_partition) then
      job_desc.partition = batch_partition
      slurm.log_user("%s: batch jobs cannot be run in the interactive partition, moved to %s partition.",
                     log_prefix, job_desc.partition)
   end
end
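
(interactive_partition, batch_partition and log_prefix are defined earlier in
our job_submit.lua.) In practice the effect is roughly:

   srun --pty bash                 # no script, so it is moved to the interactive partition
   sbatch -p interactive job.sh    # has a script, so it is moved back to the batch partition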


-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52

