Re: [slurm-users] Providing users with info on wait time vs. run time

Ümit Seren Fri, 16 Sep 2022 06:47:05 -0700

On Fri, Sep 16, 2022 at 3:43 PM Sebastian Potthoff <
s.potth...@uni-muenster.de> wrote:


> Hi Hermann,
>
> So you both are happily(?) ignoring this warning the "Prolog and Epilog
> Guide",
> right? :-)
>
> "Prolog and Epilog scripts [...] should not call Slurm commands (e.g.
> squeue,
> scontrol, sacctmgr, etc)."
>
>
> We have probably been doing this since before the warning was added to
> the documentation.  So we are "ignorantly ignoring" the advice :-/
>
>
> Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … what
> can you do.
>

FYI: SLURM_JOB_STDOUT among other ENV variables was added in 22.05 (see
https://slurm.schedmd.com/news.html) so it might not be available if you
have an older SLURM version.



>
> May I ask how big your clusters are (number of nodes) and how heavily they
> are
> used (submitted jobs per hour)?
>
>
> We have around 500 nodes (mostly 2x18 cores). Jobs ending (i.e. calling
> the epilog script) varies quite a lot between 1000 and 15k a day, so
> something in between 40 and 625 Jobs/hour. During those peaks Slurm can
> become noticeably slower, however usually it runs fine.
>
> Sebastian
>
> Am 16.09.2022 um 15:15 schrieb Loris Bennett <loris.benn...@fu-berlin.de>:
>
> Hi Hermann,
>
> Hermann Schwärzler <hermann.schwaerz...@uibk.ac.at> writes:
>
> Hi Loris,
> hi Sebastian,
>
> thanks for the information on how you are doing this.
> So you both are happily(?) ignoring this warning the "Prolog and Epilog
> Guide",
> right? :-)
>
> "Prolog and Epilog scripts [...] should not call Slurm commands (e.g.
> squeue,
> scontrol, sacctmgr, etc)."
>
>
> We have probably been doing this since before the warning was added to
> the documentation.  So we are "ignorantly ignoring" the advice :-/
>
> May I ask how big your clusters are (number of nodes) and how heavily they
> are
> used (submitted jobs per hour)?
>
>
> We have around 190 32-core nodes.  I don't know how I would easily find
> out the average number of jobs per hour.  The only problems we have had
> with submission have been when people have written their own mechanisms
> for submitting thousands of jobs.  Once we get them to use job array,
> such problems generally disappear.
>
> Cheers,
>
> Loris
>
> Regards,
> Hermann
>
> On 9/16/22 9:09 AM, Loris Bennett wrote:
>
> Hi Hermann,
> Sebastian Potthoff <s.potth...@uni-muenster.de> writes:
>
> Hi Hermann,
>
> I happened to read along this conversation and was just solving this issue
> today. I added this part to the epilog script to make it work:
>
> # Add job report to stdout
> StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut |
> /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }')
>
> NODELIST=($(/usr/bin/scontrol show hostnames))
>
> # Only add to StdOut file if it exists and if we are the first node
> if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ]
> then
>   echo "################################# JOB REPORT
> ##################################" >> $StdOut
>   /usr/bin/seff $SLURM_JOB_ID >> $StdOut
>   echo
> "###############################################################################"
> >> $StdOut
> fi
>
> We do something similar.  At the end of our script pointed to by
> EpilogSlurmctld we have
>   OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'`
>   if [ ! -f "$OUT" ]; then
>     exit
>   fi
>   printf "\n== Epilog Slurmctld
> ==================================================\n\n" >>  ${OUT}
>   seff ${SLURM_JOB_ID} >> ${OUT}
>   printf
>
> "\n======================================================================\n"
>
> ${OUT}
>
>   chown ${user} ${OUT}
> Cheers,
> Loris
>
>   Contrary to what it says in the slurm docs
> https://slurm.schedmd.com/prolog_epilog.html  I was not able to use the
> env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In addition I
> had to
> make sure it is only called by the „leading“ node as the epilog script
> will be called by ALL nodes of a multinode job and they would all call seff
> and clutter up the output. Last thing was to check if StdOut is
> not of length zero (i.e. it exists). Interactive jobs would otherwise
> cause the node to drain.
>
> Maybe this helps.
>
> Kind regards
> Sebastian
>
> PS: goslmailer looks quite nice with its recommendations! Will definitely
> look into it.
>
> --
> Westfälische Wilhelms-Universität (WWU) Münster
> WWU IT
> Sebastian Potthoff (eScience / HPC)
>
>  Am 15.09.2022 um 18:07 schrieb Hermann Schwärzler <
> hermann.schwaerz...@uibk.ac.at>:
>
>  Hi Ole,
>
>  On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:
>
>  On 15-09-2022 16:08, Hermann Schwärzler wrote:
>
>  Just out of curiosity: how do you insert the output of seff into the
> out-file of a job?
>
>  Use the "smail" tool from the slurm-contribs RPM and set this in
> slurm.conf:
>  MailProg=/usr/bin/smail
>
>  Maybe I am missing something but from what I can tell smail sends an
> email and does *not* change or append to the .out file of a job...
>
>  Regards,
>  Hermann
>
>
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
>
>
>

Re: [slurm-users] Providing users with info on wait time vs. run time

Reply via email to