On Fri, Sep 16, 2022 at 3:43 PM Sebastian Potthoff < s.potth...@uni-muenster.de> wrote:
> Hi Hermann, > > So you both are happily(?) ignoring this warning the "Prolog and Epilog > Guide", > right? :-) > > "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. > squeue, > scontrol, sacctmgr, etc)." > > > We have probably been doing this since before the warning was added to > the documentation. So we are "ignorantly ignoring" the advice :-/ > > > Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … what > can you do. > FYI: SLURM_JOB_STDOUT among other ENV variables was added in 22.05 (see https://slurm.schedmd.com/news.html) so it might not be available if you have an older SLURM version. > > May I ask how big your clusters are (number of nodes) and how heavily they > are > used (submitted jobs per hour)? > > > We have around 500 nodes (mostly 2x18 cores). Jobs ending (i.e. calling > the epilog script) varies quite a lot between 1000 and 15k a day, so > something in between 40 and 625 Jobs/hour. During those peaks Slurm can > become noticeably slower, however usually it runs fine. > > Sebastian > > Am 16.09.2022 um 15:15 schrieb Loris Bennett <loris.benn...@fu-berlin.de>: > > Hi Hermann, > > Hermann Schwärzler <hermann.schwaerz...@uibk.ac.at> writes: > > Hi Loris, > hi Sebastian, > > thanks for the information on how you are doing this. > So you both are happily(?) ignoring this warning the "Prolog and Epilog > Guide", > right? :-) > > "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. > squeue, > scontrol, sacctmgr, etc)." > > > We have probably been doing this since before the warning was added to > the documentation. So we are "ignorantly ignoring" the advice :-/ > > May I ask how big your clusters are (number of nodes) and how heavily they > are > used (submitted jobs per hour)? > > > We have around 190 32-core nodes. I don't know how I would easily find > out the average number of jobs per hour. The only problems we have had > with submission have been when people have written their own mechanisms > for submitting thousands of jobs. Once we get them to use job array, > such problems generally disappear. > > Cheers, > > Loris > > Regards, > Hermann > > On 9/16/22 9:09 AM, Loris Bennett wrote: > > Hi Hermann, > Sebastian Potthoff <s.potth...@uni-muenster.de> writes: > > Hi Hermann, > > I happened to read along this conversation and was just solving this issue > today. I added this part to the epilog script to make it work: > > # Add job report to stdout > StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | > /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }') > > NODELIST=($(/usr/bin/scontrol show hostnames)) > > # Only add to StdOut file if it exists and if we are the first node > if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ] > then > echo "################################# JOB REPORT > ##################################" >> $StdOut > /usr/bin/seff $SLURM_JOB_ID >> $StdOut > echo > "###############################################################################" > >> $StdOut > fi > > We do something similar. At the end of our script pointed to by > EpilogSlurmctld we have > OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'` > if [ ! -f "$OUT" ]; then > exit > fi > printf "\n== Epilog Slurmctld > ==================================================\n\n" >> ${OUT} > seff ${SLURM_JOB_ID} >> ${OUT} > printf > > "\n======================================================================\n" > > ${OUT} > > chown ${user} ${OUT} > Cheers, > Loris > > Contrary to what it says in the slurm docs > https://slurm.schedmd.com/prolog_epilog.html I was not able to use the > env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In addition I > had to > make sure it is only called by the „leading“ node as the epilog script > will be called by ALL nodes of a multinode job and they would all call seff > and clutter up the output. Last thing was to check if StdOut is > not of length zero (i.e. it exists). Interactive jobs would otherwise > cause the node to drain. > > Maybe this helps. > > Kind regards > Sebastian > > PS: goslmailer looks quite nice with its recommendations! Will definitely > look into it. > > -- > Westfälische Wilhelms-Universität (WWU) Münster > WWU IT > Sebastian Potthoff (eScience / HPC) > > Am 15.09.2022 um 18:07 schrieb Hermann Schwärzler < > hermann.schwaerz...@uibk.ac.at>: > > Hi Ole, > > On 9/15/22 5:21 PM, Ole Holm Nielsen wrote: > > On 15-09-2022 16:08, Hermann Schwärzler wrote: > > Just out of curiosity: how do you insert the output of seff into the > out-file of a job? > > Use the "smail" tool from the slurm-contribs RPM and set this in > slurm.conf: > MailProg=/usr/bin/smail > > Maybe I am missing something but from what I can tell smail sends an > email and does *not* change or append to the .out file of a job... > > Regards, > Hermann > > > > -- > Dr. Loris Bennett (Herr/Mr) > ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de > > >