Hi Hermann,

Hermann Schwärzler <hermann.schwaerz...@uibk.ac.at> writes:
> Hi Loris,
> hi Sebastian,
>
> thanks for the information on how you are doing this.
> So you both are happily(?) ignoring this warning in the "Prolog and
> Epilog Guide", right? :-)
>
> "Prolog and Epilog scripts [...] should not call Slurm commands (e.g.
> squeue, scontrol, sacctmgr, etc)."

We have probably been doing this since before the warning was added to
the documentation. So we are "ignorantly ignoring" the advice :-/

> May I ask how big your clusters are (number of nodes) and how heavily
> they are used (submitted jobs per hour)?

We have around 190 32-core nodes. I don't know how I would easily find
out the average number of jobs per hour. The only problems we have had
with submission have been when people have written their own mechanisms
for submitting thousands of jobs. Once we get them to use job arrays,
such problems generally disappear.

Cheers,

Loris

> Regards,
> Hermann
>
> On 9/16/22 9:09 AM, Loris Bennett wrote:
>> Hi Hermann,
>>
>> Sebastian Potthoff <s.potth...@uni-muenster.de> writes:
>>
>>> Hi Hermann,
>>>
>>> I happened to read along this conversation and was just solving this
>>> issue today. I added this part to the epilog script to make it work:
>>>
>>> # Add job report to stdout
>>> StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }')
>>>
>>> NODELIST=($(/usr/bin/scontrol show hostnames))
>>>
>>> # Only add to StdOut file if it exists and if we are the first node
>>> if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ]
>>> then
>>>     echo "################################# JOB REPORT ##################################" >> $StdOut
>>>     /usr/bin/seff $SLURM_JOB_ID >> $StdOut
>>>     echo "###############################################################################" >> $StdOut
>>> fi
>>
>> We do something similar.
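(A small aside on the snippet above: the `-a` operator inside `[ ]` is
marked obsolescent by POSIX, so two tests joined by `&&` are the more
portable way to write that guard. A minimal sketch, with placeholder
values standing in for the real hostname/nodelist/StdOut lookups:)

```shell
# Placeholder values; in a real epilog these would come from
# $(hostname -s), ${NODELIST[0]} and the scontrol lookup of StdOut.
current_host='node001'
first_node='node001'
StdOut='/home/user/slurm-12345.out'

# Two [ ] tests joined by && instead of the obsolescent -a operator;
# [ -n "$StdOut" ] is the same check as [ ! -z "$StdOut" ].
if [ "$current_host" = "$first_node" ] && [ -n "$StdOut" ]; then
    echo "leading node with a StdOut path: would run seff here"
fi
```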
>> At the end of our script pointed to by EpilogSlurmctld we have
>>
>> OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'`
>>
>> if [ ! -f "$OUT" ]; then
>>     exit
>> fi
>>
>> printf "\n== Epilog Slurmctld ==================================================\n\n" >> ${OUT}
>> seff ${SLURM_JOB_ID} >> ${OUT}
>> printf "\n======================================================================\n" >> ${OUT}
>> chown ${user} ${OUT}
>>
>> Cheers,
>>
>> Loris
>>
>>> Contrary to what it says in the slurm docs
>>> https://slurm.schedmd.com/prolog_epilog.html I was not able to use
>>> the env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In
>>> addition I had to make sure it is only called by the "leading" node,
>>> as the epilog script will be called by ALL nodes of a multi-node job
>>> and they would all call seff and clutter up the output. The last
>>> thing was to check that StdOut is not of length zero (i.e. it
>>> exists). Interactive jobs would otherwise cause the node to drain.
>>>
>>> Maybe this helps.
>>>
>>> Kind regards
>>> Sebastian
>>>
>>> PS: goslmailer looks quite nice with its recommendations! Will
>>> definitely look into it.
>>>
>>> --
>>> Westfälische Wilhelms-Universität (WWU) Münster
>>> WWU IT
>>> Sebastian Potthoff (eScience / HPC)
>>>
>>> On 15.09.2022 at 18:07, Hermann Schwärzler
>>> <hermann.schwaerz...@uibk.ac.at> wrote:
>>>
>>> Hi Ole,
>>>
>>> On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:
>>>
>>> On 15-09-2022 16:08, Hermann Schwärzler wrote:
>>>
>>> Just out of curiosity: how do you insert the output of seff into the
>>> out-file of a job?
>>>
>>> Use the "smail" tool from the slurm-contribs RPM and set this in
>>> slurm.conf:
>>> MailProg=/usr/bin/smail
>>>
>>> Maybe I am missing something but from what I can tell smail sends an
>>> email and does *not* change or append to the .out file of a job...
>>>
>>> Regards,
>>> Hermann
>>
>

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
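PS: Since submission load came up: the job-array pattern we point users
towards is, schematically, a single `sbatch --array=1-1000%50 my_task.sh`
instead of a loop over a thousand sbatch calls (the `%50` part caps the
number of simultaneously running tasks at 50). The task script itself is
made up for illustration; each task picks its input via the array index:

```shell
#!/bin/sh
# my_task.sh - sketch of an array task script (name and input file
# layout are hypothetical). Submitted once as e.g.
#   sbatch --array=1-1000%50 my_task.sh
# Slurm sets SLURM_ARRAY_TASK_ID per task; fall back to 1 so the
# sketch also runs outside of Slurm.
i=${SLURM_ARRAY_TASK_ID:-1}
printf 'would process input_%04d.dat\n' "$i"
```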