Hi Martin,

I faced a similar problem where I had to deal with a huge taskfarm (thousands of tasks processing 1 TB of satellite data) with varying run times and memory requirements. I ended up writing a REST server that hands out tasks to clients. I then simply fired up an array job where each job would request new tasks from the task server until either all tasks were processed or it was killed for exceeding its run time or memory. The system keeps track of completed and running tasks, so you can reschedule tasks that didn't complete. The code is available on GitHub, and the paper describing the service is here: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.393/

Cheers
magnus
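PS: In case a concrete shape helps, below is a rough sketch of what the client side of such a setup can look like. The server URL, endpoint paths and JSON fields here are made up for illustration (they are not the actual API of my service; see the paper/repo above for that), and it assumes curl and jq are available on the compute nodes.

#!/bin/bash
# Illustrative worker loop for a REST-based task farm (hypothetical API).
SERVER=http://taskfarm.example.org:5000    # placeholder task-server URL

while true; do
    # ask the server for the next unprocessed task
    task=$(curl -sf "$SERVER/api/task/next") || exit 1
    [ -z "$task" ] && break                # empty reply: nothing left to do

    id=$(echo "$task"  | jq -r .id)
    cmd=$(echo "$task" | jq -r .cmd)

    # run the task; if the array element is killed for time/memory,
    # the server's bookkeeping lets the task be handed out again later
    bash -c "$cmd"
    rc=$?

    # report completion so the task is not handed out again
    curl -sf -X POST -H 'Content-Type: application/json' \
         -d "{\"exit_code\": $rc}" "$SERVER/api/task/$id/done"
done

Each element of the array job just runs this loop until the server has no work left or Slurm kills the job.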
-----Original Message-----
From: "Ohlerich, Martin" <martin.ohler...@lrz.de>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>, Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [ext] Re: [slurm-users] srun jobfarming hassle question
Date: Wed, 18 Jan 2023 13:39:30 +0000

Hello Björn-Helge.

Sigh ... First of all, of course, many thanks! This indeed helped a lot!

Two comments:
a) Why are the interfaces of Slurm tools changed? I once learned that interfaces should be designed to be as stable as possible. Otherwise, users get frustrated and go away.
b) This only works if I specify --mem for each task. Although manageable, I wonder why one needs to be that restrictive. In principle, in the use case outlined, one task could use a bit less memory, and the other may require a bit more than half of the node's available memory. (So clearly this isn't always predictable.) I only hope that in such cases the second task does not die from OOM ... (I will know soon, I guess.)

Really, thank you! It was a very helpful hint!
Cheers, Martin

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Bjørn-Helge Mevik <b.h.me...@usit.uio.no>
Sent: Wednesday, 18 January 2023 13:49
To: slurm-us...@schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question

"Ohlerich, Martin" <martin.ohler...@lrz.de> writes:

> Dear Colleagues,
>
> for quite some years now we have again and again been facing issues on our clusters with so-called job-farming (or task-farming) concepts in Slurm jobs using srun. And it bothers me that we can hardly help users with requests in this regard.
>
> From the documentation (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads like this:
>
> -------------------------------------------
> ...
> #SBATCH --nodes=??
> ...
> srun -N 1 -n 2 ... prog1 &> log.1 &
> srun -N 1 -n 1 ... prog2 &> log.2 &

Unfortunately, that part of the documentation is not quite up to date. The semantics of srun have changed a little over the last couple of years/Slurm versions, so today you have to use "srun --exact ...". From "man srun" (version 21.08):

       --exact
              Allow a step access to only the resources requested for
              the step. By default, all non-GRES resources on each node
              in the step allocation will be used. This option only
              applies to step allocations.
              NOTE: Parallel steps will either be blocked or rejected
              until requested step resources are available unless
              --overlap is specified. Job resources can be held after
              the completion of an srun command while Slurm does job
              cleanup. Step epilogs and/or SPANK plugins can further
              delay the release of step resources.

--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de
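For reference, once --exact (and a per-step --mem, as discussed above) is added, the documentation example quoted in the thread becomes something like the following. The node count, memory values and program names are placeholders, not a tested recipe:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --mem=90G             # placeholder: total memory for the whole job

# --exact limits each step to the resources it requests;
# per-step --mem is what the thread above found to be necessary
srun --exact -N 1 -n 2 --mem=60G prog1 &> log.1 &
srun --exact -N 1 -n 1 --mem=30G prog2 &> log.2 &

wait                          # keep the batch script alive until all steps finish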