Dear SLURM Users and Administrators,
I am looking for a way to customize the job submission exit statuses (mainly 
error codes) reported after the job has already been queued by the SLURM 
controller. We aim to provide more user-friendly messages and reminders in 
case of errors or obstacles (also adapted to our QoS/account system). 

For example, when the CPU-minutes limit of a given QoS (or account) has been 
exceeded, we would like to notify the user after the (successful) job 
submission that the job has been queued (as it should be) but will not start 
until the CPU-minutes limit is increased (and that they should contact the 
administrators to apply for more resources). Similarly, if the user queues a 
job that cannot be launched immediately because the MaxJobs limit (per user) 
has been reached, we would like to show an additional message after the 
srun/sbatch submission. We want to provide this information immediately after 
the job submission, without the user having to check the status with `squeue`. 
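
To illustrate the behaviour we have in mind, here is a minimal Python sketch 
of a wrapper-side helper that maps a pending reason to a friendlier message. 
It is only a sketch under assumptions: the reason names in the table are what 
I believe `squeue` reports in its %r field and should be checked against the 
installed SLURM version, and the function names and message texts are 
hypothetical. Note also that right after submission the reason may still be 
something generic (e.g. None or Priority) until the scheduler has evaluated 
the job.

```python
import subprocess

# Pending-reason names as printed by squeue's %r field.
# ASSUMPTION: verify these names against your SLURM version.
FRIENDLY_MESSAGES = {
    "QOSGrpCPUMinutesLimit":
        "Your job is queued, but the CPU minutes of your QoS are used up. "
        "Please contact the administrators to apply for more resources.",
    "AssocGrpCPUMinutesLimit":
        "Your job is queued, but the CPU minutes of your account are used "
        "up. Please contact the administrators to apply for more resources.",
    "QOSMaxJobsPerUserLimit":
        "Your job is queued and will start once one of your running jobs "
        "finishes (per-user MaxJobs limit of the QoS reached).",
    "AssocMaxJobsLimit":
        "Your job is queued and will start once one of your running jobs "
        "finishes (MaxJobs limit of the account reached).",
}


def explain_reason(reason):
    """Return a friendly message for a pending reason, or None if unknown."""
    return FRIENDLY_MESSAGES.get(reason.strip())


def pending_reason(job_id):
    """Ask squeue for the reason of one job (-h: no header, %r: reason)."""
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%r"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def report(job_id):
    """Hypothetical helper: print a friendly message for a queued job."""
    message = explain_reason(pending_reason(job_id))
    if message is not None:
        print(f"Job {job_id}: {message}")
```

A site wrapper around sbatch could call report() with the job ID returned by 
`sbatch --parsable`, but this still polls the controller from the client 
side, which is what we would prefer to avoid.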

The Job Launch Guide (https://slurm.schedmd.com/job_launch.html) lists the 
following steps:

1. Call job_submit plugins to modify the request as appropriate

2. Validate that the options are valid for this user (e.g. valid partition 
name, valid limits, etc.)

3. Determine if this job is the highest priority runnable job, if so then 
really try to allocate resources for it now, otherwise only validate that it 
could run if no other jobs existed

4. Determine which nodes could be used for the job. If the feature 
specification uses an exclusive OR option, then multiple iterations of the 
selection process below will be required with disjoint sets of nodes

5. Call the select plugin to select the best resources for the request

6. The select plugin will consider network topology and the topology within a 
node (e.g. sockets, cores, and threads) to select the best resources for the job

7. If the job can not be initiated using available resources and preemption 
support is configured, the select plugin will also determine if the job can be 
initiated after preempting lower priority jobs. If so then initiate preemption 
as needed to start the job.

From my understanding, achieving our goal would require access to the source 
code or a plugin related to point 2 (and part of point 3). Unfortunately, the 
job_submit (lua) plugin from point 1 (and the cli_filter plugin as well) 
cannot be used, because it only has access to the parameters of the submitted 
job and the SLURM partitions (but not to the QoS/account usage and limits).

Is there any way to extend the customization of job submission to include such 
features?

Best regards,
Sebastian
--
dr inż. Sebastian Sitkiewicz 

Politechnika Wrocławska 
Wrocławskie Centrum Sieciowo-Superkomputerowe
Dział Usług Obliczeniowych
Wyb. Wyspiańskiego 27
50-370 Wrocław 
www.wcss.pl

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
