On Sep 21, 2023, at 9:46 AM, David <dr...@umich.edu> wrote:

Slurm is working as it should. Your own examples show that: when you do not 
submit to b4, the job works. However, looking at man sbatch:

       -p, --partition=<partition_names>
              Request a specific partition for the resource allocation.  If
              not specified, the default behavior is to allow the slurm
              controller to select the default partition as designated by
              the system administrator.  If the job can use more than one
              partition, specify their names in a comma separate list and
              the one offering earliest initiation will be used with no
              regard given to the partition name ordering (although higher
              priority partitions will be considered first).  When the job
              is initiated, the name of the partition used will be placed
              first in the job record partition string.

In your example, the job can NOT use more than one partition (given the 
restrictions defined on the partition itself precluding certain accounts from 
using it). This, to me, seems either like a user education issue (i.e. don't 
have them submit to every partition), or you can try the job submit lua route - 
or perhaps the hidden partition route (which I've not tested).

That's not at all how I interpreted this man page description.  By "If the job 
can use more than..." I thought it was completely obvious (although perhaps 
wrong, if your interpretation is correct, but it never crossed my mind) that it 
referred to whether the _submitting user_ is OK with it using more than one 
partition. The partition where the user is forbidden (because of the 
partition's allowed accounts) should simply never be the one offering the 
earliest initiation (because the job will never initiate there), so the job 
should not run there but should still be able to run on the other partitions 
listed in the batch script.

I think it's completely counter-intuitive that a job submitted as OK to run 
on any of a few partitions won't run at all just because one of those 
partitions happens to be forbidden to the submitting user.  What if you list 
multiple partitions and increase the number of nodes so that one of the 
partitions no longer has enough, without realizing the problem?  Would you 
expect that to prevent the job from ever running on any partition?
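
To make the two readings concrete, here is a toy model in Python. It is only a 
sketch of the two interpretations discussed above, not Slurm's actual 
scheduling code. The Partition class, the pick_lenient and pick_strict 
functions, the account names, the node counts, and the estimated start times 
are all made up for illustration; only the partition name b4 comes from this 
thread.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    allowed_accounts: set   # accounts permitted to run here (hypothetical)
    total_nodes: int        # nodes configured in the partition (hypothetical)
    est_start: float        # made-up estimated wait, in minutes


def usable(p, account, nodes):
    """True if a job from this account, of this size, could ever start on p."""
    return account in p.allowed_accounts and nodes <= p.total_nodes


def pick_lenient(partitions, account, nodes):
    """The reading I expected: skip partitions the job can never use,
    then take the earliest start among the remaining ones."""
    candidates = [p for p in partitions if usable(p, account, nodes)]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.est_start).name


def pick_strict(partitions, account, nodes):
    """The behavior reported in this thread: one unusable partition in the
    list is enough to stop the job from running anywhere."""
    if not all(usable(p, account, nodes) for p in partitions):
        return None
    return min(partitions, key=lambda p: p.est_start).name


partitions = [
    Partition("b1", {"acct_a", "acct_b"}, 16, 30.0),
    Partition("b4", {"acct_b"}, 64, 5.0),
]

print(pick_lenient(partitions, "acct_a", 4))  # b1
print(pick_strict(partitions, "acct_a", 4))   # None
```

Under the lenient reading, acct_a simply lands on b1 because b4 is never a 
candidate. Under the strict reading the same submission goes nowhere, and the 
same thing happens to acct_b if it asks for 32 nodes, since b1 only has 16 in 
this made-up setup.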

Noam
