[slurm-users] Errors after removing partition

2019-07-26 Thread Brian Andrus
All,

I have a cloud-based cluster using Slurm 19.05.0-1.
I removed one of the partitions, but now every time I start slurmctld I get
some errors:

slurmctld[63042]: error: Invalid partition (mpi-h44rs) for JobId=52545
slurmctld[63042]: error: _find_node_record(756): lookup failure for
mpi-h44rs-01
slurmctld[63042]: error: node_name2bitmap: invalid node specified
mpi-h44rs-01
.
.
slurmctld[63042]: error: _find_node_record(756): lookup failure for
mpi-h44rs-05
slurmctld[63042]: error: node_name2bitmap: invalid node specified
mpi-h44rs-05
slurmctld[63042]: error: Invalid nodes (mpi-h44rs-[01-05]) for JobId=52545

I suspect this is in the saved state directory, and if I were to down the
entire cluster and delete those files it would clear up, but I would prefer
not to have to down the cluster...
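
For reference, by "saved state directory" I mean whatever StateSaveLocation
points at in slurm.conf; a quick way to confirm it (the path in the output is
only illustrative):

# where slurmctld keeps job_state, node_state, part_state, etc.
$ scontrol show config | grep -i StateSaveLocation
StateSaveLocation       = /var/spool/slurmctld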

Is there a way to clean up "phantom" nodes and partitions that were deleted?

Brian Andrus


Re: [slurm-users] Errors after removing partition

2019-07-26 Thread Jeffrey Frey
If you check the source code (src/slurmctld/job_mgr.c), this error is indeed
thrown when slurmctld unpacks job state files.  Tracing through
read_slurm_conf() -> load_all_job_state() -> _load_job_state():


part_ptr = find_part_record (partition);
if (part_ptr == NULL) {
    char *err_part = NULL;
    part_ptr_list = get_part_list(partition, &err_part);
    if (part_ptr_list) {
        part_ptr = list_peek(part_ptr_list);
        if (list_count(part_ptr_list) == 1)
            FREE_NULL_LIST(part_ptr_list);
    } else {
        verbose("Invalid partition (%s) for JobId=%u",
            err_part, job_id);
        xfree(err_part);
        /* not fatal error, partition could have been
         * removed, reset_job_bitmaps() will clean-up
         * this job */
    }
}


The comment after the error implies that this is not really a problem, and that 
it occurs specifically when a partition has been removed.
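
If you want to see what actually happens to the job named in those messages
once slurmctld finishes starting up, something like this should show it (a
sketch, assuming accounting is enabled and using the JobId from Brian's log):

$ sacct -j 52545 --format=JobID,JobName,Partition,State,ExitCode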




::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::






Re: [slurm-users] Errors after removing partition

2019-07-26 Thread Jodie H. Sprouse
fyi… Joe is there now staining front entrance & fixing a few minor touchups, 
nailing baseboard in basement…
Lock box is on the house now w/ key in it…




Re: [slurm-users] [Long] Why are tasks started on a 30 second clock?

2019-07-26 Thread Kirill Katsnelson
On Thu, Jul 25, 2019 at 10:20 PM Benjamin Redling <
benjamin.ra...@uni-jena.de> wrote:

> If the 30s delay is only for jobs after the first full queue then it is
> backfill in action?
>

I'm certain this is not the backfill. I see the same behavior when I boot
the controller with all nodes in idle+power-save, and then submit an array.
From the logs, each array job is assigned to a node immediately, the node
is told to power up, and all backfill debug messages since then say "no
jobs to backfill". All nodes are in alloc+powering-up state, all jobs of
the array are CF and have the same timestamp in squeue. But when the nodes
boot and come knocking at the controller, the symmetry is broken and the
jobs transition from CF to R in these curious bunches 30s apart.


> bf_interval=#
>

Incidentally, it's set to 5 in my configuration. But thanks for the idea; I'll
search for all the "30"s I can find in all the docs. :-)
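
For anyone following along, bf_interval is one of the SchedulerParameters in
slurm.conf; a minimal illustrative check (the config path is an assumption for
a typical install):

$ grep -i SchedulerParameters /etc/slurm/slurm.conf
SchedulerParameters=bf_interval=5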

 -kkm


Re: [slurm-users] Errors after removing partition

2019-07-26 Thread Chris Samuel

On 26/7/19 8:28 am, Jeffrey Frey wrote:

If you check the source code (src/slurmctld/job_mgr.c), this error is
indeed thrown when slurmctld unpacks job state files.  Tracing through
read_slurm_conf() -> load_all_job_state() -> _load_job_state():


I don't think that's the actual error that Brian is seeing, as that's 
just a "verbose()" message (as are another 3 of the 5 instances of 
this).  The only one that's actually an error is this one:


https://github.com/SchedMD/slurm/blob/slurm-19.05/src/slurmctld/job_mgr.c#L11002

in this function:

 * reset_job_bitmaps - reestablish bitmaps for existing jobs.
 *  this should be called after rebuilding node information,
 *  but before using any job entries.

It looks like it should mark these jobs as failed; is that the case, Brian?

Brian: when you removed the partition, did you restart slurmctld or just
do an scontrol reconfigure?
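
The distinction, roughly, is this (a sketch; the systemd unit name is an
assumption for a typical install):

$ scontrol reconfigure           # ask the running slurmctld to re-read slurm.conf
$ systemctl restart slurmctld    # full restart; saved state is reloaded from StateSaveLocation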


BTW that check was introduced in 2003 by Moe :-)

https://github.com/SchedMD/slurm/commit/1c7ee080a48aa6338d3fc5480523017d4287dc08

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA