[slurm-users] Errors after removing partition
All,

I have a cloud-based cluster running Slurm 19.05.0-1. I removed one of the partitions, but now every time I start slurmctld I get some errors:

slurmctld[63042]: error: Invalid partition (mpi-h44rs) for JobId=52545
slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-01
slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-01
.
.
slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-05
slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-05
slurmctld[63042]: error: Invalid nodes (mpi-h44rs-[01-05]) for JobId=52545

I suspect this is coming from the saved state directory, and that if I were to bring down the entire cluster and delete those files, it would clear up, but I would prefer not to have to down the cluster...

Is there a way to clean up "phantom" nodes and partitions that were deleted?

Brian Andrus
Re: [slurm-users] Errors after removing partition
If you check the source code (src/slurmctld/job_mgr.c), this error is indeed thrown when slurmctld unpacks job state files. Tracing through read_slurm_conf() -> load_all_job_state() -> _load_job_state():

    part_ptr = find_part_record (partition);
    if (part_ptr == NULL) {
            char *err_part = NULL;
            part_ptr_list = get_part_list(partition, &err_part);
            if (part_ptr_list) {
                    part_ptr = list_peek(part_ptr_list);
                    if (list_count(part_ptr_list) == 1)
                            FREE_NULL_LIST(part_ptr_list);
            } else {
                    verbose("Invalid partition (%s) for JobId=%u",
                            err_part, job_id);
                    xfree(err_part);
                    /* not fatal error, partition could have been
                     * removed, reset_job_bitmaps() will clean-up
                     * this job */
            }
    }

The comment after the error message implies that this is not really a problem, and that it occurs specifically when a partition has been removed.

> On Jul 26, 2019, at 11:15 AM, Brian Andrus wrote:
>
> I have a cloud based cluster using slurm 19.05.0-1. I removed one of the
> partitions, but now every time I start slurmctld I get some errors:
> [...]
> Is there a way to clean up "phantom" nodes and partitions that were deleted?

::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034
Mobile: (302) 419-4976
::
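To make the control flow in that snippet easier to follow outside the slurm source tree, here is a minimal, self-contained sketch of the same "look up, otherwise log and defer cleanup" pattern. All names here (part_rec, job_rec, find_part, load_job_state) are hypothetical and are not the slurm API; the sketch only mirrors the idea that a missing partition at state-load time is logged and left for a later pass rather than treated as fatal.

/* Toy illustration (NOT slurm code) of the pattern in _load_job_state():
 * a job restored from the state files may reference a partition that no
 * longer exists in slurm.conf.  The loader does not fail; it leaves the
 * job's partition pointer NULL so a later sweep (analogous to
 * reset_job_bitmaps()) can deal with the job. */
#include <stdio.h>
#include <string.h>

struct part_rec {
	const char *name;
};

struct job_rec {
	unsigned int job_id;
	const char *part_name;      /* partition name saved in the state file */
	struct part_rec *part_ptr;  /* NULL if the partition has vanished */
};

/* Current partitions, i.e. what slurm.conf defines right now. */
static struct part_rec parts[] = { { "batch" }, { "debug" } };

static struct part_rec *find_part(const char *name)
{
	for (size_t i = 0; i < sizeof(parts) / sizeof(parts[0]); i++)
		if (strcmp(parts[i].name, name) == 0)
			return &parts[i];
	return NULL;
}

static void load_job_state(struct job_rec *job)
{
	job->part_ptr = find_part(job->part_name);
	if (job->part_ptr == NULL) {
		/* Not fatal: the partition could have been removed.
		 * A later cleanup pass will handle this job. */
		printf("Invalid partition (%s) for JobId=%u\n",
		       job->part_name, job->job_id);
	}
}

int main(void)
{
	/* One job on a surviving partition, one on a removed partition. */
	struct job_rec jobs[] = {
		{ 52544, "batch",     NULL },
		{ 52545, "mpi-h44rs", NULL },
	};

	for (size_t i = 0; i < 2; i++)
		load_job_state(&jobs[i]);
	return 0;
}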
Re: [slurm-users] [Long] Why are tasks started on a 30 second clock?
On Thu, Jul 25, 2019 at 10:20 PM Benjamin Redling <benjamin.ra...@uni-jena.de> wrote:

> If the 30s delay is only for jobs after the first full queue, then it is
> backfill in action?

I'm certain this is not the backfill. I see the same behavior when I boot the controller with all nodes in idle+power-save and then submit an array. From the logs, each array job is assigned to a node immediately, the node is told to power up, and every backfill debug message after that says "no jobs to backfill". All nodes are in the alloc+powering-up state, and all jobs of the array are CF with the same timestamp in squeue. But when the nodes boot and come knocking to the controller, the symmetry is broken and the jobs transition from CF to R in these curious bunches 30s apart.

> bf_interval=#

Incidentally, that is set to 5 in my configuration. But thanks for the idea; I'll search for all the "30"s I can find in the docs. :-)

-kkm
Re: [slurm-users] Errors after removing partition
On 26/7/19 8:28 am, Jeffrey Frey wrote:

> If you check the source code (src/slurmctld/job_mgr.c), this error is indeed
> thrown when slurmctld unpacks job state files. Tracing through
> read_slurm_conf() -> load_all_job_state() -> _load_job_state():

I don't think that's the actual error that Brian is seeing, as that's just a verbose() message (as are another 3 of the 5 instances of this). The only one that's actually an error is this one:

https://github.com/SchedMD/slurm/blob/slurm-19.05/src/slurmctld/job_mgr.c#L11002

in this function:

 * reset_job_bitmaps - reestablish bitmaps for existing jobs.
 *	this should be called after rebuilding node information,
 *	but before using any job entries.

It looks like it should mark these jobs as failed; is that the case, Brian?

Brian: when you removed the partition, did you restart slurmctld or just do an "scontrol reconfigure"?

BTW, that check was introduced in 2003 by Moe :-)

https://github.com/SchedMD/slurm/commit/1c7ee080a48aa6338d3fc5480523017d4287dc08

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
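For completeness, here is a hedged, self-contained sketch of the kind of post-load sweep Chris is describing: after the node and partition tables have been rebuilt, walk the job list and fail any job whose saved partition can no longer be resolved. The structures, the reset_job_records() name, and the JOB_FAILED handling are illustrative assumptions only; the authoritative logic lives in reset_job_bitmaps() at the link above.

/* Illustrative sketch (NOT slurm source) of a reset_job_bitmaps()-style
 * sweep: after reloading the configuration, any job that still points at
 * a partition which no longer exists is marked failed rather than left
 * dangling with unresolvable resources. */
#include <stdio.h>

enum job_state { JOB_PENDING, JOB_RUNNING, JOB_FAILED };

struct part_rec {
	const char *name;
};

struct job_rec {
	unsigned int job_id;
	const char *part_name;
	struct part_rec *part_ptr;  /* NULL => partition was removed */
	enum job_state state;
};

static void reset_job_records(struct job_rec *jobs, size_t njobs)
{
	for (size_t i = 0; i < njobs; i++) {
		if (jobs[i].part_ptr == NULL) {
			/* Partition vanished between restarts: report it as
			 * an error and fail the job. */
			fprintf(stderr,
				"error: Invalid partition (%s) for JobId=%u\n",
				jobs[i].part_name, jobs[i].job_id);
			jobs[i].state = JOB_FAILED;
		}
	}
}

int main(void)
{
	struct part_rec batch = { "batch" };
	struct job_rec jobs[] = {
		{ 52544, "batch",     &batch, JOB_PENDING },
		{ 52545, "mpi-h44rs", NULL,   JOB_PENDING },
	};

	reset_job_records(jobs, 2);

	for (size_t i = 0; i < 2; i++)
		printf("JobId=%u state=%d\n", jobs[i].job_id, jobs[i].state);
	return 0;
}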