Hello all! I figured I'd share my experience with a recent upgrade from 15.08.4 to 16.05.10-2, which did manifest some oddities but was otherwise a "textbook" upgrade.
We seem to have run into symptoms similar to bugs #2319 [https://bugs.schedmd.com/show_bug.cgi?id=2319] and #2002 [https://bugs.schedmd.com/show_bug.cgi?id=2002] (mentioned purely for reference; no nodes were down). I don't remember the exact count, but I believe fewer than 20 jobs were affected, and both the controller and slurmd log snippets are mostly verbatim from the bugs listed above:

# controller
[2017-07-27T23:37:37.527] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-07-27T23:37:37.537] error: slurm_receive_msg [10.250.2.166:51260]: Zero Bytes were transmitted or received
[2017-07-27T23:37:37.537] error: invalid type trying to be freed 65534
[2017-07-27T23:37:40.210] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-07-27T23:37:40.220] error: slurm_receive_msg [10.250.2.146:46050]: Zero Bytes were transmitted or received

# slurmd
[2017-07-27T23:39:11.391] [9861275] cannot create auth context for auth/munge
[2017-07-27T23:39:11.391] [9861275] /usr/lib64/slurm/auth_munge.so: Incompatible Slurm plugin version (16.5.10)
[2017-07-27T23:39:11.391] [9861275] Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2017-07-27T23:39:11.391] [9861275] cannot create auth context for auth/munge
[2017-07-27T23:39:11.391] [9861275] authentication: authentication initialization failure
[2017-07-27T23:39:11.391] [9861275] Retrying job complete RPC for 9861275.4294967294

Rather than wait for the timeout to occur on the compute nodes, I opted to HUP each "stuck" slurmstepd process. Once this was done, no more errors were logged in either the controller logs or the slurmd logs.
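
For anyone who would rather script the HUP step than do it by hand, here is a minimal Python sketch. It is not the procedure used above, only an illustration under stated assumptions: a Linux /proc filesystem, and step daemons that report "slurmstepd" in /proc/<pid>/comm. The function names find_stepd_pids and hup, and the dry-run default, are hypothetical choices made for this example. The sketch also does not decide which processes are actually "stuck"; that still has to be judged from the logs.

```python
import os
import signal


def find_stepd_pids(proc_root="/proc", name="slurmstepd"):
    """Return sorted PIDs whose comm file equals `name`.

    Assumption: the step daemon shows up as "slurmstepd" in
    /proc/<pid>/comm. Verify this on your own nodes first.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return sorted(pids)


def hup(pids, dry_run=True):
    """Send SIGHUP to each PID and return the PIDs handled.

    With dry_run=True (the default) no signal is sent, so the list
    can be reviewed before anything is touched.
    """
    handled = []
    for pid in pids:
        if not dry_run:
            os.kill(pid, signal.SIGHUP)
        handled.append(pid)
    return handled
```

A cautious way to use it on a compute node is to call find_stepd_pids(), compare the result against the job IDs that appear in the slurmd log, and only then call hup() with dry_run=False on the PIDs that belong to the affected jobs.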
The only cause I can think of is this: during the upgrade on a separate cluster (also 15.08.4 to 16.05.10-2), the textbook procedure was followed, but slurmdbd failed to start after it was upgraded:

[2017-07-12T14:14:51.971] /usr/lib64/slurm/auth_munge.so: Incompatible Slurm plugin version (15.8.4)
[2017-07-12T14:14:51.971] error: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2017-07-12T14:14:51.971] error: cannot create auth context for auth/munge
[2017-07-12T14:14:51.971] fatal: Unable to initialize auth/munge authentication plugin
[2017-07-12T14:15:17.850] adding column max_jobs_pa after grace_time in table qos_table
[2017-07-12T14:15:17.850] adding column max_submit_jobs_pa after max_jobs_per_user in table qos_table
[2017-07-12T14:15:17.850] adding column max_tres_pa after max_submit_jobs_per_user in table qos_table
[2017-07-12T14:15:17.850] adding column max_tres_run_mins_pa after max_tres_mins_pj in table qos_table
[2017-07-12T14:15:18.700] Accounting storage MYSQL plugin loaded
[2017-07-12T14:15:19.460] slurmdbd version 16.05.10-2 started

I had to upgrade the munge plugin before slurmdbd would start. Based on that experience, I went ahead and upgraded the munge plugin at the same time as slurmdbd for this upgrade. So, perhaps this was a self-inflicted mishap?

I figured it would be best to at least get this posted to the list so that others are aware, and to note that SLURM is still stable!

Thanks,
John DeSantis
