Hello all!

I figured I'd share my experience with a recent upgrade from 15.08.4 to
16.05.10-2, which did manifest some oddities but was otherwise a "textbook"
upgrade.

We seem to have run into symptoms similar to bugs #2319
[https://bugs.schedmd.com/show_bug.cgi?id=2319] and #2002
[https://bugs.schedmd.com/show_bug.cgi?id=2002] (mentioned purely for
reference; no nodes were down).

I don't remember the exact count, but I believe fewer than 20 jobs were
affected. Both the controller and slurmd log snippets are nearly verbatim
matches for those in the bugs listed above:

# controller
[2017-07-27T23:37:37.527] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-07-27T23:37:37.537] error: slurm_receive_msg [10.250.2.166:51260]: Zero Bytes were transmitted or received
[2017-07-27T23:37:37.537] error: invalid type trying to be freed 65534
[2017-07-27T23:37:40.210] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-07-27T23:37:40.220] error: slurm_receive_msg [10.250.2.146:46050]: Zero Bytes were transmitted or received

# slurmd
[2017-07-27T23:39:11.391] [9861275] cannot create auth context for auth/munge
[2017-07-27T23:39:11.391] [9861275] /usr/lib64/slurm/auth_munge.so: Incompatible Slurm plugin version (16.5.10)
[2017-07-27T23:39:11.391] [9861275] Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2017-07-27T23:39:11.391] [9861275] cannot create auth context for auth/munge
[2017-07-27T23:39:11.391] [9861275] authentication: authentication initialization failure
[2017-07-27T23:39:11.391] [9861275] Retrying job complete RPC for 9861275.4294967294

Rather than wait for the timeout to occur on the compute nodes, I opted to
HUP each "stuck" slurmstepd process. Once this was done, no more errors were
logged in either the controller logs or the slurmd logs.
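
For anyone who wants to count the affected jobs on their own nodes, here is
a minimal Python sketch. It assumes the slurmd log contains the "Retrying job
complete RPC" message exactly as in the snippet above, with one log entry per
line; the function name stuck_job_ids is made up for illustration. It only
reports job IDs and does not signal any process.

```python
import re

# Matches slurmd entries such as:
# [2017-07-27T23:39:11.391] [9861275] Retrying job complete RPC for 9861275.4294967294
# Assumption: the message format is the one seen in the log snippet above.
RETRY_RE = re.compile(r"Retrying job complete RPC for (\d+)\.(\d+)")


def stuck_job_ids(log_lines):
    """Return sorted, de-duplicated job IDs that are retrying the job complete RPC."""
    job_ids = set()
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match:
            job_ids.add(int(match.group(1)))
    return sorted(job_ids)


sample = [
    "[2017-07-27T23:39:11.391] [9861275] cannot create auth context for auth/munge",
    "[2017-07-27T23:39:11.391] [9861275] Retrying job complete RPC for 9861275.4294967294",
]
print(stuck_job_ids(sample))
```

In practice the lines would come from the slurmd log file on each compute
node rather than from an inline list.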

The only possible cause I can think of relates to the upgrade of a separate
cluster (also 15.08.4 to 16.05.10-2). The textbook procedure was followed
there, but after slurmdbd was upgraded, it failed to start:

[2017-07-12T14:14:51.971] /usr/lib64/slurm/auth_munge.so: Incompatible Slurm plugin version (15.8.4)
[2017-07-12T14:14:51.971] error: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
[2017-07-12T14:14:51.971] error: cannot create auth context for auth/munge
[2017-07-12T14:14:51.971] fatal: Unable to initialize auth/munge authentication plugin
[2017-07-12T14:15:17.850] adding column max_jobs_pa after grace_time in table qos_table
[2017-07-12T14:15:17.850] adding column max_submit_jobs_pa after max_jobs_per_user in table qos_table
[2017-07-12T14:15:17.850] adding column max_tres_pa after max_submit_jobs_per_user in table qos_table
[2017-07-12T14:15:17.850] adding column max_tres_run_mins_pa after max_tres_mins_pj in table qos_table
[2017-07-12T14:15:18.700] Accounting storage MYSQL plugin loaded
[2017-07-12T14:15:19.460] slurmdbd version 16.05.10-2 started

I had to upgrade the munge plugin before slurmdbd would start. Based on that
experience, I went ahead and upgraded the munge plugin along with slurmdbd
during this upgrade. So, perhaps this was a self-inflicted mishap?
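
If it helps anyone check for the same mismatch, here is a rough Python sketch
that scans log lines for the "Incompatible Slurm plugin version" message and
reports which plugin file and version were rejected. It assumes the message
format seen in my logs, with one entry per line; the function name
incompatible_plugins is hypothetical.

```python
import re

# Matches entries such as:
# [2017-07-12T14:14:51.971] /usr/lib64/slurm/auth_munge.so: Incompatible Slurm plugin version (15.8.4)
# Assumption: the message format is the one seen in the logs in this post.
INCOMPAT_RE = re.compile(
    r"(\S+\.so): Incompatible Slurm plugin version \(([\d.]+)\)"
)


def incompatible_plugins(log_lines):
    """Return unique (plugin_path, plugin_version) pairs in first-seen order."""
    found = []
    for line in log_lines:
        match = INCOMPAT_RE.search(line)
        if match:
            pair = (match.group(1), match.group(2))
            if pair not in found:
                found.append(pair)
    return found


sample = [
    "[2017-07-12T14:14:51.971] /usr/lib64/slurm/auth_munge.so: "
    "Incompatible Slurm plugin version (15.8.4)",
    "[2017-07-12T14:14:51.971] error: cannot create auth context for auth/munge",
]
print(incompatible_plugins(sample))
```

A reported version that differs from the daemon's own version suggests the
plugin and the daemon packages were not upgraded together.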

I figured it would be best to at least post this to the list so that others
are aware of it, and aware that SLURM is still stable!

Thanks,
John DeSantis
