Will this fix be in 14.11.3?
(The system is in the customer's hands, so my ability to test it is
limited.)
Andy
On 12/21/2014 10:59 AM, Artem Polyakov
wrote:
Re: [slurm-dev] slurmstepd: mpi/pmi2: invalid kvs seq from
srun
Hello, Andy!
I think this is the race condition that
I found and fixed a few weeks ago:
https://github.com/SchedMD/slurm/commit/bc190aee6f517eefe11c01f455fe4fde73ba8323
There was a detailed discussion of it in
Bugzilla:
http://bugs.schedmd.com/show_bug.cgi?id=1302
Here is the image that demonstrates the
case:
http://bugs.schedmd.com/attachment.cgi?id=1490
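To make the failure mode concrete, here is a minimal sketch in plain Python (not Slurm's actual code; `StepdKvs` and the wire contents are illustrative only) of how a strict sequence check rejects a duplicated or retransmitted message, producing exactly this kind of "expect 3 got 2" complaint:

```python
from collections import deque

class StepdKvs:
    """Toy receiver that insists each message carries the next sequence number."""
    def __init__(self):
        self.expect = 1
        self.errors = []

    def handle(self, seq):
        # A stale duplicate (or out-of-order resend) fails the strict check.
        if seq != self.expect:
            self.errors.append(f"invalid kvs seq: expect {self.expect} got {seq}")
            return
        self.expect += 1

# srun sends fences 1, 2, 3, but fence 2 is retransmitted (e.g. because
# its acknowledgement was delayed) and the duplicate arrives as well.
wire = deque([1, 2, 2, 3])
stepd = StepdKvs()
while wire:
    stepd.handle(wire.popleft())

print(stepd.errors)  # ['invalid kvs seq: expect 3 got 2']
```

The duplicate is rejected with the familiar message, while the genuine fence 3 is still processed normally, which matches the observation that later steps run fine.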
Could you port/try this patch?
2014-12-21 21:45 GMT+06:00 Andy Riebs
<andy.ri...@hp.com>:
We are sporadically seeing messages such as these when
running on more than 1000 nodes:
slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3
got 2
We have seen this with both SHMEM and BUPC jobs, though
only in perhaps 3 or 4 runs out of a hundred. A
typical job might look like
salloc -N1280
srun foo
srun foo
srun foo
On those rare occasions where one of the first two steps
fails, the remaining steps will run just fine, so we know
it's not a question of which nodes are used at each step.
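Since a failed step is rare and a rerun succeeds, a crude interim workaround could be a retry wrapper around each step. A hedged sketch (untested on the cluster; `srun ./foo` stands in for the real step):

```python
import subprocess
import sys

def run_step(cmd, retries=2):
    """Run a job step, retrying up to `retries` extra times on nonzero exit."""
    for attempt in range(retries + 1):
        rc = subprocess.call(cmd)
        if rc == 0:
            return 0
        print(f"step failed (rc={rc}), attempt {attempt + 1}", file=sys.stderr)
    return rc

# e.g., inside the allocation:
#   run_step(["srun", "./foo"])
```

This only papers over the symptom, of course, and assumes the failing step exits nonzero rather than hanging.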
The environment:
* RHEL 6.5
* Slurm 14.11.1
* SHMEM provided by OpenMPI 1.8.4rc1
* Berkeley UPC 2.18.0, built on OpenMPI
The only thing unusual in slurm.conf is MpiDefault=pmi2
(which is probably obvious from the messages).
Any ideas?
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
--
Best regards, Artem Y. Polyakov