Will this fix be in 14.11.3?
 
 (The system is in the customer's hands, so my ability to test it is
 limited.)
 
 Andy
 
 On 12/21/2014 10:59 AM, Artem Polyakov wrote:
   Re: [slurm-dev] slurmstepd: mpi/pmi2: invalid kvs seq from srun
   
   Hello, Andy!
     I think this is the race condition that
       I found and fixed a few weeks ago:
     
https://github.com/SchedMD/slurm/commit/bc190aee6f517eefe11c01f455fe4fde73ba8323
     There was a detailed discussion of it in
       Bugzilla:
     http://bugs.schedmd.com/show_bug.cgi?id=1302
     
     Here is the image that demonstrates the case:
     http://bugs.schedmd.com/attachment.cgi?id=1490
     Could you port/try this patch?
     
       2014-12-21 21:45 GMT+06:00 Andy Riebs
         <andy.ri...@hp.com>:
         
           We are sporadically seeing messages such as these when
           running on more than 1000 nodes:
           
           slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3
           got 2
           
           or
           
           slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3
           got 2
           
           We have seen this with both SHMEM and BUPC jobs, but we
           see it in perhaps 3 or 4 runs out of a hundred. A
           typical job might look like
           
           salloc -N1280
               srun foo
               srun foo
               srun foo
           
           On those rare occasions where one of the first two steps
           fails, the remaining steps will run just fine, so we know
           it's not a question of which nodes are used at each step.
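
Since the failure is sporadic, one way to put a firmer number on "3 or 4 runs out of a hundred" might be to scan the collected job output for the message quoted above. The following is only a minimal sketch: the function name count_kvs_seq_errors and the sample text are hypothetical, and the pattern is built solely from the message as it appears in this thread, so the real log layout may differ.

```python
import re
from collections import Counter

# Pattern derived from the message quoted in this thread; the layout of the
# surrounding log output is an assumption.
KVS_SEQ_RE = re.compile(
    r"slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect (\d+)\s+got (\d+)"
)


def count_kvs_seq_errors(log_text):
    """Return a Counter mapping (expected, got) pairs to occurrence counts."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    normalized = " ".join(log_text.split())
    return Counter(
        (int(expected), int(got))
        for expected, got in KVS_SEQ_RE.findall(normalized)
    )


# Hypothetical sample output: one wrapped and one unwrapped occurrence.
sample = """\
srun: job step started
slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3
got 2
slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3 got 2
"""

errors = count_kvs_seq_errors(sample)
print(sum(errors.values()), dict(errors))
```

Running this over the saved output of each job step and dividing the number of affected steps by the total number of steps would give a rough failure rate to compare before and after the patch is applied.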
           
           The environment:
           * RHEL 6.5
           * Slurm 14.11.1
           * SHMEM provided by OpenMPI 1.8.4rc1
           * Berkeley UPC 2.18.0, built on OpenMPI
           
           The only thing unusual in slurm.conf is MpiDefault=pmi2
           (which is probably obvious from the messages).
           Any ideas?
               Andy
               
               -- 
               Andy Riebs
               Hewlett-Packard Company
               High Performance Computing
               +1 404 648 9024
               My opinions are not necessarily those of HP
       -- 
       С Уважением, Поляков Артем
         Юрьевич
         Best regards, Artem Y. Polyakov
