I stand corrected. It is only in the master branch, to be released in version 15.08.

Quoting je...@schedmd.com:

Yes, and we'll probably release it in the next week or two.

Quoting Andy Riebs <andy.ri...@hp.com>:

Will this fix be in 14.11.3?

(The system is in the customer's hands, so my ability to test it is
limited.)

Andy

On 12/21/2014 10:59 AM, Artem Polyakov wrote:
  Re: [slurm-dev] slurmstepd: mpi/pmi2: invalid kvs seq from srun

  Hello, Andy!
    I think this is the race condition that I found and fixed a few weeks ago:
    https://github.com/SchedMD/slurm/commit/bc190aee6f517eefe11c01f455fe4fde73ba8323
    There was a detailed discussion of it in Bugzilla:
    http://bugs.schedmd.com/show_bug.cgi?id=1302

    Here is the image that demonstrates the case:
    http://bugs.schedmd.com/attachment.cgi?id=1490
    Could you port/try this patch?

      2014-12-21 21:45 GMT+06:00 Andy Riebs <andy.ri...@hp.com>:

          We are sporadically seeing messages such as these when
          running on more than 1000 nodes:

          slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3 got 2

          or

          slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3 got 2

          We have seen this with both SHMEM and BUPC jobs, but only
          in perhaps 3 or 4 runs out of a hundred. A typical job
          might look like

          salloc -N1280
              srun foo
              srun foo
              srun foo

          On those rare occasions where one of the first two steps
          fails, the remaining steps will run just fine, so we know
          it's not a question of which nodes are used at each step.
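
To tally how often the failure occurs across runs, the step output could be scanned for the message quoted above. This is a minimal Python sketch, assuming the message format is exactly as shown in this thread; the function name find_kvs_seq_errors is hypothetical and is not part of Slurm.

```python
import re

# Pattern built from the message text quoted in this thread.
# \s+ tolerates the message being wrapped across lines.
KVS_SEQ_RE = re.compile(
    r"slurmstepd: mpi/pmi2: invalid kvs seq from srun,"
    r"\s+expect\s+(\d+)\s+got\s+(\d+)"
)


def find_kvs_seq_errors(text):
    """Return one (expected, got) tuple per matching message in text.

    Hypothetical helper for illustration only.
    """
    return [(int(exp), int(got)) for exp, got in KVS_SEQ_RE.findall(text)]
```

Running it over the captured output of each srun step and counting the non-empty results would give a failure rate comparable to the "3 or 4 runs out of a hundred" estimate.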

          The environment:
          * RHEL 6.5
          * Slurm 14.11.1
          * SHMEM provided by OpenMPI 1.8.4rc1
          * Berkeley UPC 2.18.0, built on OpenMPI

          The only thing unusual in slurm.conf is MpiDefault=pmi2
          (which is probably obvious from the messages).
          Any ideas?
              Andy

              --
              Andy Riebs
              Hewlett-Packard Company
              High Performance Computing
              +1 404 648 9024
              My opinions are not necessarily those of HP
      --
      С Уважением, Поляков Артем
        Юрьевич
        Best regards, Artem Y. Polyakov


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


