Thanks for the suggestion. We are using an NFSRoot OS image on all the nodes, so all the nodes have to be running the same version of OMPI.

On 4/27/20 10:58 AM, Riebs, Andy wrote:

Y’know, a quick check on versions and PATHs might be a good idea here. I suggest something like

$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice Bisbal via users
Sent: Monday, April 27, 2020 10:25 AM
To: users@lists.open-mpi.org
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened a ticket with Slurm support to see if it's a problem on Slurm's end.
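
(One hedged way to double-check which PMIx library Slurm's plugin actually linked against is to run ldd on the mpi_pmix plugin; the plugin path below is only an example and depends on the install prefix:)

$ ldd /usr/lib64/slurm/mpi_pmix_v3.so | grep -i pmix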

Prentice

On 4/26/20 2:12 PM, Ralph Castain via users wrote:

    It is entirely possible that the PMI2 support in OMPI v4 is broken
    - I doubt it is used or tested very much as pretty much everyone
    has moved to PMIx. In fact, we completely dropped PMI-1 and PMI-2
    from OMPI v5 for that reason.

    I would suggest building Slurm with PMIx v3.1.5
    (https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that
    is what OMPI v4 is using, and launching with "srun --mpi=pmix_v3"
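
    As a minimal sketch of that, assuming a from-source Slurm build
    (the prefixes and the test binary below are placeholders, not
    anything from this thread):

    $ ./configure --with-pmix=/opt/pmix/3.1.5 --prefix=/opt/slurm
    $ make && make install
    $ srun --mpi=pmix_v3 -N3 ./hello_mpi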



        On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users
        <users@lists.open-mpi.org> wrote:

        I also have this problem on servers I'm benchmarking at Dell's
        lab with OpenMPI 4.0.3. I tried a new build of OpenMPI with
        "--with-pmi2". No change.
        In the end, my workaround in the Slurm script was to launch my
        code with mpirun. Since mpirun was only finding one slot per
        node, I used "--oversubscribe --bind-to core" and checked that
        every process was bound to a separate core. It worked, but
        don't ask me why :-)
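
        For reference, a sketch of that kind of batch script (the node
        count, tasks per node, and binary name are placeholders):

        #!/bin/bash
        #SBATCH -N 3
        #SBATCH --ntasks-per-node=32
        mpirun --oversubscribe --bind-to core ./my_code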

        Patrick

        On 24/04/2020 at 20:28, Riebs, Andy via users wrote:

            Prentice, have you tried something trivial, like "srun -N3
            hostname", to rule out non-OMPI problems?

            Andy

            -----Original Message-----
            From: users [mailto:users-boun...@lists.open-mpi.org] On
            Behalf Of Prentice Bisbal via users
            Sent: Friday, April 24, 2020 2:19 PM
            To: Ralph Castain <r...@open-mpi.org>; Open MPI Users
            <users@lists.open-mpi.org>
            Cc: Prentice Bisbal <pbis...@pppl.gov>
            Subject: Re: [OMPI users] [External] Re: Can't start jobs
            with srun.

            Okay. I've got Slurm built with pmix support:

            $ srun --mpi=list
            srun: MPI types are...
            srun: none
            srun: pmix_v3
            srun: pmi2
            srun: openmpi
            srun: pmix

            But now when I try to launch a job with srun, the job
            appears to run but doesn't do anything - it just hangs in
            the running state. Any ideas what could be wrong, or how
            to debug this?
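
            One hedged way to get more detail on a hang like this is
            to raise srun's verbosity while pinning the MPI plugin
            explicitly (the binary name here is just a placeholder):

            $ srun -vvv --mpi=pmix_v3 -N3 ./hello_mpi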

            I'm also asking around on the Slurm mailing list.

            Prentice

            On 4/23/20 3:03 PM, Ralph Castain wrote:

                You can trust the --mpi=list. The problem is likely
                that OMPI wasn't configured --with-pmi2



                    On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via
                    users <users@lists.open-mpi.org> wrote:

                    --mpi=list shows pmi2 and openmpi as valid values,
                    but if I set --mpi= to either of them, my job
                    still fails. Why is that? Can I not trust the
                    output of --mpi=list?

                    Prentice

                    On 4/23/20 10:43 AM, Ralph Castain via users wrote:

                        No, but you do have to explicitly build OMPI
                        with non-PMIx support if that is what you are
                        going to use. In this case, you need to
                        configure OMPI
                        --with-pmi2=<path-to-the-pmi2-installation>

                        You can leave off the path (i.e., just
                        "--with-pmi2") if Slurm was installed in a
                        standard location, as we should find it there.
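
                        As a rough sketch of that configure step,
                        assuming Slurm's PMI-2 headers and library are
                        under /usr (the paths and prefix below are
                        placeholders):

                        $ ./configure --with-pmi2=/usr --with-slurm \
                              --prefix=/opt/openmpi-4.0.3
                        $ make && make install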



                            On Apr 23, 2020, at 7:39 AM, Prentice
                            Bisbal via users <users@lists.open-mpi.org>
                            wrote:

                            It looks like it was built with PMI2, but
                            not PMIx:

                            $ srun --mpi=list
                            srun: MPI types are...
                            srun: none
                            srun: pmi2
                            srun: openmpi

                            I did launch the job with srun --mpi=pmi2 ....

                            Does OpenMPI 4 need PMIx specifically?


                            On 4/23/20 10:23 AM, Ralph Castain via
                            users wrote:

                                Is Slurm built with PMIx support? Did
                                you tell srun to use it?



                                    On Apr 23, 2020, at 7:00 AM,
                                    Prentice Bisbal via users
                                    <users@lists.open-mpi.org> wrote:

                                    I'm using OpenMPI 4.0.3 with Slurm
                                    19.05.5. I'm testing the software
                                    with a very simple hello, world
                                    MPI program that I've used
                                    reliably for years. When I submit
                                    the job through Slurm and use srun
                                    to launch it, I get these errors:

                                    *** An error occurred in MPI_Init
                                    *** on a NULL communicator
                                    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                                    ***    and potentially your MPI job)
                                    [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
                                    *** An error occurred in MPI_Init
                                    *** on a NULL communicator
                                    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                                    ***    and potentially your MPI job)
                                    [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

                                    If I run the same job, but use
                                    mpiexec or mpirun instead of srun,
                                    the jobs run just fine. I checked
                                    ompi_info to make sure OpenMPI was
                                    compiled with  Slurm support:

                                    $ ompi_info | grep slurm
                                      Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
                                                     MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
                                                     MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                                                     MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                                                  MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

                                    Any ideas what could be wrong? Do
                                    you need any additional information?

                                    Prentice

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
