Good - thanks!

> On Jan 18, 2019, at 3:25 PM, Michael Di Domenico <mdidomeni...@gmail.com>
> wrote:
>
> seems to be better now. jobs are running
>
> On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>
>> I have pushed a fix to the v2.2 branch - could you please confirm it?
>>
>>
>>> On Jan 18, 2019, at 2:23 PM, Ralph H Castain <r...@open-mpi.org> wrote:
>>>
>>> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm
>>> plugin folks seem to be off somewhere for awhile and haven’t been testing
>>> it. Sigh.
>>>
>>> I’ll patch the branch and let you know - we’d appreciate the feedback.
>>> Ralph
>>>
>>>
>>>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico <mdidomeni...@gmail.com>
>>>> wrote:
>>>>
>>>> here's the branches i'm using. i did a git clone on the repo's and
>>>> then a git checkout
>>>>
>>>> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
>>>> [ec2-user@labhead pmix]$ git branch
>>>>   master
>>>> * v2.2
>>>> [ec2-user@labhead pmix]$ cd ../slurm/
>>>> [ec2-user@labhead slurm]$ git branch
>>>> * (detached from origin/slurm-18.08)
>>>>   master
>>>> [ec2-user@labhead slurm]$ cd ../ompi/
>>>> [ec2-user@labhead ompi]$ git branch
>>>> * (detached from origin/v3.1.x)
>>>>   master
>>>>
>>>>
>>>> attached is the debug out from the run with the debugging turned on
>>>>
>>>> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Looks strange. I’m pretty sure Mellanox didn’t implement the event
>>>>> notification system in the Slurm plugin, but you should only be trying to
>>>>> call it if OMPI is registering a system-level event code - which OMPI 3.1
>>>>> definitely doesn’t do.
>>>>>
>>>>> If you are using PMIx v2.2.0, then please note that there is a bug in it
>>>>> that slipped through our automated testing. I replaced it today with
>>>>> v2.2.1 - you probably should update if that’s the case. However, that
>>>>> wouldn’t necessarily explain this behavior. I’m not that familiar with
>>>>> the Slurm plugin, but you might try adding
>>>>>
>>>>> PMIX_MCA_pmix_client_event_verbose=5
>>>>> PMIX_MCA_pmix_server_event_verbose=5
>>>>> OMPI_MCA_pmix_base_verbose=10
>>>>>
>>>>> to your environment and see if that provides anything useful.
>>>>>
>>>>>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico
>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>
>>>>>> i compilied pmix slurm openmpi
>>>>>>
>>>>>> ---pmix
>>>>>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>>>>>> --disable-debug
>>>>>> ---slurm
>>>>>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>>>>>> --with-pmix=/hpc/pmix/2.2
>>>>>> ---openmpi
>>>>>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>>>>>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
>>>>>> --with-pmix=/hpc/pmix/2.2
>>>>>>
>>>>>> everything seemed to compile fine, but when i do an srun i get the
>>>>>> below errors, however, if i salloc and then mpirun it seems to work
>>>>>> fine. i'm not quite sure where the breakdown is or how to debug it
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>>>>>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>>> environment
>>>>>> problems. This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>>
>>>>>>   ompi_interlib_declare
>>>>>>   --> Returned "Would block" (-10) instead of "Success" (0)
>>>>>> ...snipped...
>>>>>> [labcmp6:18355] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18355] *** reported by process [140726281390153,15]
>>>>>> [labcmp6:18355] *** on a NULL communicator
>>>>>> [labcmp6:18355] *** Unknown error
>>>>>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18355] *** and potentially your MPI job)
>>>>>> [labcmp6:18352] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18352] *** reported by process [1677936713,12]
>>>>>> [labcmp6:18352] *** on a NULL communicator
>>>>>> [labcmp6:18352] *** Unknown error
>>>>>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18352] *** and potentially your MPI job)
>>>>>> [labcmp6:18354] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18354] *** reported by process [140726281390153,14]
>>>>>> [labcmp6:18354] *** on a NULL communicator
>>>>>> [labcmp6:18354] *** Unknown error
>>>>>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18354] *** and potentially your MPI job)
>>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>>> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT
>>>>>> 2019-01-18T20:03:33 ***
>>>>>> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>>> environment
>>>>>> problems. This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>>
>>>>>>   ompi_interlib_declare
>>>>>>   --> Returned "Would block" (-10) instead of "Success" (0)
>>>>>> --------------------------------------------------------------------------
>>>>>> [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> srun: error: labcmp6: tasks 12-15: Exited with exit code 1
>>>>>> srun: error: labcmp3: tasks 0-3: Killed
>>>>>> srun: error: labcmp4: tasks 4-7: Killed
>>>>>> srun: error: labcmp5: tasks 8-11: Killed
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users@lists.open-mpi.org
>>>>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>>
>>>> <out.1547849064.gz>