Good - thanks!

> On Jan 18, 2019, at 3:25 PM, Michael Di Domenico <mdidomeni...@gmail.com> 
> wrote:
> 
> seems to be better now.  jobs are running
> 
> On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain <r...@open-mpi.org> wrote:
>> 
>> I have pushed a fix to the v2.2 branch - could you please confirm it?
>> 
>> 
>>> On Jan 18, 2019, at 2:23 PM, Ralph H Castain <r...@open-mpi.org> wrote:
>>> 
>>> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
>>> plugin folks seem to be off somewhere for a while and haven’t been testing 
>>> it. Sigh.
>>> 
>>> I’ll patch the branch and let you know - we’d appreciate the feedback.
>>> Ralph
>>> 
>>> 
>>>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico <mdidomeni...@gmail.com> 
>>>> wrote:
>>>> 
>>>> here are the branches i'm using.  i did a git clone on the repos and
>>>> then a git checkout
>>>> 
>>>> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
>>>> [ec2-user@labhead pmix]$ git branch
>>>>   master
>>>> * v2.2
>>>> [ec2-user@labhead pmix]$ cd ../slurm/
>>>> [ec2-user@labhead slurm]$ git branch
>>>> * (detached from origin/slurm-18.08)
>>>>   master
>>>> [ec2-user@labhead slurm]$ cd ../ompi/
>>>> [ec2-user@labhead ompi]$ git branch
>>>> * (detached from origin/v3.1.x)
>>>>   master
>>>> 
>>>> 
>>>> attached is the debug output from the run with the debugging turned on
>>>> 
>>>> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
>>>>> notification system in the Slurm plugin, but you should only be trying to 
>>>>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
>>>>> definitely doesn’t do.
>>>>> 
>>>>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
>>>>> that slipped through our automated testing. I replaced it today with 
>>>>> v2.2.1 - you probably should update if that’s the case. However, that 
>>>>> wouldn’t necessarily explain this behavior. I’m not that familiar with 
>>>>> the Slurm plugin, but you might try adding
>>>>> 
>>>>> PMIX_MCA_pmix_client_event_verbose=5
>>>>> PMIX_MCA_pmix_server_event_verbose=5
>>>>> OMPI_MCA_pmix_base_verbose=10
>>>>> 
>>>>> to your environment and see if that provides anything useful.
>>>>> 
>>>>>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>> 
>>>>>> i compiled pmix, slurm, and openmpi
>>>>>> 
>>>>>> ---pmix
>>>>>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 \
>>>>>>     --disable-debug
>>>>>> ---slurm
>>>>>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 \
>>>>>>     --with-pmix=/hpc/pmix/2.2
>>>>>> ---openmpi
>>>>>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external \
>>>>>>     --with-libevent=external --with-slurm=/hpc/slurm/18.08 \
>>>>>>     --with-pmix=/hpc/pmix/2.2
>>>>>> 
>>>>>> everything seemed to compile fine, but when i do an srun i get the
>>>>>> errors below.  however, if i salloc and then mpirun, it seems to work
>>>>>> fine.  i'm not quite sure where the breakdown is or how to debug it
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>>>>>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>> 
>>>>>> ompi_interlib_declare
>>>>>> --> Returned "Would block" (-10) instead of "Success" (0)
>>>>>> ...snipped...
>>>>>> [labcmp6:18355] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18355] *** reported by process [140726281390153,15]
>>>>>> [labcmp6:18355] *** on a NULL communicator
>>>>>> [labcmp6:18355] *** Unknown error
>>>>>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18355] ***    and potentially your MPI job)
>>>>>> [labcmp6:18352] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18352] *** reported by process [1677936713,12]
>>>>>> [labcmp6:18352] *** on a NULL communicator
>>>>>> [labcmp6:18352] *** Unknown error
>>>>>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18352] ***    and potentially your MPI job)
>>>>>> [labcmp6:18354] *** An error occurred in MPI_Init
>>>>>> [labcmp6:18354] *** reported by process [140726281390153,14]
>>>>>> [labcmp6:18354] *** on a NULL communicator
>>>>>> [labcmp6:18354] *** Unknown error
>>>>>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [labcmp6:18354] ***    and potentially your MPI job)
>>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>>> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
>>>>>> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>> 
>>>>>> ompi_interlib_declare
>>>>>> --> Returned "Would block" (-10) instead of "Success" (0)
>>>>>> --------------------------------------------------------------------------
>>>>>> [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
>>>>>> event/pmix_event_registration.c at line 101
>>>>>> srun: error: labcmp6: tasks 12-15: Exited with exit code 1
>>>>>> srun: error: labcmp3: tasks 0-3: Killed
>>>>>> srun: error: labcmp4: tasks 4-7: Killed
>>>>>> srun: error: labcmp5: tasks 8-11: Killed
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users@lists.open-mpi.org
>>>>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>>> 
>>>> <out.1547849064.gz>
>>> 
>> 
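
For anyone hitting the same NOT-SUPPORTED error, here is a minimal sketch of how the three debug variables Ralph suggested in the thread could be set before repeating the failing launch. The srun line is the one from the original report. The log file name is made up for illustration, and I'm assuming srun forwards the caller's environment to the tasks (its default behavior), so treat this as a starting point rather than a verified recipe.

```shell
#!/bin/sh
# Debug settings quoted in the thread: PMIx client/server event
# verbosity plus Open MPI's pmix framework verbosity.
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10

# Hypothetical log file name (not from the thread); the epoch
# timestamp keeps successive runs from overwriting each other.
LOG="srun-debug.$(date +%s).log"

if command -v srun >/dev/null 2>&1; then
    # Same launch line as in the original report; stdout and stderr
    # are both captured, since the verbose output goes to stderr.
    srun --mpi=pmix_v2 -n 16 xhpl >"$LOG" 2>&1
    echo "srun exited with status $?; output in $LOG"
else
    # Nothing to launch on a machine without Slurm.
    echo "srun not found; variables are set but nothing was launched"
fi
```

The captured log can then be compressed and attached to a reply on the list, as was done with the debug output earlier in the thread.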
