Thanks, Ralph!

Your code finishes normally, so I guess the reason lies in R. Running the R code with -mca pmix_base_verbose 1, I see that each rank calls ext2x:client disconnect twice (each PID prints the line twice).
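For reference, this is the same invocation as in my original mail below, just with the verbose flag added, roughly:

SLURM_NTASKS=... mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll -mca pmix_base_verbose 1 R --slave < mk.R

The relevant part of the output: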

[...]
    3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
Try running the attached example dynamic code - if that works, then it likely 
is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> 
wrote:

Hi,

I have some problems running R + Rmpi with Open MPI 3.1.0 + PMIx 2.1.1. A simple R script, which starts a few tasks, hangs at the end on disconnect. Here is the script:

library(parallel)
# one worker per SLURM task, minus one for the master R process
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
# spawns the workers dynamically through Rmpi (MPI_Comm_spawn)
myCluster <- makeCluster(numWorkers, type = "MPI")
# shuts the workers down again; this is where the script hangs
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R --slave < mk.R

Note the -np 1: this is apparently how Rmpi jobs are started, with the ranks spawned dynamically by R from inside the script. I ran into a number of issues here:

1. With HPCX it seems that dynamic spawning of ranks is not supported, so I had to turn off all of yalla/mxm/hcoll:

--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

   MPI function: MPI_Comm_spawn
   Reason:       the Yalla (MXM) PML does not support MPI dynamic process functionality
--------------------------------------------------------------------------

2. When I do that, the program does create a 'cluster' and starts the ranks, but then hangs in PMIx at MPI_Comm_disconnect. Here is the top of the backtrace from gdb:

#0  0x00007f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at client/pmix_client_connect.c:232
#2  0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at ext2x_client.c:1432
#3  0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
#4  0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at pcomm_disconnect.c:67
#5  0x00007f66a16799e9 in mpi_comm_disconnect () from /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x00007f66b2563de5 in do_dotcall () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x00007f66b25a207b in bcEval () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x00007f66b25b2c62 in R_execClosure () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
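(A backtrace like this can be obtained by attaching gdb to the hung master R process, e.g. gdb -p <pid> followed by bt.)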

Might this also be related to the dynamic rank creation in R?
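In case it helps to take snow/parallel out of the picture, a stripped-down script using Rmpi directly should exercise the same spawn/disconnect path (function names taken from the Rmpi documentation; I have not verified whether this minimal version hangs in the same way):

library(Rmpi)
# spawn a few R workers dynamically; this goes through MPI_Comm_spawn
mpi.spawn.Rslaves(nslaves = 4)
# shut the workers down again; Rmpi should end up in MPI_Comm_disconnect here,
# the same call the backtrace above is stuck in
mpi.close.Rslaves()
mpi.quit()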

Thanks!

Marcin


