[OMPI users] MPI_Comm_spawn

2016-09-29 Thread juraj2...@gmail.com
Hello,

I am using MPI_Comm_spawn to dynamically create new processes from single
manager process. Everything works fine when all the processes are running
on the same node. But imposing restriction to run only a single process per
node does not work. Below are the errors produced during multinode
interactive session and multinode sbatch job.

The system I am using is: Linux version 3.10.0-229.el7.x86_64 (
buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat
4.8.2-16) (GCC) )
I am using Open MPI 2.0.1
Slurm is version 15.08.9

What is preventing my jobs from spawning on multiple nodes? Does Slurm require
some additional configuration to allow it? Or is it an issue on the MPI side:
does Open MPI need to be compiled with some special flag (I have compiled it
with --enable-mpi-fortran=all --with-pmi)?

The code I am launching is here: https://github.com/goghino/dynamicMPI
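For reference, a minimal manager that spawns workers with MPI_Comm_spawn looks roughly like the sketch below. This is illustrative only, not the repository code; the worker binary name and the token message are assumptions:

```c
/* manager.c -- minimal MPI_Comm_spawn sketch (illustrative, not the
 * actual repository code). Compile with: mpicc manager.c -o manager */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int nworkers = (argc > 1) ? atoi(argv[1]) : 1;
    int *errcodes = malloc(nworkers * sizeof(int));
    MPI_Comm intercomm;  /* intercommunicator connecting manager and workers */

    /* Spawn nworkers copies of ./worker; with MPI_INFO_NULL the runtime
     * (and therefore the hostfile / scheduler allocation) decides where
     * the children are placed. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    /* Talk to the children over the intercommunicator, e.g. send a token. */
    for (int i = 0; i < nworkers; i++) {
        int token = 42;
        MPI_Send(&token, 1, MPI_INT, i, 0, intercomm);
    }

    free(errcodes);
    MPI_Finalize();
    return 0;
}
```

Note that the spawned children count against the slots available in the allocation, which is why the placement policy of the hostfile/scheduler matters here.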

The manager tries to launch one new process (./manager 1). This is the error
produced when each process is required to run on a different node (interactive
session):
$ salloc -N 2
$ cat my_hosts
icsnode37
icsnode38
$ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode37
icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
[icsnode37:12614] *** Process received signal ***
[icsnode37:12614] Signal: Aborted (6)
[icsnode37:12614] Signal code:  (-6)
[icsnode38:32443] *** Process received signal ***
[icsnode38:32443] Signal: Aborted (6)
[icsnode38:32443] Signal code:  (-6)

The same example as above via sbatch job submission:
$ cat job.sbatch
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
srun -n 1 -N 1 ./manager 1

$ cat output.o
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode39
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[icsnode39:9692] *** An error occurred in MPI_Comm_spawn
[icsnode39:9692] *** reported by process [1007812608,0]
[icsnode39:9692] *** on communicator MPI_COMM_SELF
[icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
[icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[icsnode39:9692] ***and potentially your MPI job)
In: PMI_Abort(50, N/A)
slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20
***
srun: error: icsnode39: task 0: Exited with exit code 50

Thanks for any feedback!

Best regards,
Juraj
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] MPI_Comm_spawn

2016-09-29 Thread juraj2...@gmail.com
The solution was to use the "tcp", "sm", and "self" BTLs for the transport of
MPI messages, restricting TCP communication to the eth0 interface and using
ob1 as the point-to-point management layer (PML):

mpirun --mca btl_tcp_if_include eth0 --mca pml ob1 --mca btl tcp,sm,self
-np 1 --hostfile my_hosts ./manager 1
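For anyone who prefers not to pass these flags on every invocation, the same MCA settings can be made persistent via standard Open MPI mechanisms, either in an mca-params.conf file or as OMPI_MCA_* environment variables (a sketch; paths and values mirror the command line above):

```shell
# Equivalent persistent configuration, e.g. in $HOME/.openmpi/mca-params.conf:
#   btl = tcp,sm,self
#   pml = ob1
#   btl_tcp_if_include = eth0
#
# Or as environment variables exported before calling mpirun:
export OMPI_MCA_btl=tcp,sm,self
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl_tcp_if_include=eth0
mpirun -np 1 --hostfile my_hosts ./manager 1
```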

Thank you for your help!

[OMPI users] MPI + system() call + Matlab MEX crashes

2016-10-05 Thread juraj2...@gmail.com
Hello,

I have a C++ application (main.cpp) that is launched with multiple processes
via mpirun. The master process calls Matlab via system('matlab -nosplash
-nodisplay -nojvm -nodesktop -r "interface"'), which executes a simple script,
interface.m, that calls a mexFunction (mexsolve.cpp). From the mexFunction I
try to set up communication with the rest of the processes that were launched
at the beginning together with the master process. When I run the application
as listed below on two different machines, I experience:

1) a crash at MPI_Init() in the mexFunction() on a cluster machine running
Linux 4.4.0-22-generic

2) the error in MPI_Send() shown below on a local machine running
Linux 3.10.0-229.el7.x86_64
[archimedes:31962] shmem: mmap: an error occurred while determining whether
or not 
/tmp/openmpi-sessions-1007@archimedes_0/58444/1/shared_mem_pool.archimedes
could be created.
[archimedes:31962] create_and_attach: unable to create shared memory BTL
coordinating structure :: size 134217728
[archimedes:31962] shmem: mmap: an error occurred while determining whether
or not 
/tmp/openmpi-sessions-1007@archimedes_0/58444/1/0/vader_segment.archimedes.0
could be created.
[archimedes][[58444,1],0][../../../../../opal/mca/btl/tcp/
btl_tcp_endpoint.c:800:mca_btl_tcp_endpoint_complete_connect] connect() to
 failed: Connection refused (111)

I launch the application as follows:
mpirun --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 1  -np 2
-npernode 1 ./main

I have openmpi-2.0.1 configured with --prefix=${INSTALLDIR}
--enable-mpi-fortran=all --with-pmi --disable-dlopen

For more details, the code is here: https://github.com/goghino/matlabMpiC
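For context, the MEX side looks roughly like the sketch below (an illustrative sketch under assumptions, not the repository code). The key point is that the MEX library is loaded into the Matlab process started by system(), so MPI_Init() runs inside Matlab's address space rather than inside a process launched by mpirun, which is likely relevant to both failure modes:

```cpp
// mexsolve.cpp -- sketch of the MEX entry point (illustrative only).
// Built with something along the lines of: mex CXX=mpicxx mexsolve.cpp
#include "mex.h"
#include <mpi.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized) {
        // This MPI_Init executes inside the Matlab process spawned via
        // system(), not inside the mpirun-launched master -- the crash
        // site in case 1) above.
        MPI_Init(NULL, NULL);
    }

    // From here the code attempts to communicate with the processes
    // launched at startup by mpirun (details omitted; see repository).
}
```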

Thanks for any suggestions!

Juraj