Hi All on the list,

I'm trying to run dynamic MPI applications using MPI_Comm_spawn().
The application I'm using for tests is basically composed of a master,
which spawns a slave on each assigned node in a multithreaded fashion.
The master is started with a number of jobs to perform and a filename
containing the list of assigned nodes. The idea is to handle all the
dispatching logic within the application, so that the master tries to
keep each assigned node as busy as possible. For each spawned job, the
master allocates a thread for spawning and handling the communication,
then generates a random number and sends it to the slave, which simply
sends it back to the master. Finally the slave terminates its job and
its node becomes free for a new one. This continues until all the
requested jobs are done.
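
To make the dispatching idea above concrete, here is a minimal sketch of the
node bookkeeping only, with no MPI calls and no threads. It is not the actual
test program: the names node_ring, ring_init, ring_acquire and ring_release
and the MAX_NODES bound are assumptions made up for illustration. A node is
reserved before a job is spawned on it and released when its slave terminates.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64   /* hypothetical upper bound for this sketch */

/* Bookkeeping for the ring of assigned nodes (illustrative names). */
typedef struct {
    const char *host[MAX_NODES]; /* host names read from the node file */
    int busy[MAX_NODES];         /* 1 while a spawned job runs on the node */
    size_t count;                /* number of nodes in the ring */
    size_t next;                 /* index where the next search starts */
} node_ring;

/* Build the ring from an array of host names; all nodes start free. */
void ring_init(node_ring *r, const char *const *hosts, size_t n)
{
    assert(n <= MAX_NODES);
    r->count = n;
    r->next = 0;
    for (size_t i = 0; i < n; i++) {
        r->host[i] = hosts[i];
        r->busy[i] = 0;
    }
}

/* Reserve the next free node in round-robin order.
 * Returns its index, or -1 when every node is busy. */
int ring_acquire(node_ring *r)
{
    for (size_t step = 0; step < r->count; step++) {
        size_t i = (r->next + step) % r->count;
        if (!r->busy[i]) {
            r->busy[i] = 1;
            r->next = (i + 1) % r->count;
            return (int)i;
        }
    }
    return -1;
}

/* Mark a node free again once its slave has terminated. */
void ring_release(node_ring *r, int idx)
{
    assert(idx >= 0 && (size_t)idx < r->count && r->busy[idx]);
    r->busy[idx] = 0;
}
```

In a multithreaded master like the one described, every call into the ring
would additionally have to be protected by a mutex, and a job that gets -1
from ring_acquire would wait until some thread calls ring_release.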

The test program I'm using *doesn't* work flawlessly with mpich2 because
it has a limit of ~24k spawned jobs: its internal context id increases
monotonically, which eventually stops the application with an overflow
inside the library. The internal context ids allocated for the spawned
jobs are currently never recycled once those jobs terminate. The only
MPI-2 implementation (and therefore one supporting MPI_Comm_spawn())
that was able to complete the test is currently HP MPI. So now I would
like to check whether OpenMPI is suitable for our dynamic parallel
applications.
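
The exhaustion described above can be illustrated with a toy model. This is
not mpich2 code and says nothing about how mpich2 is implemented internally;
the id_pool type, its functions and the tiny ID_POOL_SIZE are all assumptions
chosen for illustration. It only shows why an id counter that is never
recycled caps the total number of spawn/terminate cycles, while recycling
removes that cap.

```c
#include <assert.h>

#define ID_POOL_SIZE 8  /* hypothetical and tiny on purpose */

/* Toy allocator handing out ids from a fixed-size pool. */
typedef struct {
    int next_unused;              /* monotonically increasing counter */
    int free_list[ID_POOL_SIZE];  /* ids returned by finished jobs */
    int free_count;               /* number of ids in free_list */
    int recycle;                  /* 0 = never reuse ids, 1 = reuse them */
} id_pool;

void pool_init(id_pool *p, int recycle)
{
    p->next_unused = 0;
    p->free_count = 0;
    p->recycle = recycle;
}

/* Returns an id, or -1 when the pool is exhausted. */
int pool_alloc(id_pool *p)
{
    if (p->recycle && p->free_count > 0)
        return p->free_list[--p->free_count];
    if (p->next_unused < ID_POOL_SIZE)
        return p->next_unused++;
    return -1;
}

/* Give an id back; without recycling it is simply lost. */
void pool_free(id_pool *p, int id)
{
    assert(id >= 0 && id < ID_POOL_SIZE);
    if (p->recycle && p->free_count < ID_POOL_SIZE)
        p->free_list[p->free_count++] = id;
}

/* Run spawn/terminate cycles one at a time; return how many succeeded. */
int run_jobs(id_pool *p, int jobs)
{
    int done = 0;
    for (int j = 0; j < jobs; j++) {
        int id = pool_alloc(p);
        if (id < 0)
            break;          /* no id left: the "job limit" is reached */
        pool_free(p, id);   /* job finished */
        done++;
    }
    return done;
}
```

With recycling disabled, run_jobs stops after ID_POOL_SIZE jobs no matter how
many are requested, even though only one job is ever alive at a time; with
recycling enabled, all requested jobs complete.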

The test application is linked against OpenMPI v1.3a1r19645, running on
Fedora 8 x86_64 with all updates.

My first attempt ends with the error below, and I basically don't know
where to look next. Note that I've already checked PATH and
LD_LIBRARY_PATH; the application is configured correctly, since it uses
two scripts for starting and all the paths are set there. Basically I
need to start *one* master application which will handle everything
needed to manage the slave applications. The communication is *only*
master <-> slave and never collective, at the moment.

The test program is available on request.

Does anyone have an idea what's going on?

Thanks in advance,
Roberto Fichera.

[roberto@cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np 1 testmaster 10000 $PBS_NODEFILE
Initializing MPI ...
Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
... adding node #1 host is 'cluster3.tekno-soft.it'
... adding node #2 host is 'cluster2.tekno-soft.it'
... adding node #3 host is 'cluster1.tekno-soft.it'
... adding node #4 host is 'master.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 10000 job across 4 nodes!!!
****************** Starting job #1
****************** Starting job #2
****************** Starting job #3
****************** Starting job #4
Setting up the host as 'cluster3.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
Setting up the host as 'cluster2.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
Setting up the host as 'cluster1.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
Setting up the host as 'master.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169
[cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in file base/plm_base_receive.c at line 169




