Roberto Fichera wrote:
> Hi All on the list,
>
> I'm trying to run dynamic MPI applications using MPI_Comm_spawn().
> The test application is basically composed of a master, which spawns a
> slave on each assigned node in a multithreaded fashion. The master is
> started with a number of jobs to perform and a filename containing the
> list of assigned nodes. The idea is to handle all the dispatching
> logic within the application, so that the master keeps each assigned
> node as busy as possible. For each job, the master allocates a thread
> that spawns the slave and handles the communication, then generates a
> random number and sends it to the slave, which simply sends it back to
> the master. Finally, the slave terminates its job and the corresponding
> node becomes free for a new one. This continues until all the requested
> jobs are done.
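>
> For completeness, the per-job logic on the master side looks roughly
> like the sketch below. This is a simplified sketch, not the exact test
> code: the thread routine and the argument struct are placeholders, and
> it assumes the master initialized MPI with MPI_THREAD_MULTIPLE. The
> "host" and "wdir" MPI_Info keys are the standard reserved keys used to
> place the slave on a node and set its working directory.
>
> /* Simplified per-job worker thread on the master side (placeholder
>  * names: spawn_one_job, struct job_args). */
> #include <mpi.h>
> #include <pthread.h>
> #include <stdlib.h>
>
> struct job_args {
>     const char *host;   /* node taken from the $PBS_NODEFILE list */
>     const char *wdir;   /* working directory for the slave        */
> };
>
> static void *spawn_one_job(void *arg)
> {
>     struct job_args *job = (struct job_args *)arg;
>     MPI_Comm slave;     /* intercommunicator to the spawned slave */
>     MPI_Info info;
>     int sent, echoed;
>
>     /* Place the slave on the assigned node, in the right directory. */
>     MPI_Info_create(&info);
>     MPI_Info_set(info, "host", (char *)job->host);
>     MPI_Info_set(info, "wdir", (char *)job->wdir);
>
>     MPI_Comm_spawn("./testslave.sh", MPI_ARGV_NULL, 1, info, 0,
>                    MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
>     MPI_Info_free(&info);
>
>     /* Master <-> slave point-to-point only: send a random number and
>      * read it back. */
>     sent = rand();
>     MPI_Send(&sent, 1, MPI_INT, 0, 0, slave);
>     MPI_Recv(&echoed, 1, MPI_INT, 0, 0, slave, MPI_STATUS_IGNORE);
>
>     /* Drop the intercommunicator so the node becomes free again. */
>     MPI_Comm_disconnect(&slave);
>     return NULL;
> }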
>
> The test program *doesn't* work flawlessly in MPICH2 because of a ~24k
> spawned-job limitation: the library's internal context id increases
> monotonically and eventually overflows, which stops the application.
> The context ids allocated for terminated spawned jobs are never
> recycled at the moment. The only MPI-2 implementation (i.e. one
> supporting MPI_Comm_spawn()) that has been able to complete the test
> so far is HP MPI. So now I would like to check whether Open MPI is
> suitable for our dynamic parallel applications.
>
> The test application is linked against Open MPI v1.3a1r19645, running
> on Fedora 8 x86_64 + all updates.
>
> My first attempt ended with the error below, and I basically don't know
> where to look further. Note that I've already checked PATH and
> LD_LIBRARY_PATH; the application should be configured correctly since
> it is started through two wrapper scripts and all the paths are set
> there. Basically I need to start *one* master application which will
> handle everything related to managing the slave applications. The
> communication is *only* master <-> slave and never collective, at the
> moment.
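>
> The slave started through './testslave.sh' is just as simple; it looks
> more or less like this (again a simplified sketch, not the exact code):
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Comm parent;   /* intercommunicator created by MPI_Comm_spawn */
>     int value;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parent);
>
>     /* Echo back the single number received from the master. */
>     MPI_Recv(&value, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
>     MPI_Send(&value, 1, MPI_INT, 0, 0, parent);
>
>     MPI_Comm_disconnect(&parent);
>     MPI_Finalize();
>     return 0;
> }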
>
> The test program is available on request.
>
> Does anyone have an idea what's going on?
>
> Thanks in advance,
> Roberto Fichera.
>
> [roberto@cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np 1 testmaster 10000 $PBS_NODEFILE
> Initializing MPI ...
> Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
> ... adding node #1 host is 'cluster3.tekno-soft.it'
> ... adding node #2 host is 'cluster2.tekno-soft.it'
> ... adding node #3 host is 'cluster1.tekno-soft.it'
> ... adding node #4 host is 'master.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 10000 job across 4
> nodes!!!
> ****************** Starting job #1
> ****************** Starting job #2
> ****************** Starting job #3
> ****************** Starting job #4
> Setting up the host as 'cluster3.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
> Setting up the host as 'cluster2.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
> Setting up the host as 'cluster1.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
> Setting up the host as 'master.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169
>   
Just to say that I made a little progress: now it seems that everything
starts, but mpirun doesn't find the executable.

[roberto@cluster4 TestOpenMPI]$ mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE
Daemon was launched on cluster3.tekno-soft.it - beginning to initialize
Daemon was launched on cluster2.tekno-soft.it - beginning to initialize
Daemon was launched on cluster1.tekno-soft.it - beginning to initialize
Daemon [[14600,0],2] checking in as pid 28732 on host cluster2.tekno-soft.it
Daemon [[14600,0],2] not using static ports
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: up and running -
waiting for commands!
Daemon [[14600,0],3] checking in as pid 2590 on host cluster1.tekno-soft.it
Daemon [[14600,0],3] not using static ports
[cluster1.tekno-soft.it:02590] [[14600,0],3] orted: up and running -
waiting for commands!
Daemon [[14600,0],1] checking in as pid 6969 on host cluster3.tekno-soft.it
Daemon [[14600,0],1] not using static ports
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted: up and running -
waiting for commands!
Daemon was launched on master.tekno-soft.it - beginning to initialize
Daemon [[14600,0],4] checking in as pid 1113 on host master.tekno-soft.it
Daemon [[14600,0],4] not using static ports
[master.tekno-soft.it:01113] [[14600,0],4] orted: up and running -
waiting for commands!
[cluster4.tekno-soft.it:07953] [[14600,0],0] orted_cmd: received
add_local_procs
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[0].name cluster4
daemon 0 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[1].name cluster3
daemon 1 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[2].name cluster2
daemon 2 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[3].name cluster1
daemon 3 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[4].name master daemon
4 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted_cmd: received
add_local_procs
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received
add_local_procs
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received
add_local_procs
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[0].name cluster4
daemon 0 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[1].name cluster3
daemon 1 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[2].name cluster2
daemon 2 [cluster2.tekno-soft.it:28732] [[14600,0],2] node[0].name
cluster4 daemon 0 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[1].name cluster3
daemon 1 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[2].name cluster2
daemon 2 [master.tekno-soft.it:01113] [[14600,0],4] node[0].name
cluster4 daemon 0 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[1].name cluster3 daemon
1 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[2].name cluster2 daemon
2 arch farch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[3].name cluster1
daemon 3 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[4].name master daemon
4 arch ffc91200
arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[3].name cluster1
daemon 3 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[4].name master daemon
4 arch ffc91200
fc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[3].name cluster1 daemon
3 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[4].name master daemon 4
arch ffc91200
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not
find an executable:

Executable: 1
Node: cluster4.tekno-soft.it

while attempting to start process rank 0.
--------------------------------------------------------------------------
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received exit
[master.tekno-soft.it:01113] [[14600,0],4] orted: finalizing
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received exit
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: finalizing
[master:01113] *** Process received signal ***
[cluster2:28732] *** Process received signal ***
[cluster2:28732] Signal: Segmentation fault (11)
[cluster2:28732] Signal code: Address not mapped (1)
[cluster2:28732] Failing at address: 0x2aaaab784af0
[master:01113] Signal: Segmentation fault (11)
[master:01113] Signal code: Address not mapped (1)
[master:01113] Failing at address: 0x2aaaab786af0
mpirun: abort is already in progress...hit ctrl-c again to forcibly
terminate

[cluster1.tekno-soft.it:02590] [[14600,0],3] routed:binomial: Connection
to lifeline [[14600,0],0] lost
[roberto@cluster4 TestOpenMPI]$     
