Hi Ralph,
Thanks for the fast reply! I turned on the --display-allocation and
--display-map flags, and the node allocation looks fine, but the job
still hangs.
The output looks like this:
/home/kaisong/test
node0001
node0001
node0000
node0000
Starting parallel job
====================== ALLOCATED NODES ======================
Data for node: Name: node0001 Num slots: 2 Max slots: 0
Data for node: Name: node0000 Num slots: 2 Max slots: 0
=================================================================
======================== JOB MAP ========================
Data for node: Name: node0001 Num procs: 2
Process OMPI jobid: [16591,1] Process rank: 0
Process OMPI jobid: [16591,1] Process rank: 1
Data for node: Name: node0000 Num procs: 2
Process OMPI jobid: [16591,1] Process rank: 2
Process OMPI jobid: [16591,1] Process rank: 3
=============================================================
(No hello world output; the job just hangs here until it times out.)
A similar message appears in the error output:
node0000 - daemon did not report back when launched
Then I ran the job manually, adding the "-mca btl gm" flag to mpirun:
/home/software/ompi/1.3.2-pgi/bin/mpirun -mca btl gm --display-allocation --display-map -v -machinefile ./node -np 4 ./hello-hostname
MPI crashed with the following output/error:
====================== ALLOCATED NODES ======================
Data for node: Name: hbar.lbl.gov Num slots: 0 Max slots: 0
Data for node: Name: node0045 Num slots: 4 Max slots: 0
Data for node: Name: node0046 Num slots: 4 Max slots: 0
Data for node: Name: node0047 Num slots: 4 Max slots: 0
Data for node: Name: node0048 Num slots: 4 Max slots: 0
=================================================================
======================== JOB MAP ========================
Data for node: Name: node0045 Num procs: 4
Process OMPI jobid: [62741,1] Process rank: 0
Process OMPI jobid: [62741,1] Process rank: 1
Process OMPI jobid: [62741,1] Process rank: 2
Process OMPI jobid: [62741,1] Process rank: 3
=============================================================
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[62741,1],1]) is on host: node0045
Process 2 ([[62741,1],1]) is on host: node0045
BTLs attempted: gm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0045:366] Abort before MPI_INIT completed successfully; not able
to guarantee that all other process
!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0045:367] Abort before MPI_INIT completed successfully; not able
to guarantee that all other process
!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0045:368] Abort before MPI_INIT completed successfully; not able
to guarantee that all other process
!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0045:365] Abort before MPI_INIT completed successfully; not able
to guarantee that all other process
!
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 368 on
node node0045 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[hbar.lbl.gov:07770] 3 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[hbar.lbl.gov:07770] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[hbar.lbl.gov:07770] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal
However, it works if I add "tcp" and "self" to the -mca btl list:
/home/software/ompi/1.3.2-pgi/bin/mpirun -mca btl gm,tcp,self --display-allocation --display-map -v -machinefile ./node -np 16 ./hello-hostname
====================== ALLOCATED NODES ======================
Data for node: Name: node0045 Num slots: 4 Max slots: 0
Data for node: Name: node0046 Num slots: 4 Max slots: 0
Data for node: Name: node0047 Num slots: 4 Max slots: 0
Data for node: Name: node0048 Num slots: 4 Max slots: 0
=================================================================
======================== JOB MAP ========================
Data for node: Name: node0045 Num procs: 4
Process OMPI jobid: [49981,1] Process rank: 0
Process OMPI jobid: [49981,1] Process rank: 1
Process OMPI jobid: [49981,1] Process rank: 2
Process OMPI jobid: [49981,1] Process rank: 3
Data for node: Name: node0046 Num procs: 4
Process OMPI jobid: [49981,1] Process rank: 4
Process OMPI jobid: [49981,1] Process rank: 5
Process OMPI jobid: [49981,1] Process rank: 6
Process OMPI jobid: [49981,1] Process rank: 7
Data for node: Name: node0047 Num procs: 4
Process OMPI jobid: [49981,1] Process rank: 8
Process OMPI jobid: [49981,1] Process rank: 9
Process OMPI jobid: [49981,1] Process rank: 10
Process OMPI jobid: [49981,1] Process rank: 11
Data for node: Name: node0048 Num procs: 4
Process OMPI jobid: [49981,1] Process rank: 12
Process OMPI jobid: [49981,1] Process rank: 13
Process OMPI jobid: [49981,1] Process rank: 14
Process OMPI jobid: [49981,1] Process rank: 15
=============================================================
Hello world from process 13 of 16
Hostname: node0048
Hello world from process 15 of 16
Hostname: node0048
Hello world from process 12 of 16
Hostname: node0048
Hello world from process 3 of 16
Hostname: node0045
Hello world from process 6 of 16
Hostname: node0046
Hello world from process 8 of 16
Hostname: node0047
Hello world from process 0 of 16
Hostname: node0045
Hello world from process 4 of 16
Hostname: node0046
Hello world from process 2 of 16
Hostname: node0045
Hello world from process 5 of 16
Hostname: node0046
Hello world from process 9 of 16
Hostname: node0047
Hello world from process 10 of 16
Hostname: node0047
Hello world from process 11 of 16
Hostname: node0047
Hello world from process 14 of 16
Hostname: node0048
Hello world from process 1 of 16
Hostname: node0045
Hello world from process 7 of 16
Hostname: node0046
So I suspect the problem is not in how the -machinefile flag is
parsed. Somehow the nodes do not communicate when the "-mca btl gm"
option is on. Do you think it is a compatibility problem with the
Myrinet driver?
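For reference, the error text above points at a missing "self" BTL, which handles a process sending to itself. A minimal sketch of a helper that builds the -mca btl arguments and always appends "self" might look like the following. The function name btl_arg is hypothetical and not part of Open MPI; only the "-mca btl <comma-separated list>" syntax is taken from the commands in this message. Whether "gm,self" alone is sufficient on this cluster has not been tested here.

```python
def btl_arg(transports):
    """Return mpirun arguments selecting the given BTLs, always including 'self'.

    Hypothetical helper for illustration; it only assembles a command line.
    """
    names = [t.strip() for t in transports if t.strip()]
    if "self" not in names:
        # The loopback BTL must be listed explicitly when -mca btl is given.
        names.append("self")
    return ["-mca", "btl", ",".join(names)]


print(" ".join(btl_arg(["gm"])))
print(" ".join(btl_arg(["gm", "tcp", "self"])))
```

The first call yields "-mca btl gm,self" and the second leaves the list that worked above unchanged.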
Thanks again for your help!
Kai
--------------------
Kai Song
<ks...@lbl.gov> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov
----- Original Message -----
From: Ralph Castain <r...@open-mpi.org>
Date: Wednesday, July 22, 2009 5:03 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <us...@open-mpi.org>
Cc: "Song, Kai Song" <ks...@lbl.gov>
mpirun --display-allocation --display-map
Run a batch job that just prints out $PBS_NODEFILE. I'll bet that it
isn't what we are expecting, and that the problem comes from it.
In a Torque environment, we read that file to get the list of nodes
and #slots/node that are allocated to your job. We then filter that
through any hostfile you provide. So all the nodes have to be in the
$PBS_NODEFILE, which has to be in the expected format.
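The read-then-filter idea described above can be sketched roughly as follows. This only illustrates the concept and is not Open MPI's actual code. It assumes the nodefile lists one hostname per line, repeated once per slot (as in the node0001/node0000 listing at the top of this thread), and that each hostfile line starts with a hostname. The function names are hypothetical.

```python
from collections import Counter


def slots_per_node(nodefile_lines):
    """Count occurrences of each hostname; each occurrence is one slot."""
    names = [line.strip() for line in nodefile_lines if line.strip()]
    return Counter(names)


def filter_by_hostfile(allocation, hostfile_lines):
    """Keep only allocated nodes that are also named in the hostfile."""
    wanted = set()
    for line in hostfile_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            wanted.add(line.split()[0])       # first token is the hostname
    return {name: n for name, n in allocation.items() if name in wanted}


nodefile = ["node0001", "node0001", "node0000", "node0000"]
allocation = slots_per_node(nodefile)
print(dict(allocation))
print(filter_by_hostfile(allocation, ["node0000 slots=2"]))
```

A hostfile entry for a node that is missing from the nodefile simply drops out of the result, which is why every node has to appear in $PBS_NODEFILE.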
I'm a little suspicious, though, because of your reported error. It
sounds like we are indeed trying to launch a daemon on a known node. I
can only surmise a couple of possible reasons for the failure:
1. this is a node that is not allocated for your use. Was node0006 in
your allocation?? If not, then the launch would fail. This would
indicate we are not parsing the nodefile correctly.
2. if the node is in your allocation, then I would wonder if you have
a TCP connection between that node and the one where mpirun exists. Is
there a firewall in the way? Or something that would preclude a
connection? Frankly, I doubt this possibility because it works when
run manually.
My money is on option #1. :-)
If it is #1 and you send me a copy of a sample $PBS_NODEFILE on your
system, I can create a way to parse it so we can provide support for
that older version.
Ralph
On Jul 21, 2009, at 4:44 PM, Song, Kai Song wrote:
Hi Ralph,
Thanks a lot for the fast response.
Could you tell me which command I should use "--display-allocation"
and "--display-map" with? mpirun? ./configure?
Also, we have tested that if we put node=1 in our PBS script, the
hello world works. But when I put node=2 or more, it hangs until
timeout, and the error message is something like:
node0006 - daemon did not report back when launched
However, if we don't go through the scheduler and run MPI manually,
everything works fine too.
/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out
What do you think the problem could be? It is not a network issue,
because running MPI manually works. That is why we are asking about
Torque compatibility.
Thanks again,
Kai
----- Original Message -----
From: Ralph Castain <r...@open-mpi.org>
Date: Tuesday, July 21, 2009 12:12 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <us...@open-mpi.org>
I'm afraid I have no idea - I've never seen a Torque version that old,
however, so it is quite possible that we don't work with it. It also
looks like it may have been modified (given the p2-aspen3 on the end),
so I have no idea how the system would behave.
First thing you could do is verify that the allocation is being read
correctly. Add a --display-allocation to the cmd line and see what we
think Torque gave us. Then add --display-map to see where it plans to
place the processes.
If all that looks okay, and if you allow ssh, then try -mca plm rsh on
the cmd line and see if that works.
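One way to do the first check mechanically is to parse the ALLOCATED NODES block and compare the slot counts with what the nodefile contains. The sketch below assumes the line format printed by --display-allocation as it appears in the output quoted in this thread ("Data for node: Name: ... Num slots: ..."); other Open MPI versions may print it differently. The function name parse_allocation is hypothetical.

```python
import re

# Matches allocation lines only; JOB MAP lines say "Num procs" and are skipped.
NODE_LINE = re.compile(r"Data for node:\s+Name:\s+(\S+)\s+Num slots:\s+(\d+)")


def parse_allocation(output):
    """Map node name -> slot count from --display-allocation output."""
    slots = {}
    for line in output.splitlines():
        match = NODE_LINE.search(line)
        if match:
            slots[match.group(1)] = int(match.group(2))
    return slots


sample = """\
====================== ALLOCATED NODES ======================
Data for node: Name: node0001 Num slots: 2 Max slots: 0
Data for node: Name: node0000 Num slots: 2 Max slots: 0
=================================================================
"""
print(parse_allocation(sample))
```

If the resulting mapping differs from the hostnames and counts in $PBS_NODEFILE, the allocation is probably not being read as expected.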
HTH
Ralph
On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:
Hi All,
I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3
and myrinet. I compiled it just fine with this configuration:
./configure --prefix=/home/software/ompi/1.3.2-pgi \
    --with-gm=/usr/local/ --with-gm-libdir=/usr/local/lib64/ \
    --enable-static --disable-shared \
    --with-tm=/usr/ --without-threads \
    CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 \
    LDFLAGS=-L/usr/lib64/torque/
However, when I submit jobs for 2 or more nodes through the torque
scheduler, the jobs just hang. They show the RUN state, but there is no
communication between the nodes, and the jobs then die with a timeout.
We have confirmed that the myrinet is working because our lam-mpi-7.1
works just fine. We are having a really hard time determining the cause
of this problem, so we suspect it's because our torque is too old.
What is the minimum version of torque required for open-mpi-1.3.2? The
README file didn't specify this detail. Does anyone know more about it?
Thanks in advance,
Kai
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users