Hi

I am in the process of moving a parallel program from our old 32-bit (Xeon @ 2.8 GHz) 
Linux cluster to a new EM64T-based (Intel Xeon 5160 @ 3.00 GHz) Linux cluster.

The OS on the old cluster is Red Hat 9; the new cluster runs Fedora 7.

I have installed the Intel Fortran compiler version 10.0 and Open MPI 1.2.3.

I configured Open MPI with "--prefix=/opt/openmpi F77=ifort FC=ifort".
config.log and the output from ompi_info --all are in the attached files.



/opt/ is mounted on all nodes in the cluster.

The program that is causing me problems solves two large, interrelated systems of 
equations (more than 200,000,000 equations) using PCG iteration. The program iterates 
on the first system until a certain degree of convergence is reached; then the master 
node executes a shell script, via a system call, that starts the parallel solver for 
the second system. The iteration on the second system again continues until a certain 
degree of convergence is reached, and some parameters from solving the second system 
are stored in files. After the second system has been solved, the stored parameters 
are used in the solver for the first system. Both before and after the master node 
makes the system call, the nodes are synchronized via calls to MPI_BARRIER.
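
In outline, the coupling step looks like this (a simplified sketch with illustrative 
names, not the actual solver code; SYSTEM is a compiler extension supported by ifort):

      PROGRAM coupling_outline
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: MYID, IERR

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)

!     ... iterate on the first system until partial convergence ...

      CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)   ! all nodes sync up
      IF (MYID .EQ. 0) THEN
!        Master launches the parallel solver for the second system
!        (the script name is illustrative).
         CALL SYSTEM('./start_second_solver.sh')
      END IF
      CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)   ! sync again afterwards

!     ... continue the iteration on the first system ...

      CALL MPI_FINALIZE(IERR)
      END PROGRAM coupling_outline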

This setup has worked fine on the old cluster, but on the new cluster the system call 
does not start the parallel solver for the second system. The solver program is very 
complex, so I have made some small Fortran programs and shell scripts that illustrate 
the problem.

The setup is as follows:

mpi_main starts MPI on a number of nodes and checks that the nodes are alive. The 
master then executes the shell script serial.sh via a system call; this script starts 
a serial Fortran program (serial_subprog). After returning from the system call, the 
master executes the shell script mpi.sh, which tries to start mpi_subprog via mpirun.
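
In outline (the files in the attached test.tar.gz are the authoritative versions), 
mpi_main does roughly the following; the status returned by each system call is 
captured and printed:

      PROGRAM mpi_main_outline
      USE IFPORT        ! for the SYSTEM function (ifort)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: MYID, NPROCS, IERR, RC

      CALL MPI_INIT(RC)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, RC)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, RC)
      WRITE(*,*) 'Process', MYID, ' of', NPROCS, ' is alive'

      CALL MPI_BARRIER(MPI_COMM_WORLD, RC)
      IF (MYID .EQ. 0) THEN
         IERR = SYSTEM('./serial.sh')  ! starts serial_subprog
         WRITE(*,*) 'IERR=', IERR
         IERR = SYSTEM('./mpi.sh')     ! should start mpi_subprog
         WRITE(*,*) 'IERR=', IERR
      END IF
      CALL MPI_BARRIER(MPI_COMM_WORLD, RC)

      CALL MPI_FINALIZE(RC)
      END PROGRAM mpi_main_outline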

I have used mpif90 to compile the MPI programs and ifort to compile the serial 
program.

mpi_main starts as expected, and the call to serial.sh starts the serial program as 
expected. However, the system call that executes mpi.sh does not start mpi_subprog.

The Fortran programs and scripts are in the attached file test.tar.gz. 


When I run the setup via:
 
mpirun -np 4 -hostfile nodelist ./mpi_main 

I get the following:

MPI_INIT return code:            0
 MPI_INIT return code:            0
 MPI_COMM_RANK return code:            0
 MPI_COMM_SIZE return code:            0
 Process            1  of            2  is alive - Hostname= c01b04
           1  :           19
 MPI_COMM_RANK return code:            0
 MPI_COMM_SIZE return code:            0
 Process            0  of            2  is alive - Hostname= c01b05
           0  :           19
 MYID:            1  MPI_REDUCE 1 red_chk_sum=           0  rc=           0
 MYID:            0  MPI_REDUCE 1 red_chk_sum=           2  rc=           0
 MYID:            1  MPI_BARRIER 1 RC=            0
 MYID:            0  MPI_BARRIER 1 RC=            0

 Master will now execute the shell script serial.sh

This is from serial.sh

 We are now in the serial subprogram

 Master back from the shell script serial.sh
 IERR=            0

 Master will now execute the shell script mpi.sh

This is from mpi.sh
/nav/denmark/navper19/mpi_test
[c01b05.ctrl.ghpc.dk:25337] OOB: Connection to HNP lost

 Master back from the shell script mpi.sh
 IERR=            0

 MYID:            0  MPI_BARRIER 2 RC=            0
 MYID:            0  MPI_REDUCE 2 red_chk_sum=          20  rc=           0
 MYID:            1  MPI_BARRIER 2 RC=            0
 MYID:            1  MPI_REDUCE 2 red_chk_sum=           0  rc=           0

As you can see, the execution of the serial program works, while the MPI program is 
not started.

I have checked that mpirun is in the PATH of the shell started by the system call, 
and I have checked that the mpi.sh script works when it is executed from the command 
prompt. Output from a run with the mpirun options -v -d is in the attached file 
test.tar.gz.
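
The PATH check was done along these lines (an illustrative sketch only, not the exact 
code in test.tar.gz; the log file name is made up):

      PROGRAM check_subshell_env
      IMPLICIT NONE
!     Dump what the shell spawned by the system call actually sees.
!     SYSTEM is a compiler extension (ifort/gfortran).
      CALL SYSTEM('echo "PATH=$PATH"  > subshell_env.log')
      CALL SYSTEM('which mpirun      >> subshell_env.log')
      CALL SYSTEM('env | sort        >> subshell_env.log')
      END PROGRAM check_subshell_env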

Is there anyone out there who has tried to do something similar?

Regards

Per Madsen
Senior scientist

       
         AARHUS UNIVERSITET / UNIVERSITY OF AARHUS     
Det Jordbrugsvidenskabelige Fakultet / Faculty of Agricultural Sciences
Forskningscenter Foulum / Research Centre Foulum       
Genetik og Bioteknologi / Dept. of Genetics and Biotechnology  
Blichers Allé 20, P.O. BOX 50  
DK-8830 Tjele  
       




Attachment: config.log.gz
Description: config.log.gz

eth0      Link encap:Ethernet  HWaddr 00:14:5E:C2:BB:E4  
          inet addr:10.55.55.65  Bcast:10.55.55.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fec2:bbe4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:140268254 errors:0 dropped:0 overruns:0 frame:0
          TX packets:166380187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:138443717024 (128.9 GiB)  TX bytes:201070313859 (187.2 GiB)
          Interrupt:17 Memory:da000000-da012100 

eth1      Link encap:Ethernet  HWaddr 00:14:5E:C2:BB:E6  
          inet addr:10.55.56.65  Bcast:10.55.56.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fec2:bbe6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:639993727 errors:0 dropped:0 overruns:0 frame:0
          TX packets:518028570 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:845939849040 (787.8 GiB)  TX bytes:311070822710 (289.7 GiB)
          Interrupt:19 Memory:d8000000-d8012100 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:143166 errors:0 dropped:0 overruns:0 frame:0
          TX packets:143166 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:31459709 (30.0 MiB)  TX bytes:31459709 (30.0 MiB)

Attachment: ompi_info.log.gz
Description: ompi_info.log.gz

Attachment: test.tar.gz
Description: test.tar.gz
