Hi,

We have a dual-CPU Xeon cluster running Red Hat. I have compiled both Open MPI 1.2.6 and AMBER (a scientific program for parallel molecular simulations, written in Fortran 77 and 90) with g95. Both compilations seem to be fine. AMBER runs successfully from the command prompt with "mpiexec -np x <exe ...>", but under the PBS batch system it fails to run in parallel and uses only a single CPU. I get errors like:

[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x39000000
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e) [0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8] /home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9] /home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10] /home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11] /home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4) [0x42015574]
[Morpheus06:02155] [13] /home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)

If I supply a machine file ($PBS_NODEFILE), it fails with:

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Help, please.

--

Arturas Ziemys, PhD
 School of Health Information Sciences
 University of Texas Health Science Center at Houston
 7000 Fannin, Suite 880
 Houston, TX 77030
 Phone: (713) 500-3975
Fax: (713) 500-3929
