Sorry for the delay in replying -- I was on vacation for a week and
all the mail piled up...
That is a very weird stack trace. Is the application finishing and
then crashing during the shutdown?
I'd be surprised if the problem is actually related to PBS (the stack
trace would be quite different). I wonder if the real problem was
that it only started one process, and Amber was unable to handle that
nicely...?
Are you sure that you have PBS support compiled in Open MPI properly?
Check the output of "ompi_info | grep tm". You should see a line like this:
    MCA pls: tm (MCA v1.0, API v1.0.1, Component v1.2.6)
If you don't see a "pls: tm" line, then your OMPI was not configured
with PBS support, and mpiexec may have only started one copy of
Amber...?
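A rough sketch of that check as a shell function, in case it is handy to drop into a job script. The name has_tm_launcher is made up for illustration, and the pattern assumes the 1.2-series "MCA pls: tm" line format shown above:

```shell
# Read ompi_info output on stdin and report whether the PBS/Torque (tm)
# launcher component is listed. has_tm_launcher is a hypothetical helper name.
has_tm_launcher() {
    if grep -Eq 'MCA pls: tm '; then
        echo "tm support found"
    else
        echo "tm support missing"
        return 1
    fi
}

# Typical use on a node with Open MPI on the PATH:
#   ompi_info | has_tm_launcher
```

If it prints "tm support missing", Open MPI would need to be rebuilt with PBS support before mpiexec can launch through PBS.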
As for trying to use a hostfile, I think the real errors are here:
Host key verification failed.
Host key verification failed.
It seems that your ssh is not set up properly...?
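A minimal sketch for narrowing that down, assuming the rsh launcher needs non-interactive ssh to every node in the machine file. The function name check_node_ssh and the SSH override variable are hypothetical, not part of Open MPI or PBS:

```shell
# Try a non-interactive ssh to every unique host listed in a node file.
# SSH is a hypothetical override hook; it defaults to the real ssh client.
check_node_ssh() {
    nodefile=$1
    failed=0
    for host in $(sort -u "$nodefile"); do
        # BatchMode=yes makes ssh fail instead of prompting for input.
        if "${SSH:-ssh}" -n -o BatchMode=yes "$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Typical use inside a PBS job:
#   check_node_ssh "$PBS_NODEFILE"
```

Any host reported as FAILED is worth trying by hand with plain ssh from the same node to see the actual error.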
On Jun 12, 2008, at 11:52 AM, Arturas Ziemys wrote:
Hi,
We have a dual-CPU Xeon cluster running Red Hat. I have compiled Open MPI
1.2.6 with g95, and AMBER (a scientific program for parallel molecular
simulations; Fortran 77 & 90). Both compilations seem to be fine.
However, AMBER runs successfully from the command prompt with
"mpiexec -np x <exe ...>", but under the PBS batch system it fails to run
in parallel and uses only a single CPU. I get errors like:
[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x39000000
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e) [0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8] /home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9] /home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10] /home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11] /home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4) [0x42015574]
[Morpheus06:02155] [13] /home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited
on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)
If I supply a machine file ($PBS_NODEFILE), it fails with:
Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
Help, please.
--
Arturas Ziemys, PhD
School of Health Information Sciences
University of Texas Health Science Center at Houston
7000 Fannin, Suite 880
Houston, TX 77030
Phone: (713) 500-3975
Fax: (713) 500-3929
--
Jeff Squyres
Cisco Systems