Sorry for the delay in replying -- I was on vacation for a week and all the mail piled up...

That is a very weird stack trace. Is the application finishing and then crashing during the shutdown?

I'd be surprised if the problem is actually related to PBS (the stack trace would be quite different). I wonder if the real problem was that it only started one process, and Amber was unable to handle that nicely...?

Are you sure that PBS support was compiled into your Open MPI properly? Run "ompi_info | grep tm"; you should see a line like this:

                 MCA pls: tm (MCA v1.0, API v1.0.1, Component v1.2.6)

If you don't see a "pls: tm" line, then your OMPI was not configured with PBS support, and mpiexec may have only started one copy of Amber...?
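
If it is useful to automate that check (for example across several installs), here is a minimal sketch in Python. The function names has_tm_launcher and check_installed_ompi are made up for illustration; the code only scans ompi_info output for a "pls: tm" line like the one shown above, and it assumes ompi_info is on your PATH.

```python
import re
import subprocess

# Matches component lines such as:
#   "MCA pls: tm (MCA v1.0, API v1.0.1, Component v1.2.6)"
# The framework name "pls" is taken from the example line above.
_PLS_TM = re.compile(r"^\s*MCA\s+pls:\s+tm\b", re.MULTILINE)


def has_tm_launcher(ompi_info_text):
    """Return True if the given ompi_info output lists a 'pls: tm' component."""
    return bool(_PLS_TM.search(ompi_info_text))


def check_installed_ompi():
    """Run ompi_info (assumed to be on PATH) and look for the tm launcher."""
    result = subprocess.run(
        ["ompi_info"], capture_output=True, text=True, check=True
    )
    return has_tm_launcher(result.stdout)
```

Calling check_installed_ompi() on a compute node, from inside a PBS job, tells you whether the Open MPI that the job actually picks up has the tm component, which is not necessarily the same install you see on the head node.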

As for trying to use a hostfile, I think the real errors are here:

Host key verification failed.
Host key verification failed.

It seems that your ssh is not set up properly...?
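
One way to find out which nodes are affected is to try a non-interactive ssh to each host in the node file. Below is a rough sketch in Python, not a definitive tool: the helper names (unique_hosts, ssh_probe_command, failing_hosts) are hypothetical, and it assumes the node file lists one host name per line with repeats, as $PBS_NODEFILE normally does. BatchMode and ConnectTimeout are standard OpenSSH client options.

```python
import subprocess


def unique_hosts(nodefile_text):
    """Return host names in first-seen order, dropping repeats and blank lines."""
    seen = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


def ssh_probe_command(host):
    """Build an ssh command that fails instead of prompting for anything."""
    # BatchMode=yes disables password and passphrase prompts, so a host that
    # cannot be reached non-interactively exits with a non-zero status.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def failing_hosts(hosts):
    """Return (host, stderr) pairs for hosts where the ssh probe failed."""
    bad = []
    for host in hosts:
        result = subprocess.run(
            ssh_probe_command(host), capture_output=True, text=True
        )
        if result.returncode != 0:
            bad.append((host, result.stderr.strip()))
    return bad
```

Inside a job script you would read the file named by $PBS_NODEFILE, pass its contents to unique_hosts, and hand the result to failing_hosts. Any host reported there has to be fixed before the rsh/ssh launcher can start daemons on it.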



On Jun 12, 2008, at 11:52 AM, Arturas Ziemys wrote:

Hi,

We have a dual-CPU Xeon cluster running Red Hat. I have compiled Open MPI
1.2.6 and AMBER (a scientific program for parallel molecular simulations;
Fortran 77 & 90) with g95. Both compilations seem to be fine. However,
AMBER runs successfully from the command prompt with
"mpiexec -np x <exe ...>", but under the PBS batch system it fails to run
in parallel and uses only a single CPU. I get errors like:

[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x39000000
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e)
[0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
[0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
[0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8]
/home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9]
/home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10]
/home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11]
/home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4)
[0x42015574]
[Morpheus06:02155] [13]
/home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited
on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)

If I supply a machine file ($PBS_NODEFILE), it fails with:

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as
expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c
at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as
expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Help, please.

--

Arturas Ziemys, PhD
 School of Health Information Sciences
 University of Texas Health Science Center at Houston
 7000 Fannin, Suite 880
 Houston, TX 77030
 Phone: (713) 500-3975
 Fax:   (713) 500-3929



--
Jeff Squyres
Cisco Systems
