I heard back from my Altair contact this morning. He told me that they
did in fact make a change in some version of 10.x that broke this. They
don't have a workaround for v10, but he said it was fixed in v11.x.
I built OpenMPI 1.5.3 this morning with PBSPro v11.0, and it works fine.
I don't get any segfaults.
-Justin.
On 07/26/2011 05:49 PM, Ralph Castain wrote:
I don't believe we ever got anywhere with this due to lack of response. If you
get some info on what happened to tm_init, please pass it along.
Best guess: something changed in a recent PBS Pro release. Since none of us
have access to it, we don't know what's going on. :-(
On Jul 26, 2011, at 10:10 AM, Wood, Justin Contractor, SAIC wrote:
I'm having a problem using OpenMPI under PBS Pro 10.4. I tried both 1.4.3 and
1.5.3, and both behave the same. I'm able to run just fine if I don't use PBS and
go direct to the nodes. Also, if I run under PBS and use only 1 node, it works
fine, but as soon as I span nodes, I get the following:
[a4ou-n501:07366] *** Process received signal ***
[a4ou-n501:07366] Signal: Segmentation fault (11)
[a4ou-n501:07366] Signal code: Address not mapped (1)
[a4ou-n501:07366] Failing at address: 0x3f
[a4ou-n501:07366] [ 0] /lib64/libpthread.so.0 [0x3f2b20eb10]
[a4ou-n501:07366] [ 1] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(discui_+0x84) [0x2affa453765c]
[a4ou-n501:07366] [ 2] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(diswsi+0xc3) [0x2affa4534c6f]
[a4ou-n501:07366] [ 3] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 [0x2affa453290c]
[a4ou-n501:07366] [ 4] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(tm_init+0x1fe) [0x2affa4532bf8]
[a4ou-n501:07366] [ 5] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 [0x2affa452691c]
[a4ou-n501:07366] [ 6] mpirun [0x404c17]
[a4ou-n501:07366] [ 7] mpirun [0x403e28]
[a4ou-n501:07366] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f2a61d994]
[a4ou-n501:07366] [ 9] mpirun [0x403d59]
[a4ou-n501:07366] *** End of error message ***
Segmentation fault
I searched the archives and found a similar issue from last year:
http://www.open-mpi.org/community/lists/users/2010/02/12084.php
The last update I saw was that someone was going to contact Altair and have
them look at why it was failing to do the tm_init. Does anyone have an update
to this, and has anyone been able to run successfully using recent versions of
PBSPro? I've also contacted our rep at Altair, but he hasn't responded yet.
Thanks, Justin.
Justin Wood
Systems Engineer
FNMOC | SAIC
7 Grace Hopper, Stop 1
Monterey, CA
justin.g.wood....@navy.mil
justin.g.w...@saic.com
office: 831.656.4671
mobile: 831.869.1576
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users