Well, we all wish the Altair guys would at least try to maintain backwards
compatibility with the community, but they have a big habit of
breaking things. This isn't the first time they've broken a more
customer-facing function like tm_spawn. (They also like breaking
pbs_statjob too!).
I have access to PBS Pro and I can raise the issue with Altair if it
would help. Just let me know how I can be helpful.
-Joshua Bernstein
Senior Software Engineer
Penguin Computing
On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:
Bummer!
If it helps, could you put us in touch with the PBS Pro people? We
usually only have access to Torque when developing the TM-launching
stuff (PBS Pro and Torque supposedly share the same TM interface,
but we don't have access to PBS Pro, so we don't know if it has
diverged over time).
On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:
Ralph,
This is my first build of Open MPI, so I haven't had this working
before. I'm pretty confident that PATH and LD_LIBRARY_PATH issues
are not the cause; otherwise, launches outside of PBS would fail
too. Also, I tried compiling everything statically with the same
result.
Some additional info: (1) I did a diff of the tm.h from PBS 10.2
against the one from version 8.0 that we had - they are identical; and (2) I've
tried this with both the Intel 11.1 and GCC compilers and gotten
the exact same run-time errors.
For now, I've got a work-around set up that launches over ssh and
still attaches the processes to PBS.
Thanks for your help.
Steve
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, February 12, 2010 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
Afraid compilers don't help when the param is a void*...
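Just to illustrate why (the info structs here are made up, purely for
illustration, not from any real PBS header):

#include <tm.h>

/* hypothetical "info" layouts -- illustrative only */
struct info_v1 { int flags; };
struct info_v2 { int flags; long extra; };

int demo(void)
{
    struct tm_roots roots;
    struct info_v1 info = { 0 };

    /* tm_init() takes a void* as its first argument, so this compiles
       cleanly even if the library were rebuilt expecting an info_v2
       behind that pointer; any mismatch only shows up at run time. */
    return tm_init(&info, &roots);
}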
It looks like this is consistent, but I've never tried it under
that particular environment. Did prior versions of OMPI work, or
are you trying this for the first time?
One thing you might check is that you have the correct PATH and
LD_LIBRARY_PATH set to point to this version of OMPI and the
corresponding PBS Pro libs you used to build it. Most Linux distros
come with OMPI installed, and that can cause surprises.
We run under Torque at major installations every day, so it -should-
work... unless PBS Pro has done something unusual.
On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
Yes, the failure seems to be in mpirun; it never even gets to my
application.
The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);
where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x
tm_task_id
If the API was different, wouldn't the compiler most likely
generate an error at compile-time?
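For reference, here is that layout written out. The field names are
the usual TM ones and are only illustrative; the order of the two
tm_task_id members and the exact types should be taken from the
actual tm.h:

struct tm_roots {
    tm_task_id  tm_me;          /* the two tm_task_id handles...     */
    tm_task_id  tm_parent;      /* ...(order per the real header)    */
    int         tm_nnodes;      /* the three ints                    */
    int         tm_ntasks;
    int         tm_taskpoolid;
    tm_task_id *tm_tasklist;    /* the final element (a pointer in
                                   the Torque header)                */
};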
Thanks!
Steve
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
I'm a tad confused - this trace would appear to indicate that
mpirun is failing, yes? Not your application?
The reason it works for local procs is that tm_init isn't called
for that case - mpirun just fork/exec's the procs directly. When
remote nodes are required, mpirun must connect to Torque. This is
done with a call to:
ret = tm_init(NULL, &tm_root);
My guess is that something in that API changed in PBS Pro 10.2.
Can you check the tm header file and see? I have no access to
PBS any more, so I'll have to rely on your eyes to see a diff.
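If it helps narrow things down, something like the following (a rough
standalone sketch, not part of Open MPI; adjust the include/lib paths
and library name to your PBS Pro install) run inside a multi-node job
would show whether tm_init() itself dies outside of mpirun:

/* tm_check.c -- standalone sketch; build against the same tm.h and
   libpbs used for the Open MPI build, e.g.
     cc tm_check.c -I<pbs include dir> -L<pbs lib dir> -lpbs -o tm_check
   then run it from inside a multi-node PBS job. */
#include <stdio.h>
#include <tm.h>

int main(void)
{
    struct tm_roots tm_root;
    int ret = tm_init(NULL, &tm_root);   /* the same call mpirun makes */

    if (ret != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed: %d\n", ret);
        return 1;
    }
    printf("tm_init succeeded; tm_nnodes = %d\n", tm_root.tm_nnodes);
    tm_finalize();
    return 0;
}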
Thanks
Ralph
On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
Hello,
I'm having problems running Open MPI jobs under PBS Pro 10.2.
I've configured and built OpenMPI 1.4.1 with the Intel 11.1
compiler on Linux with --with-tm support, and the build runs
fine. I've also built with static libraries per the FAQ
suggestion since libpbs is static. However, my test application
keeps failing with a segmentation fault, but ONLY when trying to
select more than 1 node. Running on a single node within PBS
works fine. Also, running outside of PBS via ssh runs fine as
well, even across multiple nodes. OpenIB support is also
enabled, but that doesn't seem to affect the error because I've
also tried running with the --mca btl tcp,self flag and it still
doesn't work. Here is the error I'm getting:
[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
[n34:26892] [ 4] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
[n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
[n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
[n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
[n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
[n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
[n34:26892] *** End of error message ***
Segmentation fault
(NOTE: pbs_mpirun = orterun, mpirun, etc.)
Has anyone else seen errors like this within PBS?
============================================
Steve Repsher
Boeing Defense, Space, & Security - Rotorcraft
Aerodynamics/CFD
Phone: (610) 591-1510
Fax: (610) 591-6263
stephen.j.reps...@boeing.com
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users