Ralph,

This is my first build of Open MPI, so I haven't had this working before.  I'm 
pretty confident that PATH and LD_LIBRARY_PATH issues are not the cause; 
otherwise launches outside of PBS would fail too.  Also, I tried compiling 
everything statically, with the same result.

Some additional info...  (1) I did a diff on tm.h from PBS 10.2 against the 
one from version 8.0 that we had - they are identical; and (2) I've tried this 
with both the Intel 11.1 and GCC compilers and gotten the exact same run-time 
errors.

For now, I've got a work-around set up that launches over ssh and still 
attaches the processes to PBS.

Thanks for your help.

Steve


________________________________
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Friday, February 12, 2010 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

Afraid compilers don't help when the param is a void*...
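
(To make that concrete - a minimal sketch, with takes_anything() as a 
hypothetical stand-in, not a real PBS function:)

    /* A void * parameter defeats type checking: the compiler accepts
     * any pointer, so a change in what the library expects behind that
     * pointer cannot be caught at compile time. */
    void takes_anything(void *info);

    void demo(void)
    {
        int    x = 0;
        double y = 0.0;
        takes_anything(&x);  /* compiles cleanly */
        takes_anything(&y);  /* also compiles cleanly - no type check */
    }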

It looks like this is consistent, but I've never tried it under that particular 
environment. Did prior versions of OMPI work, or are you trying this for the 
first time?

One thing you might check is that you have the correct PATH and LD_LIBRARY_PATH 
set to point to this version of OMPI and the corresponding PBS Pro libs you 
used to build it. Most Linux distros come with OMPI installed, and that can 
cause surprises.

We run under Torque at major installations every day, so it -should- 
work...unless PBS Pro has done something unusual.


On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:

Yes, the failure seems to be in mpirun; it never even gets to my application.

The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);

where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id
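
For reference, Torque's tm.h declares the struct roughly as follows - a 
sketch from memory, so member names and order may differ in PBS Pro's header 
(which is exactly the kind of silent ABI mismatch a compiler can't catch):

    /* struct tm_roots as it appears in Torque's tm.h (from memory -
     * verify against your own header). Note the last member is a
     * pointer there, not a plain tm_task_id. */
    struct tm_roots {
        tm_task_id  tm_parent;     /* parent task id            */
        tm_task_id  tm_me;         /* this task's id            */
        int         tm_nnodes;     /* number of nodes allocated */
        int         tm_ntasks;     /* number of tasks           */
        int         tm_taskpoolid; /* task pool id              */
        tm_task_id *tm_tasklist;   /* list of task ids          */
    };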

If the API was different, wouldn't the compiler most likely generate an error 
at compile-time?

Thanks!

Steve


________________________________
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

I'm a tad confused - this trace would appear to indicate that mpirun is 
failing, yes? Not your application?

The reason it works for local procs is that tm_init isn't called for that case 
- mpirun just fork/exec's the procs directly. When remote nodes are required, 
mpirun must connect to Torque. This is done with a call to:

        ret = tm_init(NULL, &tm_root);

My guess is that something in that API changed in PBS Pro 10.2. Can you check 
the tm header file and see? I have no access to PBS any more, so I'll have to 
rely on your eyes to see a diff.
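
In case it helps isolate things, here's a minimal standalone tm_init check 
you could try - a sketch assuming a Torque-style tm.h and libpbs, with 
placeholder include/library paths:

    /* tm_test.c - minimal tm_init smoke test (diagnostic sketch only).
     * Build inside a PBS job, e.g.:
     *   gcc tm_test.c -I<pbs-include-dir> -L<pbs-lib-dir> -lpbs -o tm_test
     */
    #include <stdio.h>
    #include <tm.h>

    int main(void)
    {
        struct tm_roots tm_root;
        int ret = tm_init(NULL, &tm_root);

        if (ret != TM_SUCCESS) {
            fprintf(stderr, "tm_init failed with code %d\n", ret);
            return 1;
        }
        printf("tm_init OK: nnodes=%d ntasks=%d\n",
               tm_root.tm_nnodes, tm_root.tm_ntasks);
        tm_finalize();
        return 0;
    }

If this segfaults inside tm_init too, the problem is in the PBS library 
rather than in Open MPI's use of it.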

Thanks
Ralph

On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:

Hello,

I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've configured 
and built Open MPI 1.4.1 with the Intel 11.1 compiler on Linux with --with-tm 
support, and the build runs fine.  I've also built with static libraries per 
the FAQ suggestion, since libpbs is static.  However, my test application 
keeps failing with a segmentation fault, but ONLY when trying to select more 
than 1 node.  Running on a single node within PBS works fine.  Also, running 
outside of PBS via ssh works fine as well, even across multiple nodes.  OpenIB 
support is also enabled, but that doesn't seem to affect the error, because 
I've also tried running with the --mca btl tcp,self flag and it still fails.  
Here is the error I'm getting:

[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
[n34:26892] [ 4] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
[n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
[n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
[n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
[n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
[n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
[n34:26892] *** End of error message ***
Segmentation fault

(NOTE: pbs_mpirun = orterun, mpirun, etc.)

Has anyone else seen errors like this within PBS?

============================================
Steve Repsher
Boeing Defense, Space, & Security - Rotorcraft
Aerodynamics/CFD
Phone: (610) 591-1510
Fax: (610) 591-6263
stephen.j.reps...@boeing.com



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
