I'm a tad confused - this trace would appear to indicate that mpirun is 
failing, yes? Not your application?

The reason it works for local procs is that tm_init isn't called in that case - 
mpirun just fork/exec's the procs directly. When remote nodes are required, 
mpirun must connect to PBS through its TM (task management) interface. This is 
done with a call to:

        ret = tm_init(NULL, &tm_root);

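For reference, here is a minimal sketch of that call sequence with error 
checking (assuming the standard TM API from tm.h; the checks and messages are 
illustrative, not the exact Open MPI source):

        #include <stdio.h>
        #include <tm.h>                     /* PBS/Torque task management API */

        int main(void)
        {
            struct tm_roots tm_root;        /* filled in by tm_init on success */
            int ret;

            /* Connect this process to the local MOM; a NULL info arg is the
               conventional usage. */
            ret = tm_init(NULL, &tm_root);
            if (TM_SUCCESS != ret) {
                fprintf(stderr, "tm_init failed: %d\n", ret);
                return 1;
            }

            printf("TM connection up; job spans %d node(s)\n", tm_root.tm_nnodes);
            tm_finalize();                  /* drop the connection cleanly */
            return 0;
        }

If even a small standalone program like this segfaults inside tm_init under PBS 
Pro 10.2, the problem is on the PBS side of that API, not in Open MPI itself.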
My guess is that something changed in that API in PBS Pro 10.2. Can you check 
the tm header file (tm.h) and see? I no longer have access to PBS, so I'll have 
to rely on your eyes to spot a diff.
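For comparison, the prototype and the struct it fills in usually look roughly 
like this in Torque's tm.h (field names and ordering may differ in PBS Pro, 
which is exactly what I'd like you to check):

        struct tm_roots {
            tm_task_id  tm_me;          /* this task's id                 */
            tm_task_id  tm_parent;      /* id of the task that spawned us */
            int         tm_nnodes;      /* number of nodes in the job     */
            int         tm_ntasks;      /* number of tasks started so far */
            int         tm_taskpoolid;
            tm_task_id *tm_tasklist;
        };

        int tm_init(void *info, struct tm_roots *roots);

If either the struct layout or the tm_init signature differs from what Open MPI 
1.4.1 was compiled against, that mismatch could produce a segfault inside 
tm_init like the one in your trace.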

Thanks
Ralph

On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:

> Hello,
> 
> I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've 
> configured and built Open MPI 1.4.1 with the Intel 11.1 compiler on Linux 
> with --with-tm support, and the build runs fine.  I've also built with static 
> libraries per the FAQ suggestion, since libpbs is static.  However, my test 
> application keeps failing with a segmentation fault, but ONLY when trying to 
> select more than 1 node.  Running on a single node within PBS works fine.  
> Also, running outside of PBS via ssh works fine as well, even across multiple 
> nodes.  OpenIB support is also enabled, but that doesn't seem to affect the 
> error, because I've also tried running with the --mca btl tcp,self flag and it 
> still doesn't work.  Here is the error I'm getting:
> 
> [n34:26892] *** Process received signal ***
> [n34:26892] Signal: Segmentation fault (11)
> [n34:26892] Signal code: Address not mapped (1)
> [n34:26892] Failing at address: 0x3f
> [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
> [n34:26892] [ 1] 
> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
> [n34:26892] [ 2] 
> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
> [n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
> [n34:26892] [ 4] 
> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
> [n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
> [n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
> [n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
> [n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
> [n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
> [n34:26892] *** End of error message ***
> Segmentation fault
> 
> (NOTE: pbs_mpirun = orterun, mpirun, etc.)
> 
> Has anyone else seen errors like this within PBS?
> 
> ============================================
> Steve Repsher
> Boeing Defense, Space, & Security - Rotorcraft
> Aerodynamics/CFD
> Phone: (610) 591-1510
> Fax: (610) 591-6263
> stephen.j.reps...@boeing.com
> 
> 
> 
