Could you please ask them about this:

OMPI makes the following call to connect to the mother superior:

struct tm_roots tm_root;
ret = tm_init(NULL, &tm_root);

Could they tell us why this segfaults in PBS Pro? It works correctly with all 
releases of Torque.
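
If it helps, here is a minimal standalone reproducer they could build against their own libpbs (just a sketch assuming the standard tm.h interface; it is not taken from the OMPI source, and it has to be run inside a multi-node PBS job):

#include <stdio.h>
#include "tm.h"

int main(void)
{
    struct tm_roots tm_root;

    /* same call OMPI's mpirun makes before launching remote procs */
    int ret = tm_init(NULL, &tm_root);
    if (ret != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed: %d\n", ret);
        return 1;
    }

    printf("tm_init OK: nnodes=%d ntasks=%d\n",
           tm_root.tm_nnodes, tm_root.tm_ntasks);

    tm_finalize();
    return 0;
}

If that faults the same way, the problem is in their tm_init rather than anything on our side.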

Thanks
Ralph

On Feb 15, 2010, at 12:06 PM, Joshua Bernstein wrote:

> Well,
> 
>       We all wish the Altair guys would at least try to maintain backwards 
> compatibility with the community, but they have a big habit of breaking 
> things. This isn't the first time they've broken a customer-facing 
> function like tm_spawn. (They like breaking pbs_statjob too!).
> 
>       I have access to PBS Pro and I can raise the issue with Altair if it 
> would help. Just let me know how I can be helpful.
> 
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
> 
> On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:
> 
>> Bummer!
>> 
>> If it helps, could you put us in touch with the PBS Pro people?  We usually 
>> only have access to Torque when developing the TM-launching stuff (PBS Pro 
>> and Torque supposedly share the same TM interface, but we don't have access 
>> to PBS Pro, so we don't know if it has diverged over time).
>> 
>> 
>> On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:
>> 
>>> Ralph,
>>> 
>>> This is my first build of OpenMPI so I haven't had this working before.  
>>> I'm pretty confident that PATH and LD_LIBRARY_PATH issues are not the 
>>> cause, otherwise launches outside of PBS would fail too.  Also, I tried 
>>> compiling everything statically with the same result.
>>> 
>>> Some additional info...  (1) I did a diff on tm.h between PBS 10.2 and the 
>>> version 8.0 copy that we had - they are identical, and (2) I've tried this 
>>> with both the Intel 11.1 and GCC compilers and gotten the exact same 
>>> run-time errors.
>>> 
>>> For now, I've got a work-around set up that launches over ssh and still 
>>> attaches the processes to PBS.
>>> 
>>> Thanks for your help.
>>> 
>>> Steve
>>> 
>>> 
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of Ralph Castain
>>> Sent: Friday, February 12, 2010 8:29 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>> 
>>> Afraid compilers don't help when the param is a void*...
>>> 
>>> It looks like this is consistent, but I've never tried it under that 
>>> particular environment. Did prior versions of OMPI work, or are you trying 
>>> this for the first time?
>>> 
>>> One thing you might check is that you have the correct PATH and 
>>> LD_LIBRARY_PATH set to point to this version of OMPI and the corresponding 
>>> PBS Pro libs you used to build it. Most Linux distros come with OMPI 
>>> installed, and that can cause surprises.
>>> 
>>> We run under Torque at major installations every day, so it -should- 
>>> work...unless PBS Pro has done something unusual.
>>> 
>>> 
>>> On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
>>> 
>>>> Yes, the failure seems to be in mpirun; it never even gets to my 
>>>> application.
>>>> 
>>>> The proto for tm_init looks like this:
>>>> int tm_init(void *info, struct tm_roots *roots);
>>>> 
>>>> where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id
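>>>> 
>>>> For reference, the declaration in the tm.h here looks roughly like this 
>>>> (retyped by hand, so the member names may be slightly off; note the last 
>>>> element is actually a pointer to a task-id list):
>>>> 
>>>> struct tm_roots {
>>>>     tm_task_id  tm_me;         /* this task's id              */
>>>>     tm_task_id  tm_parent;     /* parent task's id            */
>>>>     int         tm_nnodes;     /* number of nodes allocated   */
>>>>     int         tm_ntasks;
>>>>     int         tm_taskpoolid;
>>>>     tm_task_id *tm_tasklist;   /* pointer to list of task ids */
>>>> };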
>>>> 
>>>> If the API were different, wouldn't the compiler most likely generate an 
>>>> error at compile-time?
>>>> 
>>>> Thanks!
>>>> 
>>>> Steve
>>>> 
>>>> 
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>>> Behalf Of Ralph Castain
>>>> Sent: Friday, February 12, 2010 3:21 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>>> 
>>>> I'm a tad confused - this trace would appear to indicate that mpirun is 
>>>> failing, yes? Not your application?
>>>> 
>>>> The reason it works for local procs is that tm_init isn't called for that 
>>>> case - mpirun just fork/exec's the procs directly. When remote nodes are 
>>>> required, mpirun must connect to Torque. This is done with a call to:
>>>> 
>>>>       ret = tm_init(NULL, &tm_root);
>>>> 
>>>> My guess is that something changed in that API in PBS Pro 10.2. Can you 
>>>> check the tm header file and see? I have no access to PBS any more, so 
>>>> I'll have to rely on your eyes to spot a diff.
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've 
>>>>> configured and built OpenMPI 1.4.1 with the Intel 11.1 compiler on Linux 
>>>>> with --with-tm support, and the build runs fine.  I've also built with 
>>>>> static libraries per the FAQ suggestion since libpbs is static.  However, 
>>>>> my test application keeps failing with a segmentation fault, but ONLY when 
>>>>> trying to select more than 1 node.  Running on a single node within PBS 
>>>>> works fine.  Also, running outside of PBS via ssh runs fine as well, even 
>>>>> across multiple nodes.  OpenIB support is also enabled, but that doesn't 
>>>>> seem to affect the error, because I've also tried running with the --mca 
>>>>> btl tcp,self flag and it still doesn't work.  Here is the error I'm 
>>>>> getting:
>>>>> 
>>>>> [n34:26892] *** Process received signal ***
>>>>> [n34:26892] Signal: Segmentation fault (11)
>>>>> [n34:26892] Signal code: Address not mapped (1)
>>>>> [n34:26892] Failing at address: 0x3f
>>>>> [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
>>>>> [n34:26892] [ 1] 
>>>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) 
>>>>> [0x476a50]
>>>>> [n34:26892] [ 2] 
>>>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
>>>>> [n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>>>> [0x471d0c]
>>>>> [n34:26892] [ 4] 
>>>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) 
>>>>> [0x471ff8]
>>>>> [n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>>>> [0x43f580]
>>>>> [n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>>>> [0x413921]
>>>>> [n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>>>> [0x412b78]
>>>>> [n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
>>>>> [n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>>>> [0x412ac9]
>>>>> [n34:26892] *** End of error message ***
>>>>> Segmentation fault
>>>>> 
>>>>> (NOTE: pbs_mpirun = orterun, mpirun, etc.)
>>>>> 
>>>>> Has anyone else seen errors like this within PBS?
>>>>> 
>>>>> ============================================
>>>>> Steve Repsher
>>>>> Boeing Defense, Space, & Security - Rotorcraft
>>>>> Aerodynamics/CFD
>>>>> Phone: (610) 591-1510
>>>>> Fax: (610) 591-6263
>>>>> stephen.j.reps...@boeing.com
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
