Could you please ask them about this: OMPI makes the following call to connect to the mother superior:

    struct tm_roots tm_root;
    ret = tm_init(NULL, &tm_root);

Could they tell us why this segfaults in PBS Pro? It works correctly with all releases of Torque.
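If it helps their debugging, a bare-bones test along the same lines, independent of Open MPI, would look something like the sketch below. The header path, link flags, and error handling are my guesses for a generic install, not anything taken from our code:

    /* tm_test.c - minimal TM attach test; run it from inside a PBS job.
     * Build roughly as: cc tm_test.c -I<pbs-include-dir> -L<pbs-lib-dir> -lpbs
     */
    #include <stdio.h>
    #include "tm.h"

    int main(void)
    {
        struct tm_roots tm_root;
        int ret = tm_init(NULL, &tm_root);   /* the same call mpirun makes */
        if (ret != TM_SUCCESS) {
            fprintf(stderr, "tm_init failed with code %d\n", ret);
            return 1;
        }
        printf("tm_init OK: %d node(s) in this job\n", tm_root.tm_nnodes);
        tm_finalize();
        return 0;
    }

If that small program also dies inside tm_init under PBS Pro 10.2, the problem is on the libpbs side rather than in how Open MPI calls it.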
Thanks
Ralph

On Feb 15, 2010, at 12:06 PM, Joshua Bernstein wrote:

> Well,
>
> We all wish the Altair guys would at least try to maintain backwards
> compatibility with the community, but they have a big habit of breaking
> things. This isn't the first time they've broken a more customer-facing
> function like tm_spawn. (They also like breaking pbs_statjob too!)
>
> I have access to PBS Pro and I can raise the issue with Altair if it
> would help. Just let me know how I can be helpful.
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
>
> On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:
>
>> Bummer!
>>
>> If it helps, could you put us in touch with the PBS Pro people? We usually
>> only have access to Torque when developing the TM-launching stuff (PBS Pro
>> and Torque supposedly share the same TM interface, but we don't have access
>> to PBS Pro, so we don't know if it has diverged over time).
>>
>> On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:
>>
>>> Ralph,
>>>
>>> This is my first build of Open MPI, so I haven't had this working before.
>>> I'm pretty confident that PATH and LD_LIBRARY_PATH issues are not the
>>> cause; otherwise launches outside of PBS would fail too. Also, I tried
>>> compiling everything statically with the same result.
>>>
>>> Some additional info: (1) I did a diff on tm.h from PBS 10.2 and from
>>> version 8.0 that we had - they are identical, and (2) I've tried this with
>>> both the Intel 11.1 and GCC compilers and gotten the exact same run-time
>>> errors.
>>>
>>> For now, I've got a work-around setup that launches over ssh and still
>>> attaches the processes to PBS.
>>>
>>> Thanks for your help.
>>>
>>> Steve
>>>
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Friday, February 12, 2010 8:29 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>>
>>> Afraid compilers don't help when the param is a void*...
>>>
>>> It looks like this is consistent, but I've never tried it under that
>>> particular environment. Did prior versions of OMPI work, or are you trying
>>> this for the first time?
>>>
>>> One thing you might check is that you have the correct PATH and
>>> LD_LIBRARY_PATH set to point to this version of OMPI and the corresponding
>>> PBS Pro libs you used to build it. Most Linux distros come with OMPI
>>> installed, and that can cause surprises.
>>>
>>> We run under Torque at major installations every day, so it -should-
>>> work... unless PBS Pro has done something unusual.
>>>
>>> On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
>>>
>>>> Yes, the failure seems to be in mpirun; it never even gets to my
>>>> application.
>>>>
>>>> The proto for tm_init looks like this:
>>>>
>>>>     int tm_init(void *info, struct tm_roots *roots);
>>>>
>>>> where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id.
>>>>
>>>> If the API were different, wouldn't the compiler most likely generate an
>>>> error at compile time?
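>>>>
>>>> For reference, the declaration in Torque's public tm.h looks roughly like
>>>> the sketch below. The field names are from Torque's copy, so treat them as
>>>> an assumption for the PBS Pro header (and note that in Torque's copy the
>>>> last member is a pointer to tm_task_id):
>>>>
>>>>     struct tm_roots {
>>>>         tm_task_id  tm_me;          /* this process's task id              */
>>>>         tm_task_id  tm_parent;      /* task id of the process that spawned us */
>>>>         int         tm_nnodes;      /* number of nodes in the job          */
>>>>         int         tm_ntasks;      /* number of tasks started so far      */
>>>>         int         tm_taskpoolid;  /* task pool identifier                */
>>>>         tm_task_id *tm_tasklist;    /* list of task ids                    */
>>>>     };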
>>>>
>>>> Thanks!
>>>>
>>>> Steve
>>>>
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>> Behalf Of Ralph Castain
>>>> Sent: Friday, February 12, 2010 3:21 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>>>
>>>> I'm a tad confused - this trace would appear to indicate that mpirun is
>>>> failing, yes? Not your application?
>>>>
>>>> The reason it works for local procs is that tm_init isn't called in that
>>>> case - mpirun just fork/execs the procs directly. When remote nodes are
>>>> required, mpirun must connect to Torque. This is done with a call to:
>>>>
>>>>     ret = tm_init(NULL, &tm_root);
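>>>>
>>>> In slightly more detail, the TM interaction on the mpirun side is roughly
>>>> the sequence sketched below - a simplification, not the actual plm_tm
>>>> source, and the daemon command line is just a placeholder:
>>>>
>>>>     struct tm_roots tm_root;
>>>>     tm_node_id *node_ids = NULL;
>>>>     int num_nodes = 0;
>>>>     tm_task_id tid;
>>>>     tm_event_t event;
>>>>     int ret;
>>>>     char *daemon_argv[] = { "orted", NULL };       /* placeholder argv          */
>>>>
>>>>     ret = tm_init(NULL, &tm_root);                 /* attach to the MOM - this  */
>>>>                                                    /* is where the crash occurs */
>>>>     ret = tm_nodeinfo(&node_ids, &num_nodes);      /* get the allocated node ids */
>>>>     ret = tm_spawn(1, daemon_argv, NULL,           /* start a daemon on a       */
>>>>                    node_ids[1], &tid, &event);     /* remote node               */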
>>>>
>>>> My guess is that something changed in PBS Pro 10.2 in that API. Can you
>>>> check the tm header file and see? I have no access to PBS any more, so
>>>> I'll have to rely on your eyes to see a diff.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm having problems running Open MPI jobs under PBS Pro 10.2. I've
>>>>> configured and built Open MPI 1.4.1 with the Intel 11.1 compiler on Linux
>>>>> with --with-tm support, and the build runs fine. I've also built with
>>>>> static libraries per the FAQ suggestion, since libpbs is static. However,
>>>>> my test application keeps failing with a segmentation fault, but ONLY when
>>>>> trying to select more than 1 node. Running on a single node within PBS
>>>>> works fine. Also, running outside of PBS via ssh runs fine as well, even
>>>>> across multiple nodes. OpenIB support is also enabled, but that doesn't
>>>>> seem to affect the error, because I've also tried running with the --mca
>>>>> btl tcp,self flag and it still doesn't work. Here is the error I'm
>>>>> getting:
>>>>>
>>>>> [n34:26892] *** Process received signal ***
>>>>> [n34:26892] Signal: Segmentation fault (11)
>>>>> [n34:26892] Signal code: Address not mapped (1)
>>>>> [n34:26892] Failing at address: 0x3f
>>>>> [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
>>>>> [n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
>>>>> [n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
>>>>> [n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
>>>>> [n34:26892] [ 4] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
>>>>> [n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
>>>>> [n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
>>>>> [n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
>>>>> [n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
>>>>> [n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
>>>>> [n34:26892] *** End of error message ***
>>>>> Segmentation fault
>>>>>
>>>>> (NOTE: pbs_mpirun = orterun, mpirun, etc.)
>>>>>
>>>>> Has anyone else seen errors like this within PBS?
>>>>>
>>>>> ============================================
>>>>> Steve Repsher
>>>>> Boeing Defense, Space, & Security - Rotorcraft
>>>>> Aerodynamics/CFD
>>>>> Phone: (610) 591-1510
>>>>> Fax: (610) 591-6263
>>>>> stephen.j.reps...@boeing.com
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/