[OMPI users] torque integration when tm ras/plm isn't compiled in.
Hi all.

I'm trying to create a tight integration between TORQUE and Open MPI for cases where the tm ras and plm components aren't compiled into Open MPI. This scenario is common for Linux distros that ship Open MPI. Of course the ideal solution is to recompile Open MPI with TORQUE support, but this isn't always feasible, since I do not want to support my own version of Open MPI in the stuff I'm distributing to others. We also see some proprietary applications shipping their own embedded Open MPI libraries where the tm plm/ras is either missing or non-functional with the TORQUE installation on our system.

So, I've gotten as far as creating a pbsdshwrapper.py that mimics ssh behaviour closely enough that starting the orteds on all the hosts works as expected and the application starts correctly when I use:

  setenv OMPI_MCA_plm_rsh_agent "pbsdshwrapper.py"
  mpirun --hostfile $PBS_NODEFILE

What I want now is a way to get rid of the --hostfile $PBS_NODEFILE in the mpirun command. Is there an environment variable I can set so that mpirun grabs the right nodelist?

By spelunking the code I find that the rsh plm has support for SGE, where it automatically picks up $PE_HOSTFILE if it detects that it is launched within an SGE job. Would it be possible to have the same functionality for TORQUE? The code looks a bit too complex at first sight for me to fix this myself.

Best regards,
Roy.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.drags...@uit.no
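[The pbsdshwrapper.py itself was not attached to the post. A minimal sketch of what such an ssh-lookalike wrapper might look like, assuming Open MPI's rsh launcher invokes the agent roughly as `agent [options] <host> <command ...>` and that TORQUE's pbsdsh accepts `-h <host>`; the option handling below is illustrative, not taken from the original script:]

```python
#!/usr/bin/env python3
"""ssh-lookalike wrapper that launches the remote command via pbsdsh."""
import os
import sys

def ssh_to_pbsdsh(argv):
    """Translate ssh-style arguments into a pbsdsh command line.

    Skips leading ssh-style options (flags like -x, and -o Key=Value
    pairs), treats the first bare word as the target host, and passes
    the rest through as the remote command.
    """
    args = list(argv)
    while args and args[0].startswith("-"):
        if args[0] == "-o":          # -o consumes a value argument
            args = args[2:]
        else:                        # simple flags such as -x or -q
            args = args[1:]
    if not args:
        raise ValueError("no target host given")
    host, command = args[0], args[1:]
    return ["pbsdsh", "-h", host] + command

if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = ssh_to_pbsdsh(sys.argv[1:])
    os.execvp(cmd[0], cmd)   # replace ourselves with pbsdsh
```

The key point is that the wrapper must be argument-compatible with ssh, because the rsh plm builds its command line assuming an ssh-like launcher.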
Re: [OMPI users] torque integration when tm ras/plm isn't compiled in.
On Friday 23 October 2009 00:50:00 Ralph Castain wrote:
> Why not just
>
>   setenv OMPI_MCA_orte_default_hostfile $PBS_NODEFILE
>
> assuming you are using 1.3.x, of course.
>
> If not, then you can use the equivalent for 1.2 - ompi_info would tell
> you the name of it.

THANKS! Just what I was looking for. I'd been looking up and down for it, but couldn't find the right swear words.

Is it also possible to disable the backgrounding of the orted daemons? When they fork into the background one loses the feedback about cpu usage in the job. Not really a big issue though...

Regards,
r.
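[The fix above relies on Open MPI's documented convention that any MCA parameter `<p>` can be set through the environment variable `OMPI_MCA_<p>`. A small sketch of a launcher helper built on that convention; the mpirun invocation at the end is illustrative and therefore left commented out:]

```python
"""Build an environment that sets Open MPI MCA parameters via OMPI_MCA_*."""
import os

def mca_environ(params, base=None):
    """Return a copy of the environment with OMPI_MCA_* variables added."""
    env = dict(base if base is not None else os.environ)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env

# Point Open MPI at the TORQUE nodefile and at the ssh-replacement agent.
env = mca_environ({
    "orte_default_hostfile": os.environ.get("PBS_NODEFILE", "/dev/null"),
    "plm_rsh_agent": "pbsdshwrapper.py",
})
# import subprocess
# subprocess.call(["mpirun", "./my_app"], env=env)  # launch inside the PBS job
```

With both parameters exported in the job script, a bare `mpirun ./my_app` picks up the nodelist without any --hostfile argument.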
[OMPI users] Memory corruption?
Hi.

We have started to scale up one of our codes and sometimes we get messages like this:

  [c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from the registration cache. Possible memory corruption.

It seems like the application runs normally and it does not crash because of this. Should we be worried? We have tested the code with up to 1700 cores and the message becomes more frequent as we scale up.

System details:
  Rocks 5.2 (aka CentOS 5.3) x86_64
  Intel Compiler 11.1
  OFED 1.4.1
  Open MPI 1.3.3

Best regards and Merry Christmas to all,
r.
Re: [OMPI users] Memory corruption?
> Hi.
>
> We have started to scale up one of our codes and sometimes we get messages
> like this:
>
>   [c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from
>   the registration cache. Possible memory corruption.
>
> It seems like the application runs normally and it does not crash because of
> this. Should we be worried? We have tested the code with up to 1700 cores
> and the message becomes more frequent as we scale up.

Never mind, this turned out to be an application bug: a buffer was freed before the corresponding MPI_Isend had completed.

r.
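[The bug described above — reusing or freeing a buffer before a nonblocking send completes — can be illustrated with a toy "nonblocking send" in pure Python. The class and method names here are analogies for MPI_Isend/MPI_Wait, not real MPI calls; the point is only that the library may read the buffer at any time up until the wait completes:]

```python
"""Toy model of the Isend hazard: the buffer belongs to the library until wait()."""
import threading
import time

class FakeIsend:
    """Reads `buf` in the background, the way MPI_Isend may transfer it later."""
    def __init__(self, buf):
        self.buf = buf
        self.sent = None
        self._t = threading.Thread(target=self._send)
        self._t.start()

    def _send(self):
        time.sleep(0.05)            # the actual transfer happens "later"
        self.sent = list(self.buf)  # buffer is read at completion time

    def wait(self):                 # the MPI_Wait analogue
        self._t.join()
        return self.sent

# Wrong: the buffer is reused ("freed") before the send has completed,
# so the transfer sees the overwritten contents.
buf = [1, 2, 3]
req = FakeIsend(buf)
buf[:] = [0, 0, 0]                  # reused too early
corrupted = req.wait()

# Right: wait for completion first, then the buffer may be reused.
buf2 = [1, 2, 3]
req2 = FakeIsend(buf2)
ok = req2.wait()
buf2[:] = [0, 0, 0]
```

In the toy model `corrupted` ends up as the overwritten data while `ok` carries the original values, which is exactly why the registration-cache warning pointed at application-level memory misuse rather than an Open MPI bug.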
[OMPI users] Openmpi 1.3 problems with libtool-ltdl on CentOS 4 and 5
Hi, all.

I do not know if this is to be considered a real bug or not; I'm just reporting it here so people can find it if they google around for the error message this produces. There is a backtrace at the end of this mail.

Problem description: Open MPI 1.3 seems to be nonfunctional when used with the libltdl from libtool v1.5 that is installed on CentOS (aka RH EL) 4 and 5. Upgrading to libtool v2.2.6a (and maybe earlier versions) solves the problem. We saw this problem with both gcc and icc.

Here is a code snippet that is extracted from the real application, nestcrash.c:

  #include <mpi.h>
  #include <ltdl.h>

  int main(int argc, char *argv[])
  {
      MPI_Init(&argc, &argv);
      char *dummy = "dummy";
      const lt_dlhandle hModule = lt_dlopenext(dummy);
      return 0;
  }

This will crash in MPI_Init when using libtool 1.5.x; if you comment out the lt_dlopenext call it will run normally. I can provide a complete example if necessary. As I said earlier, upgrading to libtool 2.2.6a solved the problem for us.

Here is the backtrace:

  *** Process received signal ***
  Signal: Segmentation fault (11)
  Signal code: (128)
  Failing at address: (nil)
  [ 0] /lib64/tls/libpthread.so.0 [0x3ffce0c4f0]
  [ 1] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0 [0x2a95d4bce5]
  [ 2] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(lt_dlopenadvise+0xf0) [0x2a95d4b470]
  [ 3] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0 [0x2a95d56e1f]
  [ 4] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(mca_base_component_find+0x58d) [0x2a95d5657d]
  [ 5] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(mca_base_components_open+0x1ae) [0x2a95d581be]
  [ 6] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(opal_paffinity_base_open+0xad) [0x2a95d73ddd]
  [ 7] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(opal_init+0x64) [0x2a95d43e64]
  [ 8] /global/apps/openmpi/1.3rc2/lib/libopen-rte.so.0(orte_init+0x1e) [0x2a95bdeb8e]
  [ 9] /global/apps/openmpi/1.3rc2/lib/libmpi.so.0 [0x2a95a38fee]
  [10] /global/apps/openmpi/1.3rc2/lib/libmpi.so.0(PMPI_Init_thread+0x72) [0x2a95a5b9c2]
  [11] nest-ompi_1.3rc2/bin/nest(_ZN4nest12Communicator4initEPiPPPc+0x11f) [0x55440f]
  [12] nest-ompi_1.3rc2/bin/nest(main+0x74) [0x4a7674]
  [13] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x339271c3fb]
  [14] nest-ompi_1.3rc2/bin/nest(_ZNSt8ios_base4InitD1Ev+0x5a) [0x4a756a]
  *** End of error message ***
Re: [OMPI users] Openmpi 1.3 problems with libtool-ltdl on CentOS 4 and 5
On Friday 23 January 2009 15:31:59 Jeff Squyres wrote:
> Ew. Yes, I can see this being a problem.
>
> I'm guessing that the real issue is that OMPI embeds the libltdl from
> LT 2.2.6a inside libopen_pal (one of the internal OMPI libraries).
> Waving my hands a bit, but it's not hard to imagine some sort of clash
> is going on between the -lltdl you added to the command line and the
> libltdl that is embedded in OMPI's libraries.
>
> Can you verify that this is what is happening?

Hi, sorry for the delay.

I'm not very familiar with the workings of ltdl; I got this from one of our users. Would you suggest that if one uses Open MPI 1.3 together with ltdl, one should not explicitly link with -lltdl? At least this seems to work correctly with the example I posted: I can link the program without specifying -lltdl, so the symbols seem to resolve to something in the Open MPI libraries, and the example runs without crashing.

Regards,
r.