Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
Date: Thu, 10 Dec 2009 17:57:27 -0500
From: Jeff Squyres

On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?

Definitely not as good; that's why we have sm. :-) I don't have any quantification of that assertion, though (i.e., no numbers to back that up).

However, as Eugene wrote earlier, you can actually increase the number of fifos used by the SM and avoid the hang that way. Unless you are really strapped for memory, I think that would be the best way to go.

--td
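As a rough illustration of the two options discussed in this thread, the sketch below only builds and prints the two mpirun command lines rather than running them. The btl_sm_num_fifos parameter and the ^sm exclusion are both quoted in a follow-up message in this thread; the process count, the choice of NP-1 fifos, and the application name ./my_mpi_app are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of launching MPI jobs.
NP=8                 # assumed process count
APP=./my_mpi_app     # hypothetical application binary

# Option A: keep the shared-memory (sm) BTL but raise the FIFO count.
# NP-1 mirrors the "-np 8 ... btl_sm_num_fifos 7" run reported in this thread.
CMD_FIFOS="mpirun -np $NP -mca btl_sm_num_fifos $((NP - 1)) $APP"

# Option B: exclude the sm BTL so on-node traffic uses another transport.
CMD_NOSM="mpirun -np $NP -mca btl ^sm $APP"

echo "$CMD_FIFOS"
echo "$CMD_NOSM"
```

Which option is faster may depend on the system, so it is probably worth timing both on the actual workload.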
Re: [OMPI users] checkpoint opempi-1.3.3+sge62
Hi Josh,

Here is the file. I will try to apply the trunk, but I think I broke my Open MPI installation by doing "something", and I don't know what :-( . I was modifying the MCA parameters... When I submit a job, the orted daemon spawned on the SLAVE host is launched in a loop until the copies use up all the reserved memory. It is very strange, so I will compile it again, reproduce the bug, and then test the trunk.

Thanks a lot for the support and the tickets opened.

Sergio

sdiaz 30279 0.0 0.0  1888  560 ? Ds 12:54 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute
sdiaz 30286 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess env -mca orte_ess_jobid 219
sdiaz 30322 0.0 0.0 52772 1188 ? S  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30358 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30394 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30430 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30466 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30502 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30538 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz 30574 0.0 0.0 52772 1188 ? D  12:54 0:00 \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted

Josh Hursey wrote:

On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:

Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch directory in which the jobs keep their temporary files. After setting TMPDIR to /tmp, checkpointing works! However, when I try to restart it, I get the following error (see ERROR 1). The -v option adds these lines (see ERROR 2).

It is concerning that ompi-restart is segfault'ing when it errors out.
The error message is being generated between the launch of the opal-restart starter command and when we try to exec(cr_restart). Usually the failure is related to a corruption of the metadata stored in the checkpoint.

Can you send me the file below:
ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data

I was able to reproduce the segv (at least I think it is the same one). We failed to check the validity of a string when we parse the metadata. I committed a fix to the trunk in r22290, and requested that the fix be moved to the v1.4 and v1.5 branches. If you are interested in seeing when they get applied, you can follow these tickets:
https://svn.open-mpi.org/trac/ompi/ticket/2140
https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The development trunk and v1.5 series have a bunch of improvements to the C/R functionality that were never brought over to the v1.3/v1.4 series.

I was trying to use ssh instead of rsh, but it was impossible. By default it should use ssh and, if it finds a problem, fall back to rsh. It seems that ssh doesn't work, because it always uses rsh. If I change this MCA parameter, it still uses rsh. If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to use ssh and doesn't work. I get --> "bash: orted: command not found" and the MPI process dies. The command it tries to execute is the following, and I haven't yet found the reason why this command doesn't find orted, because I set up /etc/bashrc so that I always get the right path, and I have the right path in my application (see ERROR4).

This seems like an SGE specific issue, so a bit out of my domain. Maybe others have suggestions here.

-- Josh

Many thanks!,
Sergio

P.S. Sorry about these long emails. I am just trying to show you useful information to identify my problems.
ERROR 1

> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> -
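Since this kind of restart failure is usually tied to corrupted checkpoint metadata, a quick sanity check before calling ompi-restart may save a round trip. This is a minimal sketch: the helper name check_meta is hypothetical (it is not part of Open MPI), the path is the one quoted in this message, and the check only tests that the file exists and is non-empty. It says nothing about whether the contents are valid.

```shell
#!/bin/sh
# check_meta is a hypothetical helper, not an Open MPI tool.
# It reports whether a snapshot metadata file exists and is non-empty.
check_meta() {
    if [ -s "$1" ]; then
        echo "ok"
    else
        echo "missing-or-empty"
    fi
}

# Snapshot name and layout as quoted in this thread.
SNAPSHOT=ompi_global_snapshot_28454.ckpt
META="$SNAPSHOT/0/opal_snapshot_0.ckpt/snapshot_meta.data"

echo "$META: $(check_meta "$META")"
```

Run it from the directory that holds the global snapshot; a "missing-or-empty" result would point at the checkpoint step rather than at the restart.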
Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and appeared to hang after several thousand iterations) than -mca btl ^sm.

Is there another, better way I should be modifying fifos to get better performance?

Matt

On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:

>> Date: Thu, 10 Dec 2009 17:57:27 -0500
>> From: Jeff Squyres
>>
>> On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
>>
>>> How does the efficiency of loopback
>>> (let's say, over TCP and over IB) compare with "sm"?
>>
>> Definitely not as good; that's why we have sm. :-) I don't have any
>> quantification of that assertion, though (i.e., no numbers to back that up).
>
> However, as Eugene wrote earlier you can actually increase the number of
> fifos used by the SM and avoid the hang that way. Unless you are really
> strapped for memory I think that would be the best way to go.
>
> --td
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Sorry -- I neglected to update the list yesterday: I got the RM approval and committed the fix to the v1.4 branch. So the PGI fix should be in last night's 1.4 snapshot.

Could someone out in the wild give it a whirl and let me know if it works for you? (it works for *me*)

On Dec 10, 2009, at 4:15 PM, Jeff Squyres (jsquyres) wrote:

> On Dec 10, 2009, at 4:02 PM, Joshua Bernstein wrote:
>
> > > On Dec 9, 2009, at 4:36 PM, Jeff Squyres wrote:
> > > Given that we haven't moved this patch to the v1.4 branch yet (i.e., it's not
> > > yet in a nightly v1.4 tarball), probably the easiest thing to do is to apply
> > > the attached patch to a v1.4 tarball. I tried it with my PGI 10.0 install
> > > and it seems to work. So -- forget everything about autogen.sh and just
> > > apply the attached patch.
> >
> > Is there a reason why it hasn't moved into 1.4 yet or wasn't included with the
> > 1.4 release?
>
> 1.4 was *ONLY* about upgrading to Libtool 2.2.6b.
>
> The only reason it hasn't moved to the 1.4 branch yet is because Brian hadn't
> reviewed my patch yet. :-) He reviewed it this afternoon, so it's just
> awaiting release manager approval.
>
> > Can I toss my two cents in here and request it be made available in a mainline
> > release, or at least in a snapshot sooner rather than later? I'd like to get it
> > included in our build in time for our next release.
>
> I'll see if I can nudge everyone to get it into the branch today and
> therefore into tonight's snapshot.
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] OpenMPI 1.4 RPM Spec file problem
On Dec 9, 2009, at 4:47 PM, Jim Kusznir wrote:

> One (on gcc only): the D_FORTIFY_SOURCE build failure. I've had to
> move the if test "$using_gcc" = 0; then line down to after the
> RPM_OPT_FLAGS= that includes D_FORTIFY_SOURCE; otherwise the compile
> blows up.

Hmm. Can you explain why / provide more detail?

> The second, and in my opinion, more major rpm spec file bug is
> something with the files specification. I build multiple versions of
> OpenMPI to accommodate the collection of compilers I use (on this
> machine, I have intel 10.1 and GCC, and will have to add 9.1 per user
> request); on others, I use PGI and GCC. In any case, here's my build
> command for Intel:
>
> CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb --define
> 'install_in_opt 1' --define 'install_modulefile 1' --define
> 'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0' --define
> 'configure_options --with-tm=/opt/torque' --define '_name
> openmpi-intel' openmpi-1.4.spec
>
> Unfortunately, the filespec is somehow broken and it ends up missing
> most (all?)
> the files, and failing in the final stage of RPM creation:
>
> ---
> Processing files: openmpi-intel-docs-1.4-1
> Finding Provides: /usr/lib/rpm/find-provides openmpi-intel
> Finding Requires: /usr/lib/rpm/find-requires openmpi-intel
> Finding Supplements: /usr/lib/rpm/find-supplements openmpi-intel
> Requires(rpmlib): rpmlib(PayloadFilesHavePrefix) <= 4.0-1
> rpmlib(CompressedFileNames) <= 3.0.4-1
> Requires: openmpi-intel-runtime
> Checking for unpackaged file(s): /usr/lib/rpm/check-files
> /var/tmp/openmpi-intel-1.4-1-root
> error: Installed (but unpackaged) file(s) found:
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>    /opt/openmpi-intel/1.4/bin/mpiCC-vt
>    /opt/openmpi-intel/1.4/bin/mpic++-vt
>    /opt/openmpi-intel/1.4/bin/mpicc-vt
>    /opt/openmpi-intel/1.4/bin/mpicxx-vt
>    /opt/openmpi-intel/1.4/bin/mpif77-vt
>    /opt/openmpi-intel/1.4/bin/mpif90-vt
>    /opt/openmpi-intel/1.4/bin/ompi-checkpoint
>    /opt/openmpi-intel/1.4/bin/ompi-clean
>    /opt/openmpi-intel/1.4/bin/ompi-iof
>    /opt/openmpi-intel/1.4/bin/ompi-ps
>    /opt/openmpi-intel/1.4/bin/ompi-restart
>    /opt/openmpi-intel/1.4/bin/ompi-server
>    /opt/openmpi-intel/1.4/bin/opari
>    /opt/openmpi-intel/1.4/bin/orte-clean
>    /opt/openmpi-intel/1.4/bin/orte-iof
>    /opt/openmpi-intel/1.4/bin/orte-ps
>    /opt/openmpi-intel/1.4/bin/otfdecompress
>    /opt/openmpi-intel/1.4/bin/vtcc
>    /opt/openmpi-intel/1.4/bin/vtcxx
>    /opt/openmpi-intel/1.4/bin/vtf77
>    /opt/openmpi-intel/1.4/bin/vtf90
>    /opt/openmpi-intel/1.4/bin/vtfilter
>    /opt/openmpi-intel/1.4/bin/vtunify
>    /opt/openmpi-intel/1.4/etc/openmpi-default-hostfile
>    /opt/openmpi-intel/1.4/etc/openmpi-mca-params.conf
>    /opt/openmpi-intel/1.4/etc/openmpi-totalview.tcl
>    /opt/openmpi-intel/1.4/share/FILTER.SPEC
>    /opt/openmpi-intel/1.4/share/GROUPS.SPEC
>    /opt/openmpi-intel/1.4/share/METRICS.SPEC
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/ChangeLog
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/LICENSE
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.html
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.pdf
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/ChangeLog
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/LICENSE
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/Readme.html
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.pdf
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.ps.gz
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/opari-logo-100.gif
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/ChangeLog
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/LICENSE
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/otftools.pdf
>    /opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/specification.pdf
>    /opt/openmpi-intel/1.4/share/vtcc-wrapper-data.txt
>    /opt/openmpi-intel/1.4/share/vtcxx-wrapper-data.txt
>    /opt/openmpi-intel/1.4/share/vtf77-wrapper-data.txt
>    /opt/openmpi-intel/1.4/share/vtf90-wrapper-data.txt
>
>
> RPM build errors:
> Installed (but unpackaged) file(s) found:
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>    /opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>    /opt/openmpi-intel/1.4/bin/mpiCC-vt
>    /opt/openmpi-intel/1.4/bin/mpic++-vt
>    /opt/openmpi-intel/1.4/bin/mpicc-vt
>    /opt/openmpi-intel/1.4/bin/mpicxx-vt
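For anyone building one RPM set per compiler as described in the quoted report, the build command can be parameterized so that only the compiler names and the package suffix change. This is a minimal sketch that prints the command lines rather than running rpmbuild. The helper build_cmd is hypothetical, the --define values are copied from the quoted Intel command, and the GCC compiler names and the openmpi-gcc package name are assumptions that do not come from the report.

```shell
#!/bin/sh
# --define options copied from the Intel build command quoted in this thread.
DEFINES="--define 'install_in_opt 1' --define 'install_modulefile 1'"
DEFINES="$DEFINES --define 'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0'"
DEFINES="$DEFINES --define 'configure_options --with-tm=/opt/torque'"

# build_cmd is a hypothetical helper: it prints (does not run) one rpmbuild
# command line. Arguments: package suffix, then CC, CXX, F77, FC.
build_cmd() {
    printf "CC=%s CXX=%s F77=%s FC=%s rpmbuild -bb %s --define '_name openmpi-%s' openmpi-1.4.spec\n" \
        "$2" "$3" "$4" "$5" "$DEFINES" "$1"
}

build_cmd intel icc icpc ifort ifort
# Assumed GCC toolchain names; adjust to the local installation.
build_cmd gcc gcc g++ gfortran gfortran
```

Printing the commands first makes it easy to compare the per-compiler invocations before running them; this does nothing about the unpackaged-files error itself.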
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Jeff,

Subject: Re: [OMPI users] Problem building OpenMPI with PGI compilers
From: Jeff Squyres
Date: Thu, 10 Dec 2009 10:20:32 -0500
To: Open MPI Users

...

Actually, I was wrong. You *can't* just take the SVN trunk's autogen.sh and use it with a v1.4 tarball (for various uninteresting reasons).

Given that we haven't moved this patch to the v1.4 branch yet (i.e., it's not yet in a nightly v1.4 tarball), probably the easiest thing to do is to apply the attached patch to a v1.4 tarball. I tried it with my PGI 10.0 install and it seems to work. So -- forget everything about autogen.sh and just apply the attached patch.

Thanks; I was able to complete the make process using the provided patch.

--
Best regards,

David Turner
User Services Group        email: dptur...@lbl.gov
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab      fax: (510) 486-4316