Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Terry Dontje


Date: Thu, 10 Dec 2009 17:57:27 -0500
From: Jeff Squyres 

On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?



Definitely not as good; that's why we have sm.  :-)  I don't have any
quantification of that assertion, though (i.e., no numbers to back that up).


However, as Eugene wrote earlier, you can actually increase the number of
FIFOs used by the sm BTL and avoid the hang that way.  Unless you are really
strapped for memory, I think that would be the best way to go.
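
For example, a minimal sketch (the application name is illustrative; the
parameter can also be set in the environment or an MCA parameter file):

  # give the sm BTL N-1 FIFOs for an N-process job on one node
  mpirun -np 8 --mca btl_sm_num_fifos 7 ./my_mpi_app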


--td



Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-12-11 Thread Sergio Díaz

Hi Josh,

Here is the file.

I will try the trunk, but I think I broke my Open MPI installation by
doing "something", and I don't know what :-( . I was modifying the MCA
parameters...
When I submit a job, the orted daemon spawned on the SLAVE host is
launched in a loop until it exhausts all the reserved memory.
It is very strange, so I will compile it again, reproduce the bug, and
then test the trunk.


Thanks a lot for the support and for opening the tickets.
Sergio


sdiaz  30279  0.0  0.0  1888  560 ?  Ds  12:54  0:00  \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute
sdiaz  30286  0.0  0.0 52772 1188 ?  D   12:54  0:00    \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted -mca ess env -mca orte_ess_jobid 219
sdiaz  30322  0.0  0.0 52772 1188 ?  S   12:54  0:00      \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30358  0.0  0.0 52772 1188 ?  D   12:54  0:00        \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30394  0.0  0.0 52772 1188 ?  D   12:54  0:00          \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30430  0.0  0.0 52772 1188 ?  D   12:54  0:00            \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30466  0.0  0.0 52772 1188 ?  D   12:54  0:00              \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30502  0.0  0.0 52772 1188 ?  D   12:54  0:00                \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30538  0.0  0.0 52772 1188 ?  D   12:54  0:00                  \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
sdiaz  30574  0.0  0.0 52772 1188 ?  D   12:54  0:00                    \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted





Josh Hursey wrote:


On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:


Hi Josh,

You were right. The main problem was /tmp. SGE uses a scratch
directory in which the jobs keep their temporary files. After setting
TMPDIR to /tmp, checkpointing works!
However, when I try to restart it... I get the following error (see
ERROR1). Running with the -v option adds these lines (see ERROR2).
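
(A minimal sketch of the checkpoint/restart sequence being described;
the PID is illustrative and must be mpirun's PID:)

  export TMPDIR=/tmp                    # avoid SGE's per-job scratch dir
  mpirun -am ft-enable-cr -np 2 ./my_app &
  ompi-checkpoint -v <mpirun-PID>       # writes ompi_global_snapshot_<PID>.ckpt
  ompi-restart ompi_global_snapshot_<PID>.ckpt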


It is concerning that ompi-restart is segfault'ing when it errors out. 
The error message is being generated between the launch of the 
opal-restart starter command and when we try to exec(cr_restart). 
Usually the failure is related to a corruption of the metadata stored 
in the checkpoint.


Can you send me the file below:
 ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data 



I was able to reproduce the segv (at least I think it is the same 
one). We failed to check the validity of a string when we parse the 
metadata. I committed a fix to the trunk in r22290, and requested that 
the fix be moved to the v1.4 and v1.5 branches. If you are interested 
in seeing when they get applied, you can follow these tickets:

  https://svn.open-mpi.org/trac/ompi/ticket/2140
  https://svn.open-mpi.org/trac/ompi/ticket/2141

Can you try the trunk to see if the problem goes away? The development
trunk and the v1.5 series have a bunch of improvements to the C/R
functionality that were never brought over to the v1.3/v1.4 series.
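
(A sketch of checking out and building the trunk with C/R support, assuming
the usual SVN layout and a BLCR install; the paths are illustrative and the
configure flags should be checked against the C/R user's guide for your tree:)

  svn co http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
  cd ompi-trunk && ./autogen.sh
  ./configure --prefix=$HOME/ompi-trunk-install --with-ft=cr \
      --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local/blcr
  make -j4 all install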




I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh and, if it finds a problem, fall back to rsh;
however, it seems that ssh never works because it always uses rsh.

If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to use
ssh but doesn't work. I get --> "bash: orted: command not found" and
the MPI process dies.
The command it tries to execute is the following, and I haven't yet found
the reason why it can't find orted, because I set up /etc/bashrc to
always provide the right PATH, and I have the right path in my
application. (see ERROR4)
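
(A minimal sketch of forcing the rsh/ssh launcher under SGE; the parameter
names follow the report above, and the application name is illustrative:)

  export OMPI_MCA_plm_rsh_disable_qrsh=1    # bypass SGE's qrsh
  export OMPI_MCA_plm_rsh_agent=ssh         # prefer ssh over rsh
  # "orted: command not found" over ssh usually means the remote
  # non-interactive shell doesn't have Open MPI in its PATH; mpirun's
  # --prefix option sets PATH/LD_LIBRARY_PATH on the remote side:
  mpirun --prefix /opt/cesga/openmpi-1.3.3 -np 4 ./my_mpi_app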


This seems like an SGE-specific issue, so it is a bit out of my domain.
Maybe others have suggestions here.


-- Josh




Many thanks!,
Sergio

P.S. Sorry about these long emails. I am just trying to give you useful
information to help identify my problems.



ERROR 1

> [sdiaz@compute-3-18 ~]$ ompi-restart ompi_global_snapshot_28454.ckpt
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Matthew MacManes
On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and
appeared to hang after several thousand iterations) than -mca btl ^sm.

Is there a better way to modify the FIFOs to get better performance?
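
In the meantime, a quick way to see which sm parameters a given build
actually exposes (a sketch; names and defaults vary by version):

  ompi_info --param btl sm      # lists sm BTL parameters with their defaults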

Matt



On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:

>> Date: Thu, 10 Dec 2009 17:57:27 -0500
>> From: Jeff Squyres 
>>
>> On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
>>
>>> How does the efficiency of loopback
>>> (let's say, over TCP and over IB) compare with "sm"?
>>
>> Definitely not as good; that's why we have sm.  :-)  I don't have any
>> quantification of that assertion, though (i.e., no numbers to back that up).
>>
> However, as Eugene wrote earlier, you can actually increase the number of
> FIFOs used by the sm BTL and avoid the hang that way.  Unless you are really
> strapped for memory, I think that would be the best way to go.
>
> --td

_
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/








Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-11 Thread Jeff Squyres
Sorry -- I neglected to update the list yesterday: I got the RM approval and 
committed the fix to the v1.4 branch.  So the PGI fix should be in last night's 
1.4 snapshot.

Could someone out in the wild give it a whirl and let me know if it works for 
you?  (it works for *me*)
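
(A sketch of grabbing and testing a nightly v1.4 snapshot; the tarball name
below is a placeholder, since it changes nightly:)

  wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4a1rXXXXX.tar.bz2
  tar xjf openmpi-1.4a1rXXXXX.tar.bz2 && cd openmpi-1.4a1rXXXXX
  ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 --prefix=$HOME/ompi-pgi
  make -j4 all install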



On Dec 10, 2009, at 4:15 PM, Jeff Squyres (jsquyres) wrote:

> On Dec 10, 2009, at 4:02 PM, Joshua Bernstein wrote:
> 
> > > On Dec 9, 2009, at 4:36 PM, Jeff Squyres wrote:
> > > Given that we haven't moved this patch to the v1.4 branch yet (i.e.,
> > > it's not yet in a nightly v1.4 tarball), probably the easiest thing to
> > > do is to apply the attached patch to a v1.4 tarball.  I tried it with
> > > my PGI 10.0 install and it seems to work.  So -- forget everything
> > > about autogen.sh and just apply the attached patch.
> >
> > Is there a reason why it hasn't moved into 1.4 yet or wasn't included
> > with the 1.4 release?
> 
> 1.4 was *ONLY* about upgrading to Libtool 2.2.6b.
> 
> The only reason it hasn't moved to the 1.4 branch yet is because Brian hadn't 
> reviewed my patch yet.  :-)  He reviewed it this afternoon, so it's just 
> awaiting release manager approval.
> 
> > Can I toss my two cents in here and request it be made available in a
> > mainline release, or at least in a snapshot, sooner rather than later?
> > I'd like to get it included in our build in time for our next release.
> 
> I'll see if I can nudge everyone to get it into the branch today and 
> therefore into tonight's snapshot.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] OpenMPI 1.4 RPM Spec file problem

2009-12-11 Thread Jeff Squyres
On Dec 9, 2009, at 4:47 PM, Jim Kusznir wrote:

> One (on gcc only): the -D_FORTIFY_SOURCE build failure.  I've had to
> move the 'if test "$using_gcc" = 0; then' line down to below the
> RPM_OPT_FLAGS= assignment that includes -D_FORTIFY_SOURCE; otherwise
> the compile blows up.

Hmm.  Can you explain why / provide more detail?

> The second, and in my opinion more major, rpm spec file bug is
> something with the files specification.  I build multiple versions of
> OpenMPI to accommodate the collection of compilers I use (on this
> machine, I have Intel 10.1 and GCC, and will have to add 9.1 per user
> request); on others, I use PGI and GCC.  In any case, here's my build
> command for Intel:
> 
> CC=icc CXX=icpc F77=ifort FC=ifort rpmbuild -bb \
>   --define 'install_in_opt 1' --define 'install_modulefile 1' \
>   --define 'modules_rpm_name Modules' --define 'build_all_in_one_rpm 0' \
>   --define 'configure_options --with-tm=/opt/torque' \
>   --define '_name openmpi-intel' openmpi-1.4.spec
> 
> Unfortunately, the file spec is somehow broken and it ends up missing
> most (all?) of the files, failing in the final stage of RPM creation:
> 
> ---
> Processing files: openmpi-intel-docs-1.4-1
> Finding  Provides: /usr/lib/rpm/find-provides openmpi-intel
> Finding  Requires: /usr/lib/rpm/find-requires openmpi-intel
> Finding  Supplements: /usr/lib/rpm/find-supplements openmpi-intel
> Requires(rpmlib): rpmlib(PayloadFilesHavePrefix) <= 4.0-1
> rpmlib(CompressedFileNames) <= 3.0.4-1
> Requires: openmpi-intel-runtime
> Checking for unpackaged file(s): /usr/lib/rpm/check-files
> /var/tmp/openmpi-intel-1.4-1-root
> error: Installed (but unpackaged) file(s) found:
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>/opt/openmpi-intel/1.4/bin/mpiCC-vt
>/opt/openmpi-intel/1.4/bin/mpic++-vt
>/opt/openmpi-intel/1.4/bin/mpicc-vt
>/opt/openmpi-intel/1.4/bin/mpicxx-vt
>/opt/openmpi-intel/1.4/bin/mpif77-vt
>/opt/openmpi-intel/1.4/bin/mpif90-vt
>/opt/openmpi-intel/1.4/bin/ompi-checkpoint
>/opt/openmpi-intel/1.4/bin/ompi-clean
>/opt/openmpi-intel/1.4/bin/ompi-iof
>/opt/openmpi-intel/1.4/bin/ompi-ps
>/opt/openmpi-intel/1.4/bin/ompi-restart
>/opt/openmpi-intel/1.4/bin/ompi-server
>/opt/openmpi-intel/1.4/bin/opari
>/opt/openmpi-intel/1.4/bin/orte-clean
>/opt/openmpi-intel/1.4/bin/orte-iof
>/opt/openmpi-intel/1.4/bin/orte-ps
>/opt/openmpi-intel/1.4/bin/otfdecompress
>/opt/openmpi-intel/1.4/bin/vtcc
>/opt/openmpi-intel/1.4/bin/vtcxx
>/opt/openmpi-intel/1.4/bin/vtf77
>/opt/openmpi-intel/1.4/bin/vtf90
>/opt/openmpi-intel/1.4/bin/vtfilter
>/opt/openmpi-intel/1.4/bin/vtunify
>/opt/openmpi-intel/1.4/etc/openmpi-default-hostfile
>/opt/openmpi-intel/1.4/etc/openmpi-mca-params.conf
>/opt/openmpi-intel/1.4/etc/openmpi-totalview.tcl
>/opt/openmpi-intel/1.4/share/FILTER.SPEC
>/opt/openmpi-intel/1.4/share/GROUPS.SPEC
>/opt/openmpi-intel/1.4/share/METRICS.SPEC
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/ChangeLog
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/LICENSE
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.html
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/UserManual.pdf
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/ChangeLog
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/LICENSE
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/Readme.html
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.pdf
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/lacsi01.ps.gz
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/opari/opari-logo-100.gif
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/ChangeLog
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/LICENSE
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/otftools.pdf
>/opt/openmpi-intel/1.4/share/vampirtrace/doc/otf/specification.pdf
>/opt/openmpi-intel/1.4/share/vtcc-wrapper-data.txt
>/opt/openmpi-intel/1.4/share/vtcxx-wrapper-data.txt
>/opt/openmpi-intel/1.4/share/vtf77-wrapper-data.txt
>/opt/openmpi-intel/1.4/share/vtf90-wrapper-data.txt
> 
> 
> RPM build errors:
> Installed (but unpackaged) file(s) found:
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfaux
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfcompress
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfconfig
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfdump
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfinfo
>/opt/openmpi-intel/1.4/bin/ia64-suse-linux-otfmerge
>/opt/openmpi-intel/1.4/bin/mpiCC-vt
>/opt/openmpi-intel/1.4/bin/mpic++-vt
>/opt/openmpi-intel/1.4/bin/mpicc-vt
>/opt/openmpi-intel/1.4/bin/mpicxx-vt

Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-11 Thread David Turner

Jeff,


Subject: Re: [OMPI users] Problem building OpenMPI with PGI compilers
From: Jeff Squyres 
Date: Thu, 10 Dec 2009 10:20:32 -0500
To: Open MPI Users 

...

Actually, I was wrong.  You *can't* just take the SVN trunk's autogen.sh and 
use it with a v1.4 tarball (for various uninteresting reasons).

Given that we haven't moved this patch to the v1.4 branch yet (i.e., it's not 
yet in a nightly v1.4 tarball), probably the easiest thing to do is to apply 
the attached patch to a v1.4 tarball.  I tried it with my PGI 10.0 install and 
it seems to work.  So -- forget everything about autogen.sh and just apply the 
attached patch.


Thanks; I was able to complete the make process using the provided
patch.
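
(For reference, a sketch of applying such a patch to a v1.4 tarball; the
patch file name is illustrative, and the -p level depends on how the diff
was generated:)

  tar xjf openmpi-1.4.tar.bz2 && cd openmpi-1.4
  patch -p0 < ../pgi-v1.4.patch     # try -p1 if -p0 does not apply
  ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 && make all install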

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax:   (510) 486-4316