Re: [OMPI users] Question on virtual memory allocated
On May 12, 2010, at 8:19 AM, Olivier Riff wrote:

> What I do not understand is where the value of 2104m for the virtual memory comes from.
> When I add the value of Mem used (777848k) to the value of the cache (339184k), the total is by far inferior to the virtual memory (2104m).
> Is part of the memory allocated by the clients taken into account here?

No, top only shows the data from one machine.

> Where are these 2104m of data physically allocated?

They may be in physical memory and may also be swapped out on disk.

Keep in mind that the virtual memory encompasses *all* memory for an application -- its code and its data.  Hence, this also includes shared libraries (which may be shared amongst several processes on the same machine), process-specific instructions, process-specific data, and shared process data.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
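[Editor's note: the difference between virtual size (VmSize, what top shows as VIRT) and resident memory (VmRSS) is easy to see on Linux with a small standalone program, not related to Open MPI itself. The sketch below maps a large anonymous region: VmSize grows immediately, but VmRSS only grows once the pages are actually touched. Reading /proc/self/status is Linux-specific and used here purely for illustration.]

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Print the VmSize and VmRSS lines from /proc/self/status (Linux-specific). */
    static void show_mem(const char *label)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (f == NULL) return;
        printf("--- %s ---\n", label);
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
    }

    int main(void)
    {
        size_t len = 1024UL * 1024UL * 1024UL;   /* 1 GB of address space */
        size_t i;
        char *p;

        show_mem("before mmap");

        /* Reserve address space; no physical pages are used yet. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        show_mem("after mmap (VmSize grows, VmRSS barely changes)");

        /* Touch every page: now resident memory (VmRSS) grows too. */
        if (p != MAP_FAILED)
            for (i = 0; i < len; i += 4096)
                p[i] = 1;
        show_mem("after touching the pages");

        return 0;
    }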
Re: [OMPI users] Open MPI 1.4.2 released
On May 13, 2010, at 3:20 PM, Aleksej Saushev wrote:

> > - Various OpenBSD and NetBSD build and run-time fixes.  Many thanks to
> >   the OpenBSD community for their time, expertise, and patience
> >   getting these fixes incorporated into Open MPI's main line.
>
> This didn't happen in 1.4.2, all patches we (NetBSD, not OpenBSD) distribute for 1.4.1 still apply cleanly.

Blast.  Sorry about that; can you send a pointer to your patches?  I seem to recall that we were amenable to some of the changes, but not all of them.

...looking through the SVN commit logs, we did commit *some* BSD-related things to the v1.4 tree.  For example:

    https://svn.open-mpi.org/trac/ompi/changeset/22751
    https://svn.open-mpi.org/trac/ompi/changeset/22890
    https://svn.open-mpi.org/trac/ompi/changeset/22936

These fixes were what I was referring to in the changelog.

Additionally, I just put some work in the development trunk that adds support for a new embedded software support library (hwloc), but, based on feedback from you and others, it allows you to configure one of 3 ways:

 * Use the embedded hwloc
 * Use an external hwloc installation
 * Don't compile hwloc at all

I plan on extending this technique to libltdl, which should hopefully obviate at least some of your patches.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openmpi + share points
Sorry for the delay in replying.

It is probably much easier to NFS-share the installation directory so that the exact same installation directory is available on all nodes.  For example, if you installed OMPI into /opt/openmpi-1.4.2, then make /opt/openmpi-1.4.2 available on all nodes (even if they're mounted network shares).

Can you try that?

On May 10, 2010, at 9:04 AM, Christophe Peyret wrote:

> Hello,
>
> I am building a cluster with 6 Apple Xserve machines running OS X Server 10.6:
>
> node1.cluster
> node2.cluster
> node3.cluster
> node4.cluster
> node5.cluster
> node6.cluster
>
> I've installed openmpi in directory /opt/openmpi-1.4.2 of node1, then I made a share point of /opt -> /Network/opt and defined the variables
>
> export MPI_HOME=/Network/opt/openmpi-1.4.2
> export OPAL_PREFIX=/Network/opt/openmpi-1.4.2
>
> I can access openmpi from all nodes.  However, I still face a problem when I launch a computation:
>
> mpirun --prefix /Network/opt/openmpi-1.4.2 -n 4 -hostfile ~peyret/hostfile space64 -f Test/cfm56_hp_Rigid/cfm56_hp_Rigid.def -fast
>
> It returns the error message:
>
> [node2.cluster:09163] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
> [node4.cluster:08867] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
> [node3.cluster:08880] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
>
> Any idea?
>
> Christophe

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] 'readv failed: Connection timed out' issue
On May 10, 2010, at 11:00 AM, Guanyinzhu wrote:

> Did "--mca mpi_preconnect_all 1" work?
>
> I also face this problem ("readv failed: Connection timed out") in the production environment.  Our engineer reproduced this scenario on 20 nodes with gigabit Ethernet, with one Ethernet link limited to 2MB/s, running an MPI_Isend && MPI_Recv ring: each node calls MPI_Isend to send data to the next node and then calls MPI_Recv to receive data from the prior node, with large message sizes, for many cycles.  We then get the following error log:
>
> [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

FWIW, I just had a customer last week have these kinds of issues; in every case, he actually tracked the problem down to hardware issues (e.g., he swapped out ethernet cables and the problems went away).

Keep in mind that Open MPI is simply reporting what the OS tells us.  Specifically, Linux has decided to close the socket with a "timed out" error when we tried to read from it.

> I thought it might be because the network fd was set nonblocking: the nonblocking connect() might fail, and epoll_wait() is woken up by the error but treats it as success and calls mca_btl_tcp_endpoint_recv_handler().  The nonblocking readv() call on the fd whose connect failed then returns -1 and sets errno to 110, which means connection timed out.

Hmm.  That's an interesting scenario; do you know that that is happening?  But even if it is -- meaning that we're simply printing out the wrong error message -- the connect() shouldn't fail.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
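[Editor's note: Open MPI's TCP BTL is more involved than this, but the scenario described above — treating an event-loop wakeup as a successful connect — is usually avoided with the following generic pattern: after a nonblocking connect() and a readiness notification, check SO_ERROR before reading.  The helper name below is illustrative only.]

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Generic pattern: after a nonblocking connect() and an event-loop wakeup,
     * check SO_ERROR before assuming the connection succeeded.  Returns 0 if
     * the connect completed, or the pending socket error (e.g. ETIMEDOUT, 110). */
    static int check_nonblocking_connect(int fd)
    {
        int err = 0;
        socklen_t len = sizeof(err);

        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            return errno;           /* getsockopt itself failed */
        if (err != 0)
            fprintf(stderr, "connect failed: %s (%d)\n", strerror(err), err);
        return err;                 /* 0 means the socket is usable */
    }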
Re: [OMPI users] Segmentation fault at program end with 2+ processes
Ouch.  These are the worst kinds of bugs to find.  :-(

If you attach a debugger to these processes and step through the final death throes of the process, does it provide any additional insight?

I have not infrequently done stuff like this:

    {
        int i = 0;
        printf("Process %d ready to attach\n", getpid());
        while (i == 0) sleep(5);
    }

Then you get a message indicating which pid to attach to.  When you attach, set the variable i to nonzero and you can continue stepping through the process.

On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:

> Apologies for the vague details of the problem I'm about to describe, but then I only understand it vaguely.  Any pointers about the best directions for further investigation would be appreciated.  Lengthy details follow:
>
> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run into some weird behaviour.  When run under mpiexec, a segmentation fault is thrown:
>
> % mpiexec -n 2 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.0695 minutes
> [queen:23560] *** Process received signal ***
> [queen:23560] Signal: Segmentation fault (11)
> [queen:23560] Signal code: (128)
> [queen:23560] Failing at address: (nil)
> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
> [queen:23560] *** End of error message ***
> mpiexec noticed that job rank 1 with PID 23560 on node queen.bioinformatics exited on signal 11 (Segmentation fault).
>
> Right, so I've got a memory overrun or something.  Except that when the program is run in standalone mode, it works fine:
>
> % ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05970 minutes
>
> Right, so there's a difference between my standalone and MPI modes.  Except that the difference between my standalone and MPI versions is currently nothing but the calls to MPI_Init, MPI_Finalize and some exploratory calls to MPI_Comm_size and MPI_Comm_rank.  (I haven't gotten as far as coding the problem division.)  Also, calling mpiexec with 1 process always works:
>
> % mpiexec -n 1 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05801 minutes
>
> So there's still this segmentation fault.  Running valgrind across the program doesn't show any obvious problems: there was some quirky pointer arithmetic and some huge blocks of dangling memory, but these were only leaked at the end of the program (i.e. the original programmer didn't bother cleaning up at program termination).  I've caught most of those.  But the segmentation fault still occurs only when run under mpiexec with 2 or more processes.  And by use of diagnostic printfs and logging, I can see that it only occurs at the very end of the program, the very end of main, possibly when destructors are being automatically called.  But again this cleanup doesn't cause any problems with the standalone or 1-process modes.
>
> So, any ideas for where to start looking?
>
> Technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64, Red Hat 4.1.2-42
>
> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
> Bioinformatics, Centre for Infections, Health Protection Agency

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
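[Editor's note: a variant of Jeff's attach trick that can be left in the code permanently is to guard it with an environment variable, and optionally pause only selected ranks.  The variable name WAIT_FOR_DEBUGGER below is made up for illustration; call the helper right after MPI_Init.]

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Pause for a debugger only when WAIT_FOR_DEBUGGER (hypothetical name) is
     * set in the environment.  In the debugger: "set var i = 1", then "continue". */
    static void wait_for_debugger(void)
    {
        volatile int i = 0;
        int rank = -1;

        if (getenv("WAIT_FOR_DEBUGGER") == NULL)
            return;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Rank %d (pid %d) waiting for debugger attach\n", rank, (int) getpid());
        fflush(stdout);
        while (i == 0)
            sleep(5);
    }

From another terminal you can then attach with "gdb -p <pid>", set i to nonzero, and continue; runs without the environment variable set are unaffected.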
Re: [OMPI users] MPI_FILE_SET_ERRHANDLER returns an error with MPI_FILE_NULL
Yes, this is a bug.  Thanks for identifying the issue!

I have committed a fix to our development tree and have filed a request to have it moved into the 1.4 and 1.5 series.  You can download a patch for the specific fix here:

    https://svn.open-mpi.org/trac/ompi/changeset/23145

On May 7, 2010, at 4:52 PM, Secretan Yves wrote:

> Hello,
>
> According to my understanding of the documentation, it should be possible to set the default error handler for files with MPI_FILE_SET_ERRHANDLER.  However, the following small Fortran 77 program fails: MPI_FILE_SET_ERRHANDLER returns an error.
>
> =====================================
>       PROGRAM H2D2_MAIN
>
>       INCLUDE 'mpif.h'
>
>       EXTERNAL HNDLR
> C
>       CALL MPI_INIT(I_ERR)
>
>       I_HDLR = 0
>       CALL MPI_FILE_CREATE_ERRHANDLER(HNDLR, I_HDLR, I_ERR)
>       WRITE(*,*) 'MPI_FILE_CREATE_ERRHANDLER: ', I_ERR
>       CALL MPI_FILE_SET_ERRHANDLER (MPI_FILE_NULL, I_HDLR, I_ERR)
>       WRITE(*,*) 'MPI_FILE_SET_ERRHANDLER: ', I_ERR
>
>       END
>
>       SUBROUTINE HNDLR(I_CNTX, I_ERR)
>       WRITE(*,*) 'In HNDLR: MPI Error detected'
>       RETURN
>       END
>
> Did I miss something obvious?
>
> Regards
>
> Yves Secretan
> Professeur
> yves.secre...@ete.inrs.ca
>
> Avant d'imprimer, pensez à l'environnement

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
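[Editor's note: for reference, the same call sequence in C — setting the error handler on MPI_FILE_NULL defines the default handler for files opened afterwards.  This is a sketch of the pattern the MPI standard describes, not code from the thread.]

    #include <mpi.h>
    #include <stdio.h>

    /* Custom error handler for file errors. */
    static void file_err_handler(MPI_File *fh, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI file error: %s\n", msg);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler errh;

        MPI_Init(&argc, &argv);

        MPI_File_create_errhandler(file_err_handler, &errh);
        /* Setting the handler on MPI_FILE_NULL makes it the default for
         * files opened after this point. */
        MPI_File_set_errhandler(MPI_FILE_NULL, errh);

        MPI_Errhandler_free(&errh);
        MPI_Finalize();
        return 0;
    }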
Re: [OMPI users] getc in openmpi
On May 12, 2010, at 1:01 PM, Fernando Lemos wrote:

> On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres wrote:
>> On May 12, 2010, at 1:48 PM, Hanjun Kim wrote:
>>
>>> I am working on parallelizing my sequential program using OpenMPI.  Although I got performance speedup using many threads, there was slowdown on a small number of threads like 4 threads.
>>> I found that it is because getc worked much slower than the sequential version.  Does OpenMPI override or wrap the getc function?
>>
>> No.
>
> Please correct me if I'm wrong, but I believe Open MPI forwards stdout/stderr from the other ranks back to mpirun so that their output can be displayed, and does the reverse with stdin (mpirun's stdin is forwarded to rank 0).  Otherwise it wouldn't be possible to even see the output from the other ranks.  I guess that could make things slower.
>
> MPICH-2 had a command line option that told mpiexec which processes would receive stdin (all of them or only some of them), so that you could do things like having mpiexec distribute the contents of a big file across the network.

FWIW: OMPI has the same capability via cmd line args, as shown by "mpirun -h":

    -stdin|--stdin    Specify procs to receive stdin [rank, all, none]
                      (default: 0, indicating rank 0)

> Regards,
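[Editor's note: with the default behavior (only rank 0 receives stdin), a common alternative to forwarding stdin to every process is to read on rank 0 and broadcast to the other ranks.  A minimal sketch, not from the thread:]

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Read lines from stdin on rank 0 and broadcast them to all other ranks,
     * instead of relying on stdin being forwarded to every process. */
    int main(int argc, char **argv)
    {
        char line[1024];
        int rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (;;) {
            if (rank == 0) {
                /* len < 0 signals end-of-input to the other ranks. */
                len = (fgets(line, sizeof(line), stdin) != NULL)
                          ? (int) strlen(line) + 1 : -1;
            }
            MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (len < 0)
                break;
            MPI_Bcast(line, len, MPI_CHAR, 0, MPI_COMM_WORLD);
            /* ... every rank can now use the contents of 'line' ... */
        }

        MPI_Finalize();
        return 0;
    }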
Re: [OMPI users] Question on virtual memory allocated
Thank you Jeff for your explanation.  It is much clearer now.

Best regards,

Olivier

2010/5/15 Jeff Squyres

> On May 12, 2010, at 8:19 AM, Olivier Riff wrote:
>
> > What I do not understand is where the value of 2104m for the virtual memory comes from.
> > When I add the value of Mem used (777848k) to the value of the cache (339184k), the total is by far inferior to the virtual memory (2104m).
> > Is part of the memory allocated by the clients taken into account here?
>
> No, top only shows the data from one machine.
>
> > Where are these 2104m of data physically allocated?
>
> They may be in physical memory and may also be swapped out on disk.
>
> Keep in mind that the virtual memory encompasses *all* memory for an application -- its code and its data.  Hence, this also includes shared libraries (which may be shared amongst several processes on the same machine), process-specific instructions, process-specific data, and shared process data.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] Enabling IPsec
Dear all,

I would like to know if it is possible to enable IPsec in Open MPI.  I would like to enable it to do some measurements on the performance.

Thanks in advance.

Regards,
Eisa Al Shamsi