Re: [OMPI users] Question on virtual memory allocated
On May 12, 2010, at 8:19 AM, Olivier Riff wrote:

> What I do not understand is where the value of 2104m for the virtual memory comes from.
> When I add the value of Mem used (777848k) to the value of the cache (339184k), the total is by far inferior to the virtual memory (2104m).
> Is part of the memory allocated by the clients taken into account here?

No, top only shows the data from one machine.

> Where are these 2104m of data physically allocated?

They may be in physical memory and may also be swapped out on disk.

Keep in mind that the virtual memory encompasses *all* memory for an application -- its code and its data.  Hence, this also includes shared libraries (which may be shared amongst several processes on the same machine), process-specific instructions, process-specific data, and shared process data.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
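[Editor's note: the difference between virtual size (VmSize, what top shows as VIRT) and resident memory (VmRSS) is easy to see on Linux with a small standalone program, not related to Open MPI itself. The sketch below maps a large anonymous region: VmSize grows immediately, but VmRSS only grows once the pages are actually touched. Reading /proc/self/status is Linux-specific and used here purely for illustration.]

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Print the VmSize and VmRSS lines from /proc/self/status (Linux-specific). */
    static void show_mem(const char *label)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (f == NULL) return;
        printf("--- %s ---\n", label);
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
    }

    int main(void)
    {
        size_t len = 1024UL * 1024UL * 1024UL;   /* 1 GB of address space */
        size_t i;
        char *p;

        show_mem("before mmap");

        /* Reserve address space; no physical pages are used yet. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        show_mem("after mmap (VmSize grows, VmRSS barely changes)");

        /* Touch every page: now resident memory (VmRSS) grows too. */
        if (p != MAP_FAILED)
            for (i = 0; i < len; i += 4096)
                p[i] = 1;
        show_mem("after touching the pages");

        return 0;
    }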
Re: [OMPI users] Open MPI 1.4.2 released
On May 13, 2010, at 3:20 PM, Aleksej Saushev wrote:

> > - Various OpenBSD and NetBSD build and run-time fixes.  Many thanks to
> >   the OpenBSD community for their time, expertise, and patience
> >   getting these fixes incorporated into Open MPI's main line.
>
> This didn't happen in 1.4.2, all patches we (NetBSD, not OpenBSD) distribute for 1.4.1 still apply cleanly.

Blast.  Sorry about that; can you send a pointer to your patches?  I seem to recall that we were amenable to some of the changes, but not all of them.

...looking through the SVN commit logs, we did commit *some* BSD-related things to the v1.4 tree.  For example:

    https://svn.open-mpi.org/trac/ompi/changeset/22751
    https://svn.open-mpi.org/trac/ompi/changeset/22890
    https://svn.open-mpi.org/trac/ompi/changeset/22936

These fixes were what I was referring to in the changelog.

Additionally, I just put some work in the development trunk that adds support for a new embedded software support library (hwloc), but, based on feedback from you and others, it allows you to configure one of 3 ways:

 * Use the embedded hwloc
 * Use an external hwloc installation
 * Don't compile hwloc at all

I plan on extending this technique to libltdl, which should hopefully obviate at least some of your patches.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openmpi + share points
Sorry for the delay in replying.

It is probably much easier to NFS-share the installation directory so that the exact same installation directory is available on all nodes.  For example, if you installed OMPI into /opt/openmpi-1.4.2, then make /opt/openmpi-1.4.2 available on all nodes (even if they're mounted network shares).

Can you try that?

On May 10, 2010, at 9:04 AM, Christophe Peyret wrote:

> Hello,
>
> I am building a cluster with 6 Apple Xserve machines running OS X Server 10.6:
>
> node1.cluster
> node2.cluster
> node3.cluster
> node4.cluster
> node5.cluster
> node6.cluster
>
> I've installed openmpi in directory /opt/openmpi-1.4.2 of node1, then I made a share point of /opt -> /Network/opt and defined the variables
>
> export MPI_HOME=/Network/opt/openmpi-1.4.2
> export OPAL_PREFIX=/Network/opt/openmpi-1.4.2
>
> I can access openmpi from all nodes.  However, I still face a problem when I launch a computation:
>
> mpirun --prefix /Network/opt/openmpi-1.4.2 -n 4 -hostfile ~peyret/hostfile space64 -f Test/cfm56_hp_Rigid/cfm56_hp_Rigid.def -fast
>
> It returns the error message:
>
> [node2.cluster:09163] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
> [node4.cluster:08867] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
> [node3.cluster:08880] mca: base: component_find: unable to open /Network/opt/openmpi-1.4.2/lib/openmpi/mca_odls_default: file not found (ignored)
>
> Any idea?
>
> Christophe

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] 'readv failed: Connection timed out' issue
On May 10, 2010, at 11:00 AM, Guanyinzhu wrote:

> Did "--mca mpi_preconnect_all 1" work?
>
> I also face this problem ("readv failed: Connection timed out") in the production environment.  Our engineer reproduced this scenario on 20 nodes with gigabit Ethernet, with one Ethernet link limited to 2MB/s, running an MPI_Isend && MPI_Recv ring: each node calls MPI_Isend to send data to the next node and then calls MPI_Recv to receive data from the prior node, with large message sizes, for many cycles.  We then get the following error log:
>
> [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

FWIW, I just had a customer last week have these kinds of issues; in every case, he actually tracked the problem down to hardware issues (e.g., he swapped out ethernet cables and the problems went away).

Keep in mind that Open MPI is simply reporting what the OS tells us.  Specifically, Linux has decided to close the socket with a "timed out" error when we tried to read from it.

> I thought it might be because the network fd was set nonblocking: the nonblocking connect() might fail, and epoll_wait() is woken up by the error but treats it as success and calls mca_btl_tcp_endpoint_recv_handler().  The nonblocking readv() call on the fd whose connect failed then returns -1 and sets errno to 110, which means connection timed out.

Hmm.  That's an interesting scenario; do you know that that is happening?  But even if it is -- meaning that we're simply printing out the wrong error message -- the connect() shouldn't fail.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
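[Editor's note: Open MPI's TCP BTL is more involved than this, but the scenario described above — treating an event-loop wakeup as a successful connect — is usually avoided with the following generic pattern: after a nonblocking connect() and a readiness notification, check SO_ERROR before reading.  The helper name below is illustrative only.]

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Generic pattern: after a nonblocking connect() and an event-loop wakeup,
     * check SO_ERROR before assuming the connection succeeded.  Returns 0 if
     * the connect completed, or the pending socket error (e.g. ETIMEDOUT, 110). */
    static int check_nonblocking_connect(int fd)
    {
        int err = 0;
        socklen_t len = sizeof(err);

        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            return errno;           /* getsockopt itself failed */
        if (err != 0)
            fprintf(stderr, "connect failed: %s (%d)\n", strerror(err), err);
        return err;                 /* 0 means the socket is usable */
    }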
Re: [OMPI users] Segmentation fault at program end with 2+ processes
Ouch.  These are the worst kinds of bugs to find.  :-(

If you attach a debugger to these processes and step through the final death throes of the process, does it provide any additional insight?

I have not infrequently done stuff like this:

    {
        int i = 0;
        printf("Process %d ready to attach\n", getpid());
        while (i == 0) sleep(5);
    }

Then you get a message indicating which pid to attach to.  When you attach, set the variable i to nonzero and you can continue stepping through the process.

On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:

> Apologies for the vague details of the problem I'm about to describe, but then I only understand it vaguely.  Any pointers about the best directions for further investigation would be appreciated.  Lengthy details follow:
>
> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run into some weird behaviour.  When run under mpiexec, a segmentation fault is thrown:
>
> % mpiexec -n 2 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.0695 minutes
> [queen:23560] *** Process received signal ***
> [queen:23560] Signal: Segmentation fault (11)
> [queen:23560] Signal code: (128)
> [queen:23560] Failing at address: (nil)
> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
> [queen:23560] *** End of error message ***
> mpiexec noticed that job rank 1 with PID 23560 on node queen.bioinformatics exited on signal 11 (Segmentation fault).
>
> Right, so I've got a memory overrun or something.  Except that when the program is run in standalone mode, it works fine:
>
> % ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05970 minutes
>
> Right, so there's a difference between my standalone and MPI modes.  Except that the difference between my standalone and MPI versions is currently nothing but the calls to MPI_Init, MPI_Finalize and some exploratory calls to MPI_Comm_size and MPI_Comm_rank.  (I haven't gotten as far as coding the problem division.)  Also, calling mpiexec with 1 process always works:
>
> % mpiexec -n 1 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05801 minutes
>
> So there's still this segmentation fault.  Running valgrind across the program doesn't show any obvious problems: there was some quirky pointer arithmetic and some huge blocks of dangling memory, but these were only leaked at the end of the program (i.e. the original programmer didn't bother cleaning up at program termination).  I've caught most of those.  But the segmentation fault still occurs only when run under mpiexec with 2 or more processes.  And by use of diagnostic printfs and logging, I can see that it only occurs at the very end of the program, the very end of main, possibly when destructors are being automatically called.  But again this cleanup doesn't cause any problems with the standalone or 1-process modes.
>
> So, any ideas for where to start looking?
>
> Technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64, Red Hat 4.1.2-42
>
> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
> Bioinformatics, Centre for Infections, Health Protection Agency

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
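[Editor's note: a variant of Jeff's attach trick that can be left in the code permanently is to guard it with an environment variable, and optionally pause only selected ranks.  The variable name WAIT_FOR_DEBUGGER below is made up for illustration; call the helper right after MPI_Init.]

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Pause for a debugger only when WAIT_FOR_DEBUGGER (hypothetical name) is
     * set in the environment.  In the debugger: "set var i = 1", then "continue". */
    static void wait_for_debugger(void)
    {
        volatile int i = 0;
        int rank = -1;

        if (getenv("WAIT_FOR_DEBUGGER") == NULL)
            return;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Rank %d (pid %d) waiting for debugger attach\n", rank, (int) getpid());
        fflush(stdout);
        while (i == 0)
            sleep(5);
    }

From another terminal you can then attach with "gdb -p <pid>", set i to nonzero, and continue; runs without the environment variable set are unaffected.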
Re: [OMPI users] MPI_FILE_SET_ERRHANDLER returns an error with MPI_FILE_NULL
Yes, this is a bug.  Thanks for identifying the issue!

I have committed a fix to our development tree and have filed a request to have it moved into the 1.4 and 1.5 series.  You can download a patch for the specific fix here:

    https://svn.open-mpi.org/trac/ompi/changeset/23145

On May 7, 2010, at 4:52 PM, Secretan Yves wrote:

> Hello,
>
> According to my understanding of the documentation, it should be possible to set the default error handler for files with MPI_FILE_SET_ERRHANDLER.  However, the following small Fortran 77 program fails: MPI_FILE_SET_ERRHANDLER returns an error.
>
> =====================================
>       PROGRAM H2D2_MAIN
>
>       INCLUDE 'mpif.h'
>
>       EXTERNAL HNDLR
> C
>       CALL MPI_INIT(I_ERR)
>
>       I_HDLR = 0
>       CALL MPI_FILE_CREATE_ERRHANDLER(HNDLR, I_HDLR, I_ERR)
>       WRITE(*,*) 'MPI_FILE_CREATE_ERRHANDLER: ', I_ERR
>       CALL MPI_FILE_SET_ERRHANDLER (MPI_FILE_NULL, I_HDLR, I_ERR)
>       WRITE(*,*) 'MPI_FILE_SET_ERRHANDLER: ', I_ERR
>
>       END
>
>       SUBROUTINE HNDLR(I_CNTX, I_ERR)
>       WRITE(*,*) 'In HNDLR: MPI Error detected'
>       RETURN
>       END
>
> Did I miss something obvious?
>
> Regards
>
> Yves Secretan
> Professeur
> yves.secre...@ete.inrs.ca
>
> Avant d'imprimer, pensez à l'environnement

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
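[Editor's note: for reference, the same call sequence in C — setting the error handler on MPI_FILE_NULL defines the default handler for files opened afterwards.  This is a sketch of the pattern the MPI standard describes, not code from the thread.]

    #include <mpi.h>
    #include <stdio.h>

    /* Custom error handler for file errors. */
    static void file_err_handler(MPI_File *fh, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI file error: %s\n", msg);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler errh;

        MPI_Init(&argc, &argv);

        MPI_File_create_errhandler(file_err_handler, &errh);
        /* Setting the handler on MPI_FILE_NULL makes it the default for
         * files opened after this point. */
        MPI_File_set_errhandler(MPI_FILE_NULL, errh);

        MPI_Errhandler_free(&errh);
        MPI_Finalize();
        return 0;
    }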
Re: [OMPI users] getc in openmpi
On May 12, 2010, at 1:01 PM, Fernando Lemos wrote:

> On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres wrote:
>> On May 12, 2010, at 1:48 PM, Hanjun Kim wrote:
>>
>>> I am working on parallelizing my sequential program using OpenMPI.  Although I got performance speedup using many threads, there was slowdown on a small number of threads like 4 threads.
>>> I found that it is because getc worked much slower than the sequential version.  Does OpenMPI override or wrap the getc function?
>>
>> No.
>
> Please correct me if I'm wrong, but I believe Open MPI forwards stdout/stderr from the other ranks back to mpirun so that their output can be displayed, and does the reverse with stdin (mpirun's stdin is forwarded to rank 0).  Otherwise it wouldn't be possible to even see the output from the other ranks.  I guess that could make things slower.
>
> MPICH-2 had a command line option that told mpiexec which processes would receive stdin (all of them or only some of them), so that you could do things like having mpiexec distribute the contents of a big file across the network.

FWIW: OMPI has the same capability via cmd line args, as shown by "mpirun -h":

    -stdin|--stdin    Specify procs to receive stdin [rank, all, none]
                      (default: 0, indicating rank 0)

> Regards,
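[Editor's note: with the default behavior (only rank 0 receives stdin), a common alternative to forwarding stdin to every process is to read on rank 0 and broadcast to the other ranks.  A minimal sketch, not from the thread:]

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Read lines from stdin on rank 0 and broadcast them to all other ranks,
     * instead of relying on stdin being forwarded to every process. */
    int main(int argc, char **argv)
    {
        char line[1024];
        int rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (;;) {
            if (rank == 0) {
                /* len < 0 signals end-of-input to the other ranks. */
                len = (fgets(line, sizeof(line), stdin) != NULL)
                          ? (int) strlen(line) + 1 : -1;
            }
            MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (len < 0)
                break;
            MPI_Bcast(line, len, MPI_CHAR, 0, MPI_COMM_WORLD);
            /* ... every rank can now use the contents of 'line' ... */
        }

        MPI_Finalize();
        return 0;
    }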
Re: [OMPI users] Question on virtual memory allocated
Thank you Jeff for your explanation.  It is much clearer now.

Best regards,

Olivier

2010/5/15 Jeff Squyres

> On May 12, 2010, at 8:19 AM, Olivier Riff wrote:
>
> > What I do not understand is where the value of 2104m for the virtual memory comes from.
> > When I add the value of Mem used (777848k) to the value of the cache (339184k), the total is by far inferior to the virtual memory (2104m).
> > Is part of the memory allocated by the clients taken into account here?
>
> No, top only shows the data from one machine.
>
> > Where are these 2104m of data physically allocated?
>
> They may be in physical memory and may also be swapped out on disk.
>
> Keep in mind that the virtual memory encompasses *all* memory for an application -- its code and its data.  Hence, this also includes shared libraries (which may be shared amongst several processes on the same machine), process-specific instructions, process-specific data, and shared process data.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] Enabling IPsec
Dear all,

I would like to know if it is possible to enable IPsec in Open MPI.  I would like to enable it to do some measurements on the performance.

Thanks in advance.

Regards,
Eisa Al Shamsi