Re: [OMPI users] General question on the implementation of a "scheduler" on client side...
Hello Jeff,

thanks for your detailed answer.

2010/5/20 Jeff Squyres:
> You're basically talking about implementing some kind of
> application-specific protocol. A few tips that may help in your design:
>
> 1. Look into MPI_Isend / MPI_Irecv for non-blocking sends and receives.
> These may be particularly useful on the server side, so that it can do
> other stuff while sends and receives are progressing.

-> You are definitely right, I have to switch to non-blocking sends and
receives.

> 2. You probably already noticed that collective operations (broadcasts and
> the like) need to be invoked by all members of the communicator. So if you
> want to do a broadcast, everyone needs to know. That being said, you can
> send a short message to everyone alerting them that a longer broadcast is
> coming -- then they can execute MPI_BCAST, etc. That works best if your
> broadcasts are large messages (i.e., you benefit from scalable
> implementations of broadcast) -- otherwise you're individually sending
> short messages followed by a short broadcast. There might not be much of a
> "win" there.

-> That is what I was thinking of implementing. As you mentioned, and
specifically for my case where I mainly send short messages, there might not
be much win. By the way, are there any benchmarks testing sequential
MPI_Isend versus MPI_Bcast, for instance? The aim is to determine above which
buffer size an MPI_Bcast is worth using in my case. You can answer that the
test is easy to do and that I can run it myself :)

> 3. FWIW, the MPI Forum has introduced the concept of non-blocking
> collective operations for the upcoming MPI-3 spec. These may help; google
> for libnbc for a (non-optimized) implementation that may be of help to you.
> MPI implementations (like Open MPI) will feature non-blocking collectives
> someday in the future.

-> Interesting to know and to keep in mind. Sometimes the future is really
near.

Thanks again for your answer and info.

Olivier

> On May 20, 2010, at 5:30 AM, Olivier Riff wrote:
>
>> Hello,
>>
>> I have a general question about the best way to implement an Open MPI
>> application, i.e. the design of the application.
>>
>> A machine (I call it the "server") should regularly send tasks to do
>> (byte buffers of widely varying size) to a cluster containing a lot of
>> processors (the "clients").
>> The server should send a different buffer to each client, then wait for
>> each client's answer (a buffer sent back by each client after some
>> processing), and retrieve the result data.
>>
>> First I made something looking like this.
>> On the server side: send buffers sequentially to each client using
>> MPI_Send.
>> On each client side: a loop that waits for a buffer using MPI_Recv, then
>> processes the buffer and sends the result back using MPI_Send.
>> This is really not efficient, because a lot of time is lost due to the
>> fact that the server sends and receives the buffers sequentially.
>> Its only advantage is a pretty simple scheduler on the client side:
>> wait for a buffer (MPI_Recv) -> analyse it -> send the result (MPI_Send).
>>
>> My wish is to mix MPI_Send/MPI_Recv with other MPI functions like
>> MPI_Bcast/MPI_Scatter/MPI_Gather... (as I imagine every MPI application
>> does). The problem is that I cannot find an easy way for each client to
>> know which kind of MPI function is currently called by the server: if the
>> server calls MPI_Bcast, the client should do the same. Sending a
>> preliminary message each time to indicate which function the server will
>> call next does not look very nice, yet I do not see an easy/better way to
>> implement an "adaptive" scheduler on the client side.
>>
>> Any tip, advice, or help would be appreciated.
>>
>> Thanks,
>>
>> Olivier
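For tip 1 above, a minimal sketch of such a non-blocking server loop could
look like the following (an illustration, not code from the thread): the
server posts one MPI_Irecv per client up front, sends the tasks with
MPI_Isend, and uses MPI_Waitany to handle whichever result arrives first.
The fixed BUF_SIZE, the task/result arrays, and the single-task-per-client
flow are placeholder assumptions.

/* Sketch of a non-blocking master/worker exchange.
 * Assumptions: fixed maximum buffer size, one task per client. */
#include <mpi.h>
#include <stdlib.h>

#define BUF_SIZE 4096   /* assumed maximum task/result size */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* server */
        int nclients = size - 1;
        char (*result)[BUF_SIZE] = malloc(nclients * sizeof *result);
        char (*task)[BUF_SIZE]   = malloc(nclients * sizeof *task);
        MPI_Request *recv_req = malloc(nclients * sizeof *recv_req);
        MPI_Request *send_req = malloc(nclients * sizeof *send_req);

        for (int c = 0; c < nclients; c++) {
            /* post the receive for the client's answer before sending its task */
            MPI_Irecv(result[c], BUF_SIZE, MPI_BYTE, c + 1, 0,
                      MPI_COMM_WORLD, &recv_req[c]);
            /* ... fill task[c] here ... */
            MPI_Isend(task[c], BUF_SIZE, MPI_BYTE, c + 1, 0,
                      MPI_COMM_WORLD, &send_req[c]);
        }

        for (int done = 0; done < nclients; done++) {
            int idx;
            MPI_Status st;
            /* handle whichever client answers first */
            MPI_Waitany(nclients, recv_req, &idx, &st);
            /* ... consume result[idx]; a real server could post a new
                   task to client idx+1 here ... */
        }
        MPI_Waitall(nclients, send_req, MPI_STATUSES_IGNORE);
        free(result); free(task); free(recv_req); free(send_req);
    } else {                               /* client */
        char buf[BUF_SIZE];
        MPI_Status st;
        /* a real client would loop here until it gets a termination message */
        MPI_Recv(buf, BUF_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
        /* ... process buf ... */
        MPI_Send(buf, BUF_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Posting the receives before the sends keeps results out of the unexpected-
message path and lets the MPI library make progress on all clients at once,
instead of the server polling them one after the other.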
[OMPI users] An error occured in MPI_Bcast; MPI_ERR_TYPE: invalid datatype
Hi folks,

openMPI 1.4.1 seems to have another problem with my machine, or something on
it. This little program here (compiled with mpif90), started with
"mpiexec -np 4 a.out", produces the output below. Surprisingly, the same
thing written in C code (compiled with mpiCC) works without a problem. Could
it be interference with other MPI distributions, although I think I have
deleted them all?

Note: the error also occurs with my climate model. The error is nearly the
same, only with MPI_ERR_TYPE: invalid root. I've compiled openMPI not as
root, but in my home directory.

Thanks for your advice,
Klaus

My machine:
> OpenMPI-version 1.4.1 compiled with Lahey Fortran 95 (lf95).
> OpenMPI was compiled "out of the box", only changing to the Lahey compiler
> with a setenv $FC lf95
>
> The system: Linux marvin 2.6.27.6-1 #1 SMP Sat Nov 15 20:19:04 CET 2008
> x86_64 GNU/Linux
>
> Compiler: Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a

***
Output:

[marvin:21997] *** An error occurred in MPI_Bcast
[marvin:21997] *** on communicator MPI_COMM_WORLD
[marvin:21997] *** MPI_ERR_TYPE: invalid datatype
[marvin:21997] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
 Process 1 : k= 10 before
--
mpiexec has exited due to process rank 1 with PID 21997 on
node marvin exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--
[marvin:21993] 3 more processes have sent help message help-mpi-errors.txt /
mpi_errors_are_fatal
[marvin:21993] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
 Process 3 : k= 10 before

Fortran 90 program:

      include 'mpif.h'

      integer k, rank, size, ierror, tag, p

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      if (rank == 0) then
         k = 20
      else
         k = 10
      end if
      do p = 0, size, 1
         if (rank == p) then
            print*, 'Process', p, ': k=', k, 'before'
         end if
      enddo
      call MPI_Bcast(k, 1, MPI_INT, 0, MPI_COMM_WORLD)
      do p = 0, size, 1
         if (rank == p) then
            print*, 'Process', p, ': k=', k, 'after'
         end if
      enddo
      call MPI_Finalize(ierror)

      end

C program:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int k, id, p, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (id == 0)
        k = 20;
    else
        k = 10;
    for (p = 0; p < size; p++) {
        if (id == p)
            printf("Process %d: k= %d before\n", id, k);
    }
    //note MPI_Bcast must be put where all other processes
    //can see it.
    MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
    for (p = 0; p < size; p++) {
        if (id == p)
            printf("Process %d: k= %d after\n", id, k);
    }
    MPI_Finalize();
    return 0;
}
Re: [OMPI users] GM + OpenMPI bug ...
Hi,

We have used lspci -vvxxx and we have obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
bi00: Subsystem: Intel Corporation PRO/1000 XT Server Adapter
bi00: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR-
bi00: Latency: 64 (63750ns min), Cache Line Size: 64 bytes
bi00: Interrupt: pin A routed to IRQ 185
bi00: Region 0: Memory at fe9e (64-bit, non-prefetchable) [size=128K]
bi00: Region 2: Memory at fe9d (64-bit, non-prefetchable) [size=64K]
bi00: Region 4: I/O ports at dc80 [size=32]
bi00: Expansion ROM at fe9c [disabled] [size=64K]
bi00: Capabilities: [dc] Power Management version 2
bi00: Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
bi00: Status: D0 PME-Enable- DSel=0 DScale=1 PME-
bi00: Capabilities: [e4] PCI-X non-bridge device
bi00: Command: DPERE- ERO+ RBC=512 OST=1
bi00: Status: Dev=04:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
bi00: Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
bi00: Address: Data:
bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00
bi00: 10: 04 00 9e fe 00 00 00 00 04 00 9d fe 00 00 00 00
bi00: 20: 81 dc 00 00 00 00 00 00 00 00 00 00 86 80 07 11
bi00: 30: 00 00 9c fe dc 00 00 00 00 00 00 00 05 01 ff 00
bi00: 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 e4 22 48
bi00: e0: 00 20 00 40 07 f0 02 00 08 04 43 04 00 00 00 00
bi00: f0: 05 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00

We don't know how to interpret this information. We suppose that SERR and
PERR are not activated, if we have understood the Status line
("... >TAbort- SERR- ...") correctly. Could you confirm that? If this is the
case, could you indicate how to activate them?

Additionally, if you know any software tool or methodology to check the
hardware/software, please could you send us how to do it?

Thanks in advance. Best regards,

José I. Aliaga

On 20/05/2010, at 16:29, Patrick Geoffray wrote:

> Hi Jose,
>
> On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
>> I think that I have found a bug in the implementation of the GM
>> collective routines included in OpenMPI. The version of the GM software
>> is 2.0.30 for the PCI64 cards.
>>
>> I obtain the same problems when I use the 1.4.1 or the 1.4.2 version.
>>
>> Could you help me? Thanks.
>
> We have been running the test you provided on 8 nodes for 4 hours and
> haven't seen any errors. The setup used GM 2.0.30 and openmpi 1.4.2 on
> PCI-X cards (M3F-PCIXD-2 aka 'D' cards). We do not have PCI64 NICs anymore,
> and no machines with a PCI 64/66 slot.
>
> One-bit errors are rarely a software problem; they are usually linked to
> hardware corruption. Old PCI has a simple parity check, but most
> machines/BIOS of this era ignored reported errors. You may want to check
> the lspci output on your machines and see if SERR or PERR is set.
> You can also try to reset each NIC in its PCI slot, or use a different
> slot if available.
>
> Hope it helps.
>
> Patrick
> --
> Patrick Geoffray
> Myricom, Inc.
> http://www.myri.com
[OMPI users] [sge::tight-integration] slot scheduling and resources handling
Hi there,

I'm observing something strange on our cluster managed by SGE 6.2u4 when
launching a parallel computation on several nodes, using OpenMPI/SGE
tight-integration mode (OpenMPI-1.3.3). It seems that the SGE allocated slots
are not used by OpenMPI, as if OpenMPI was doing its own round-robin
allocation based on the allocated node hostnames.

Here is what I'm doing:
- launch a parallel computation involving 8 processors, using for each of
  them 14GB of memory. I'm using a qsub command where I request the
  memory_free resource and use tight integration with openmpi
- 3 servers are available:
  . barney with 4 cores (4 slots) and 32GB
  . carl with 4 cores (4 slots) and 32GB
  . charlie with 8 cores (8 slots) and 64GB

Here is the output of the allocated nodes (OpenMPI output):

== ALLOCATED NODES ==

 Data for node: Name: charlie     Launch id: -1  Arch: ffc91200  State: 2
   Daemon: [[44332,0],0]  Daemon launched: True
   Num slots: 4  Slots in use: 0
   Num slots allocated: 4  Max slots: 0
   Username on node: NULL
   Num procs: 0  Next node_rank: 0
 Data for node: Name: carl.fft    Launch id: -1  Arch: 0  State: 2
   Daemon: Not defined  Daemon launched: False
   Num slots: 2  Slots in use: 0
   Num slots allocated: 2  Max slots: 0
   Username on node: NULL
   Num procs: 0  Next node_rank: 0
 Data for node: Name: barney.fft  Launch id: -1  Arch: 0  State: 2
   Daemon: Not defined  Daemon launched: False
   Num slots: 2  Slots in use: 0
   Num slots allocated: 2  Max slots: 0
   Username on node: NULL
   Num procs: 0  Next node_rank: 0

=

Here is what I see when my computation is running on the cluster:

# rank   pid    hostname
  0      28112  charlie
  1      11417  carl
  2      11808  barney
  3      28113  charlie
  4      11418  carl
  5      11809  barney
  6      28114  charlie
  7      11419  carl

Note that the parallel environment used under SGE is defined as:

[eg@moe:~]$ qconf -sp round_robin
pe_name            round_robin
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

I'm wondering why OpenMPI didn't use the allocated nodes chosen by SGE (cf.
the "ALLOCATED NODES" report) but instead allocated the processes of the
parallel computation one at a time, using a round-robin method.

Note that I'm using the '--bynode' option on the orterun command line. If the
behavior I'm observing is simply the consequence of using this option, please
let me know. This would mean that SGE tight integration has a lower priority
on orterun behavior than the command-line options.

Any help would be appreciated.
Thanks,
Eloi

--

Eloi Gaudry

Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM

Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626
Re: [OMPI users] Using a rankfile for ompi-restart
On Tue, May 18, 2010 at 3:53 PM, Josh Hursey wrote:
>> I've noticed that ompi-restart doesn't support the --rankfile option.
>> It only supports --hostfile/--machinefile. Is there any reason
>> --rankfile isn't supported?
>>
>> Suppose you have a cluster without a shared file system. When one node
>> fails, you transfer its checkpoint to a spare node and invoke
>> ompi-restart. In 1.5, ompi-restart automagically handles this
>> situation (if you supply a hostfile) and is able to restart the
>> process, but I'm afraid it might not always be able to find the
>> checkpoints this way. If you could specify to ompi-restart where the
>> ranks are (and thus where the checkpoints are), then maybe restart
>> would always work (as long as you've specified the location of the
>> checkpoints correctly), or maybe ompi-restart would be faster.
>
> We can easily add the --rankfile option to ompi-restart. I filed a ticket
> to add this option, and assess if there are other options that we should
> pass along (e.g., --npernode, --byhost). I should be able to fix this in
> the next week or so, but the ticket is linked below so you can follow the
> progress.
> https://svn.open-mpi.org/trac/ompi/ticket/2413

Nice, thanks!

> Most of the ompi-restart parameters are passed directly to the mpirun
> command. ompi-restart is mostly a wrapper around mpirun that is able to
> parse the metadata and create the appcontext file. I wonder if a more
> general parameter like '--mpirun-args ...' might make sense so users don't
> have to wait on me to expose the interface they need.
>
> Dunno. What do you think? Should I create a '--mpirun-args' option,
> duplicate all of the mpirun command line parameters, or some combination
> of the two?

Well, I think an --mpirun-args argument would be even more useful, as it's
hard to foresee how ompi-restart is going to be used. Maybe a combination of
the two would be ideal, since some options are going to be used very often
(e.g., --hostfile, --hosts, etc.).

Regards,
Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling
Hi,

On 21.05.2010, at 14:11, Eloi Gaudry wrote:

> I'm observing something strange on our cluster managed by SGE 6.2u4 when
> launching a parallel computation on several nodes, using OpenMPI/SGE
> tight-integration mode (OpenMPI-1.3.3). It seems that the SGE allocated
> slots are not used by OpenMPI, as if OpenMPI was doing its own round-robin
> allocation based on the allocated node hostnames.

You compiled Open MPI with --with-sge (and recompiled your applications)?
You are using the correct mpiexec?

-- Reuti
Re: [OMPI users] An error occured in MPI_Bcast; MPI_ERR_TYPE: invalid datatype
Pankatz, Klaus wrote:
> Hi folks,
>
> openMPI 1.4.1 seems to have another problem with my machine, or something
> on it. This little program here (compiled with mpif90), started with
> mpiexec -np 4 a.out, produces the following output. Surprisingly, the same
> thing written in C code (compiled with mpiCC) works without a problem.

Not so surprising, since it's C code! For Fortran: MPI_INT -> MPI_INTEGER,
and add an ierror argument to your MPI_Bcast call (the way you have for
MPI_Comm_rank/size).
Re: [OMPI users] An error occured in MPI_Bcast; MPI_ERR_TYPE: invalid datatype
Your Fortran call to 'mpi_bcast' needs a status (ierror) parameter at the end
of the argument list. Also, I don't think 'MPI_INT' is correct for Fortran;
it should be 'MPI_INTEGER'. With these changes the program works OK.

T. Rosmond

On Fri, 2010-05-21 at 11:40 +0200, Pankatz, Klaus wrote:
> Hi folks,
>
> openMPI 1.4.1 seems to have another problem with my machine, or something
> on it. This little program here (compiled with mpif90), started with
> mpiexec -np 4 a.out, produces the following output. Surprisingly, the same
> thing written in C code (compiled with mpiCC) works without a problem.
> [...]
> call MPI_Bcast(k, 1, MPI_INT,0,MPI_COMM_WORLD)
> [...]
Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling
Hi Reuti,

Yes, the openmpi binaries used were built after having used the --with-sge
option during configure, and we only use those binaries on our cluster.

[eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
Package: Open MPI root@moe Distribution
Open MPI: 1.3.3
Open MPI SVN revision: r21666
Open MPI release date: Jul 14, 2009
Open RTE: 1.3.3
Open RTE SVN revision: r21666
Open RTE release date: Jul 14, 2009
OPAL: 1.3.3
OPAL SVN revision: r21666
OPAL release date: Jul 14, 2009
Ident string: 1.3.3
Prefix: /opt/openmpi-1.3.3
Configured architecture: x86_64-unknown-linux-gnu
Configure host: moe
Configured by: root
Configured on: Tue Nov 10 11:19:34 CET 2009
Configure host: moe
Built by: root
Built on: Tue Nov 10 11:28:14 CET 2009
Built host: moe
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: yes
Thread support: posix (mpi: no, progress: no)
Sparse Groups: no
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol visibility support: yes
FT Checkpoint support: no (checkpoint thread: no)
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: gm (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling
Hi,

On 21.05.2010, at 17:19, Eloi Gaudry wrote:

> Hi Reuti,
>
> Yes, the openmpi binaries used were built after having used the --with-sge
> option during configure, and we only use those binaries on our cluster.
>
> [eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

Ok. As you have a Tight Integration as the goal and set "control_slaves TRUE"
in your PE, SGE wouldn't allow `qrsh -inherit ...` to nodes which are not in
the list of granted nodes. So it looks like your job is running outside of
this Tight Integration, with its own `rsh` or `ssh`.

Do you reset $JOB_ID or other environment variables in your job script, which
could trigger Open MPI to assume that it's not running inside SGE?

-- Reuti
Re: [OMPI users] GM + OpenMPI bug ...
Hi Jose,

On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
> We have used lspci -vvxxx and we have obtained:
>
> bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
> Ethernet Controller (Copper) (rev 02)

This is the output for the Intel GigE NIC; you should look at the one for the
Myricom NIC and the PCI bridge above it (lspci -t to see the tree).

> bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR-

A PERR- status means no parity error was detected when receiving data.
Looking at the PERR status of the PCI bridge on the other side will show if
there was any corruption on that bus.

As a first step, you can see if you can reproduce errors with a simple test
involving a single node at a time. You can run "gm_allsize --verify" on each
machine: it will send packets to itself (loopback in the switch) and check
for corruption. If you don't see errors after a while, that node is probably
clean. If you see errors, you can look deeper at the lspci output to see if
it's a PCI problem. If you are using a riser card, you can try without it.

I am not sure if openMPI has an option to enable a debug checksum, but it
would also be useful to see if it detects anything.

> Additionally, if you know any software tool or methodology to check the
> hardware/software, please, could you send us how to do it?

You may want to look at the FAQ on GM troubleshooting:
http://www.myri.com/cgi-bin/fom.pl?file=425

Additionally, you can send email to h...@myri.com to open a ticket.

Patrick
Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t
Hello,

I am resending this because I am not sure if it was sent out to the OMPI
list. Any help would be greatly appreciated.

best
Michael

On 05/19/10 13:19, Michael E. Thomadakis wrote:

Hello,

I would like to build OMPI V1.4.2 and make it available to our users at the
Supercomputing Center at Texas A&M Univ. Our system is a 2-socket, 4-core
Nehalem @ 2.8GHz, 24GiB DRAM / node, 324 nodes connected to a 4xQDR Voltaire
fabric, CentOS/RHEL 5.4.

I have been trying to find the following information:

1) High-resolution timers: how do I specify the HRT Linux timers in the
   --with-timer=TYPE line of ./configure?

2) I have installed BLCR V0.8.2, but when I try to build OMPI and point to
   the full installation, it complains it cannot find it. Note that I built
   BLCR with GCC but I am building OMPI with the Intel compilers (V11.1).

3) Does OMPI by default use SHM for intra-node message IPC but revert to IB
   for inter-node?

4) How could I select the high-speed transport, say DAPL or OFED IB verbs?
   Is there any preference as to the specific high-speed transport over
   Mellanox/Voltaire QDR IB?

5) When we launch MPI jobs via PBS/TORQUE, do we have control over the task
   and thread placement on nodes/cores?

6) Can we suspend/restart OMPI jobs cleanly with the above scheduler? Any
   caveats on suspension / resumption of OMPI jobs?

7) Do you have any performance data comparing OMPI vs., say, MVAPICH2 and
   Intel MPI? This is not a political issue, since I am going to be providing
   all these MPI stacks to our users (Intel MPI V4.0 already installed).

Thank you so much for the great s/w ...

best
Michael

--
Michael E. Thomadakis, Ph.D.       Senior Lead Supercomputer Engineer/Res
E-mail: miket AT tamu DOT edu      Texas A&M University
web:    http://alphamike.tamu.edu  Supercomputing Center
Voice:  979-862-3931               Teague Research Center, 104B
FAX:    979-847-8643               College Station, TX 77843, USA
Re: [OMPI users] General question on the implementation of a "scheduler" on client side...
On May 21, 2010, at 3:13 AM, Olivier Riff wrote:

> -> That is what I was thinking of implementing. As you mentioned, and
> specifically for my case where I mainly send short messages, there might
> not be much win. By the way, are there any benchmarks testing sequential
> MPI_Isend versus MPI_Bcast, for instance? The aim is to determine above
> which buffer size an MPI_Bcast is worth using in my case. You can answer
> that the test is easy to do and that I can run it myself :)

"It depends". :-) You're probably best off writing a benchmark yourself that
mirrors what your application is going to do.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
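A minimal sketch of such a benchmark (not from the original thread) could
look like the following: it times a loop of individual sends from rank 0 to
every other rank against an MPI_Bcast of the same payload, for a range of
buffer sizes. The sizes, the repetition count, and the MPI_BYTE payload are
arbitrary placeholders; a real benchmark should mirror the application's
actual traffic pattern.

/* Hypothetical micro-benchmark: "rank 0 sends the same buffer individually
 * to every other rank" vs. MPI_Bcast, for several message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int reps = 100;   /* arbitrary repetition count */
    for (int bytes = 1; bytes <= (1 << 20); bytes *= 4) {
        char *buf = malloc(bytes);
        MPI_Request *req = malloc((size - 1) * sizeof *req);

        /* individual non-blocking sends from rank 0 to everyone else */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                for (int dst = 1; dst < size; dst++)
                    MPI_Isend(buf, bytes, MPI_BYTE, dst, 0,
                              MPI_COMM_WORLD, &req[dst - 1]);
                MPI_Waitall(size - 1, req, MPI_STATUSES_IGNORE);
            } else {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
        double t_isend = (MPI_Wtime() - t0) / reps;

        /* the same payload as a broadcast */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Bcast(buf, bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
        double t_bcast = (MPI_Wtime() - t0) / reps;

        if (rank == 0)
            printf("%8d bytes: isend %.6f s  bcast %.6f s\n",
                   bytes, t_isend, t_bcast);

        free(req);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 8 ./bench; where the crossover point lies (if
there is one) depends on the interconnect, the collective implementation in
use, and the number of ranks, so the numbers are only meaningful on the
actual cluster.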