[OMPI users] has anybody used the Intel Thread Checker w/OpenMPI?
I'm interested in getting OpenMPI working with a multi-threaded application (MPI_THREAD_MULTIPLE is required). I'm trying the trunk from a couple of weeks ago (1.3a1r14001), compiled for multi-threading and threaded progress, and have had success with some small cases. Larger cases with the same algorithms fail (they work with MPICH2 1.0.5/TCP and other thread-safe MPIs, so I don't think it is an application bug). I don't mind doing a little work to track down the problem, so I'm trying to use the Intel Thread Checker. I have the Thread Checker working with my application when using Intel's MPI, but with OpenMPI it hangs. OpenMPI is compiled for OFED 1.1, but I'm overriding communications with "-gmca btl self,tcp" in the hope that OpenMPI won't do anything funky that would cause the Thread Checker problems (like RDMA or writes from other processes into shared memory segments). Has anybody used the Intel Thread Checker with OpenMPI successfully?

Thanks,
Curt

--
Curtis Janssen, clja...@sandia.gov, +1 925-294-1509
Sandia National Labs, MS 9158, PO Box 969, Livermore, CA 94551, USA
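For reference, the thread-support negotiation Curt refers to looks like the following in a minimal C program: request MPI_THREAD_MULTIPLE and check what the library actually grants. This is a generic sketch, not Curt's application.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Ask for full multi-threaded support; the library reports what it grants. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got level %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... application threads may now call MPI concurrently ... */

        MPI_Finalize();
        return 0;
    }

A build configured without thread support will typically return a lower level such as MPI_THREAD_SINGLE here, which is worth checking before chasing hangs elsewhere.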
[OMPI users] error in MPI_Waitall
Hi, I am trying to run an MPICH2 application over 2 processors on a dual processor x64 Linux box (SuSE 10). I am getting the following error message:

--
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..: MPI_Waitall(count=2, req_array=0x5bbda70, status_array=0x7fff461d9ce0) failed
MPIDI_CH3_Progress_wait(212)..: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 2 Demeter_18432 caused collective abort of all ranks
exit status of rank 0: killed by signal 11
--

The "cpi" example that comes with MPICH2 executes correctly. I am using MPICH2-1.0.5p2 which I compiled from source. Does anyone know what the problem is?

cheers
steve
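For context, MPI_Waitall completes a set of outstanding non-blocking requests; the failing call in the stack above has count=2. Below is a minimal, illustrative sketch of the usual Irecv/Isend + MPI_Waitall pattern. It is not Steve's application code, just an example of the call that is failing.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, sendbuf, recvbuf;
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;        /* send to the next rank        */
        int prev = (rank + size - 1) % size; /* receive from the previous one */

        sendbuf = rank;
        MPI_Irecv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &req[1]);

        /* Completes both outstanding requests; this is the call that fails in
           the error stack above when the peer's socket is reset. */
        MPI_Waitall(2, req, stat);

        printf("rank %d received %d from rank %d\n", rank, recvbuf, prev);
        MPI_Finalize();
        return 0;
    }

The "connection reset by peer" in the stack usually means the other rank died (here apparently with signal 11) rather than that MPI_Waitall itself was misused.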
Re: [OMPI users] has anybody used the Intel Thread Checker w/OpenMPI?
Hello Curtis,

yes, done with ompi-trunk. Apart from --enable-mpi-threads --enable-progress-threads, you need to compile Open MPI with --enable-mca-no-build=memory-ptmalloc2, and of course the usual options for debugging (--enable-debug) and the options for icc/ifort/icpc:

  CFLAGS='-debug all -inline-debug-info -tcheck'
  CXXFLAGS='-debug all -inline-debug-info -tcheck'
  FFLAGS='-debug all -tcheck'
  LDFLAGS='-tcheck'

Then, as you already noted, run the application with --mca btl tcp,sm,self:

  mpirun --mca btl tcp,sm,self -np 2 \
    tcheck_cl \
      --reinstrument \
      -u all \
      -c \
      -d '/tmp/hpcraink_$$__tc_cl_cache' \
      -f html \
      -o 'tc_mpi_test_suite_$$.html' \
      -p 'file=tc_mpi_test_suite_%H_%I, \
          pad=128, \
          delay=2, \
          stall=2' \
      -- \
      ./mpi_test_suite -j 2 -r FULL -t 'Ring Ibsend' -d MPI_INT

The --reinstrument option is not really necessary, nor is setting the padding and the delay for thread startup; shortening the stall delay to 2 seconds also does not trigger any deadlocks. This was with icc-9.1 and itt-3.0 23205.

Hope this helps,
Rainer

On Friday 23 March 2007 05:22, Curtis Janssen wrote:
> I'm interested in getting OpenMPI working with a multi-threaded
> application (MPI_THREAD_MULTIPLE is required). I'm trying the trunk
> from a couple weeks ago (1.3a1r14001) compiled for multi-threading and
> threaded progress, and have had success with some small cases. Larger
> cases with the same algorithms fail (they work with MPICH2 1.0.5/TCP and
> other thread-safe MPIs, so I don't think it is an application bug). I
> don't mind doing a little work to track down the problem, so I'm trying
> to use the Intel Thread Checker. I have the thread checker working with
> my application when using Intel's MPI, but with OpenMPI it hangs.
> OpenMPI is compiled for OFED 1.1, but I'm overriding communications with
> "-gmca btl self,tcp" in the hope that OpenMPI won't do anything funky
> that would cause the thread checker problems (like RDMA or writes from
> other processes into shared memory segments). Has anybody used the
> Intel Thread Checker with OpenMPI successfully?
>
> Thanks,
> Curt

--
Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49 (0)711-685 6 5858   Fax: ++49 (0)711-685 6 5832
POSTAL: Nobelstrasse 19       email: kel...@hlrs.de
ACTUAL: Allmandring 30, R.O.030   AIM: rusraink
70550 Stuttgart
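For readers who have not seen mpi_test_suite, the 'Ring Ibsend' test name refers to a ring exchange built on buffered non-blocking sends. The sketch below only illustrates that generic pattern; it is not the mpi_test_suite implementation, whose internals are not shown here.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, token, incoming, bufsize;
        void *attached, *detached;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI_Ibsend needs user-attached buffer space: message size plus overhead. */
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        attached = malloc(bufsize);
        MPI_Buffer_attach(attached, bufsize);

        token = rank;
        /* Pass a token around the ring: buffered non-blocking send to the next
           rank, receive from the previous one, then complete the send request. */
        MPI_Ibsend(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(&incoming, 1, MPI_INT, (rank + size - 1) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Buffer_detach(&detached, &bufsize);
        free(detached);

        printf("rank %d got token %d\n", rank, incoming);
        MPI_Finalize();
        return 0;
    }

A multi-threaded variant of this kind of loop (the -j 2 above runs two threads) is exactly the sort of code path a tool like Thread Checker instruments.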
Re: [OMPI users] error in MPI_Waitall
Steve,

This list is for supporting Open MPI, not MPICH2 (MPICH2 is an entirely different software package). You should probably redirect your question to their support lists.

Thanks,
Tim

On Mar 23, 2007, at 12:46 AM, Jeffrey Stephen wrote:

> Hi, I am trying to run an MPICH2 application over 2 processors on a dual processor x64 Linux box (SuSE 10). I am getting the following error message:
>
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..: MPI_Waitall(count=2, req_array=0x5bbda70, status_array=0x7fff461d9ce0) failed
> MPIDI_CH3_Progress_wait(212)..: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(413):
> MPIDU_Socki_handle_read(633)..: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
> rank 0 in job 2 Demeter_18432 caused collective abort of all ranks
> exit status of rank 0: killed by signal 11
>
> The "cpi" example that comes with MPICH2 executes correctly. I am using MPICH2-1.0.5p2 which I compiled from source. Does anyone know what the problem is?
>
> cheers
> steve
[OMPI users] Problems compiling openmpi 1.2 under AIX 5.2
Hi guys

I'm having problems compiling openmpi 1.2 under AIX 5.2. Here are the configure parameters:

./configure --disable-shared --enable-static \
    CC=xlc CXX=xlc++ F77=xlf FC=xlf95

To get it to work I have to do 2 changes:

diff -r openmpi-1.2/ompi/mpi/cxx/mpicxx.cc openmpi-1.2-aix/ompi/mpi/cxx/mpicxx.cc
34a35,38
> #undef SEEK_SET
> #undef SEEK_CUR
> #undef SEEK_END
>
diff -r openmpi-1.2/orte/mca/pls/poe/pls_poe_module.c openmpi-1.2-aix/orte/mca/pls/poe/pls_poe_module.c
636a637,641
> static int pls_poe_cancel_operation(void)
> {
>     return ORTE_ERR_NOT_IMPLEMENTED;
> }

This last one means that when you run OpenMPI jobs through POE you get a:

[r1blade003:381130] [0,0,0] ORTE_ERROR_LOG: Not implemented in file errmgr_hnp.c at line 90
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value Not implemented instead of ORTE_SUCCESS.
--

at the job end.

Keep up the good work,
cheers,
Ricardo

---
Prof. Ricardo Fonseca
GoLP - Grupo de Lasers e Plasmas
Centro de Física dos Plasmas
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal
tel: +351 21 8419202
fax: +351 21 8464455
web: http://cfp.ist.utl.pt/golp/
Re: [OMPI users] segfault with netpipe & ompi 1.2 + MX (32bit only)
Nicolas Niclausse wrote on 21.03.2007 16:45:
> I'm trying to use netpipe with openmpi on my system (rhel 3, dual opteron,
> myrinet 2G with MX drivers).
>
> Everything is fine when I use a 64bit binary, but it segfaults when I use a
> 32 bit binary:

I rebuilt everything with PGI 6.2 instead of 6.0 and everything is working as expected now.

--
Nicolas NICLAUSSE
Service DREAM
INRIA Sophia Antipolis
http://www-sop.inria.fr/
Re: [OMPI users] quadrics
I can volunteer myself as a beta-tester if that's OK. If there is anything specific you want help with, either drop me a mail directly or mail supp...@quadrics.com. We are not aware of any other current project of this nature.

Ashley,

On Mon, 2007-03-19 at 18:48 -0400, George Bosilca wrote:
> UTK is working on Quadrics support. Right now, we have an embryo of
> Quadrics support. The work is still in progress. I can let you know
> as soon as we have something that pass most of our test, and we are
> confident enough to give it to beta-testers.
>
> Thanks,
> george.
>
> On Mar 18, 2007, at 11:07 PM, Robin Humble wrote:
> >
> > does OpenMPI support Quadrics elan3/4 interconnects?
> >
> > I saw a few hits on google suggesting that support was partial or maybe
> > planned, but couldn't find much in the openmpi sources to suggest any
> > support at all.
> >
> > cheers,
> > robin
Re: [OMPI users] Cell EIB support for OpenMPI
Marcus G. Daniels wrote:

Mike Houston wrote:
The main issue with this, and addressed at the end of the report, is that the code size is going to be a problem as data and code must live in the same 256KB in each SPE. They mention dynamic overlay loading, which is also how we deal with large code size, but things get tricky and slow with the potentially needed save and restore of registers and LS.

I did some checking on this. Apparently the trunk of GCC and the latest GNU Binutils handle overlays. Because the SPU compiler knows of its limited address space, the ELF object code sections reflect this, and the linker can transparently generate stubs to trigger the loading. GCC also has options like -ffunction-sections that enable the linker to optimize for locality. So even though the OpenMPI shared libraries in total appear to have a footprint about four times too big for code alone (don't know about the typical stack & heap requirements), perhaps it's still doable without a big effort to strip down OpenMPI?
[OMPI users] Failure to launch on a remote node. SSH problem?
I am presently trying to get OpenMPI up and running on a small cluster of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using the Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on a remote node, orted starts on the remote node but then times out. I am guessing that the problem is SSH related. Any thoughts?

Thanks,
Dave

Details: I am using SSH, set up as outlined in the FAQ, using ssh-agent to allow passwordless logins. The paths for all the libraries appear to be OK. A simple MPI code (Hello_World_Fortran) launched on node01 will run OK for up to four processors (all on node01). The output is shown here.

node01 1247% mpirun --debug-daemons -hostfile machinefile -np 4 Hello_World_Fortran
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
Fortran version of Hello World, rank2
Rank 0 is present in Fortran version of Hello World.
Fortran version of Hello World, rank3
Fortran version of Hello World, rank1

For five processors mpirun tries to start an additional process on node03. Everything launches the same on node01 (four instances of Hello_World_Fortran are launched). On node03, orted starts, but times out after 10 seconds and the output below is generated.

node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send() failed with errno=57
[node01.local:21427] ERROR: A daemon on node node03 failed to start as expected.
[node01.local:21427] ERROR: There may be more information available from
[node01.local:21427] ERROR: the remote shell (see above).
[node01.local:21427] ERROR: The daemon exited unexpectedly with status 255.
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)

Here is the ompi_info output:

node01 1248% ompi_info --all
Open MPI: 1.1.2
Open MPI SVN revision: r12073
Open RTE: 1.1.2
Open RTE SVN revision: r12073
OPAL: 1.1.2
OPAL SVN revision: r12073
MCA memory: darwin (MCA v1.0, API v1.0, Component v1.1.2)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
MCA timer: darwin (MCA v1.0, API v1.0, Component v1.1.2)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
MCA ras: xgrid (MCA v1.0, API v1.0, Component v1.1.2)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
Re: [OMPI users] Cell EIB support for OpenMPI
Marcus G. Daniels wrote:

Marcus G. Daniels wrote:
Mike Houston wrote:
The main issue with this, and addressed at the end of the report, is that the code size is going to be a problem as data and code must live in the same 256KB in each SPE. They mention dynamic overlay loading, which is also how we deal with large code size, but things get tricky and slow with the potentially needed save and restore of registers and LS.

I did some checking on this. Apparently the trunk of GCC and the latest GNU Binutils handle overlays. Because the SPU compiler knows of its limited address space, the ELF object code sections reflect this, and the linker can transparently generate stubs to trigger the loading. GCC also has options like -ffunction-sections that enable the linker to optimize for locality. So even though the OpenMPI shared libraries in total appear to have a footprint about four times too big for code alone (don't know about the typical stack & heap requirements), perhaps it's still doable without a big effort to strip down OpenMPI?

But loading an overlay can be quite expensive depending on how much needs to be loaded and how much user data/code needs to be restored. If the user is trying to use most of the LS for data, which is perfectly sane and reasonable, then you might have to load multiple overlays to complete a function. We've also been having issues with mixing manual overlay loading of our code with the autoloading generated by the compiler. Regardless, it would be interesting to see if this can even be made to work. If so, it might really help people get apps up on Cell since it can be reasonably thought of as a cluster on a chip, backed by a larger address space.

-Mike
Re: [OMPI users] MPI processes swapping out
Todd:

I assume the system time is being consumed by the calls to send and receive data over the TCP sockets. As the number of processes in the job increases, more time is spent waiting for data from one of the other processes.

I did a little experiment on a single node to see the difference in system time consumed when running over TCP vs when running over shared memory. When running on a single node and using the sm btl, I see almost 100% user time. I assume this is because the sm btl handles sending and receiving its data within a shared memory segment. However, when I switch over to TCP, I see my system time go up. Note that this is on Solaris.

RUNNING OVER SELF,SM
> mpirun -np 8 -mca btl self,sm hpcc.amd64
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
3505 rolfv 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 75 182 0 hpcc.amd64/1
3503 rolfv 100 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0 69 116 0 hpcc.amd64/1
3499 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0 106 236 0 hpcc.amd64/1
3497 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 169 200 0 hpcc.amd64/1
3501 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 1.9 0 127 158 0 hpcc.amd64/1
3507 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0 244 200 0 hpcc.amd64/1
3509 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0 282 212 0 hpcc.amd64/1
3495 rolfv 97 0.0 0.0 0.0 0.0 0.0 0.0 3.2 0 237 98 0 hpcc.amd64/1

RUNNING OVER SELF,TCP
> mpirun -np 8 -mca btl self,tcp hpcc.amd64
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
4316 rolfv 93 6.9 0.0 0.0 0.0 0.0 0.0 0.2 5 346 .6M 0 hpcc.amd64/1
4328 rolfv 91 8.4 0.0 0.0 0.0 0.0 0.0 0.4 3 59 .15 0 hpcc.amd64/1
4324 rolfv 98 1.1 0.0 0.0 0.0 0.0 0.0 0.7 2 270 .1M 0 hpcc.amd64/1
4320 rolfv 88 12 0.0 0.0 0.0 0.0 0.0 0.8 4 244 .15 0 hpcc.amd64/1
4322 rolfv 94 5.1 0.0 0.0 0.0 0.0 0.0 1.3 2 150 .2M 0 hpcc.amd64/1
4318 rolfv 92 6.7 0.0 0.0 0.0 0.0 0.0 1.4 5 236 .9M 0 hpcc.amd64/1
4326 rolfv 93 5.3 0.0 0.0 0.0 0.0 0.0 1.7 7 117 .2M 0 hpcc.amd64/1
4314 rolfv 91 6.6 0.0 0.0 0.0 0.0 1.3 0.9 19 150 .10 0 hpcc.amd64/1

I also ran HPL over a larger cluster of 6 nodes, and noticed even higher system times.

And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2 procs per node, using Sun HPC ClusterTools 6, and saw about a 50/50 split between user and system time.

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
11525 rolfv 55 44 0.1 0.0 0.0 0.0 0.1 0.4 76 960 .3M 0 maxtrunc_ct6/1
11526 rolfv 54 45 0.0 0.0 0.0 0.0 0.0 1.0 0 362 .4M 0 maxtrunc_ct6/1

Is it possible that everything is working just as it should?

Rolf

Heywood, Todd wrote on 03/22/07 13:30:

Ralph,

Well, according to the FAQ, aggressive mode can be "forced", so I did try setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning processor/memory affinity on. Effects were minor. The MPI tasks still cycle between run and sleep states, driving up system time well over user time.

Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be sure, I also tried running directly with a hostfile with slots=4 or slots=2. The same behavior occurs.

This behavior is a function of the size of the job. I.e., as I scale from 200 to 800 tasks the run/sleep cycling increases, so that system time grows from maybe half the user time to maybe 5 times user time.

This is for TCP/gigE.

Todd

On 3/22/07 12:19 PM, "Ralph Castain" wrote:

Just for clarification: ompi_info only shows the *default* value of the MCA parameter. In this case, mpi_yield_when_idle defaults to aggressive, but that value is reset internally if the system sees an "oversubscribed" condition.

The issue here isn't how many cores are on the node, but rather how many were specifically allocated to this job. If the allocation wasn't at least 2 (in your example), then we would automatically reset mpi_yield_when_idle to be non-aggressive, regardless of how many cores are actually on the node.

Ralph

On 3/22/07 7:14 AM, "Heywood, Todd" wrote:

Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a 4-core node, the 2 tasks are still cycling between run and sleep, with higher system time than user time. Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), so that suggests the tasks aren't swapping out on blocking calls. Still puzzled.

Thanks,
Todd

On 3/22/07 7:36 AM, "Jeff Squyres" wrote:

Are you using a scheduler on your system? More specifically, does Open MPI know that you have four process slots on each node? If you are using a hostfile and didn't specify "slots=4" for each host, Open MPI will think that it's oversubscribing and will therefore call sched_yield() in the depths of its progress engine.

On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
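The user-versus-system-time comparison Rolf describes above can be reproduced with any small communication-bound program run once over each btl while watching prstat or top. Below is a minimal illustrative ping-pong for that purpose; it is not the hpcc benchmark used for the measurements above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, buf = 0;
        const int iters = 100000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ranks 0 and 1 bounce a small message back and forth; with the sm btl
           the time shows up mostly as user time (polling shared memory), while
           with tcp a larger share appears as system time in socket calls. */
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) printf("done\n");
        MPI_Finalize();
        return 0;
    }

Run it as, for example, "mpirun -np 2 -mca btl self,sm ./pingpong" and again with "-mca btl self,tcp" (the binary name here is arbitrary) to see how the user/system split changes between the two transports.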
Re: [OMPI users] MPI processes swapping out
Rolf,

> Is it possible that everything is working just as it should?

That's what I'm afraid of :-). But I did not expect to see such communication overhead due to blocking from mpiBLAST, which is very coarse-grained. I then tried HPL, which is computation-heavy, and found the same thing.

Also, the system time seemed to correspond to the MPI processes cycling between run and sleep (as seen via top), and I thought that setting the mpi_yield_when_idle parameter to 0 would keep the processes from entering sleep state when blocking. But it doesn't.

Todd

On 3/23/07 2:06 PM, "Rolf Vandevaart" wrote:

> Todd:
>
> I assume the system time is being consumed by the calls to send and receive
> data over the TCP sockets. As the number of processes in the job increases,
> more time is spent waiting for data from one of the other processes.
>
> I did a little experiment on a single node to see the difference in system
> time consumed when running over TCP vs when running over shared memory.
> When running on a single node and using the sm btl, I see almost 100% user
> time. I assume this is because the sm btl handles sending and receiving its
> data within a shared memory segment. However, when I switch over to TCP, I
> see my system time go up. Note that this is on Solaris.
[OMPI users] install error
To ALL

I am getting the following error while attempting to install openmpi on a Linux system:

Linux utahwtm.hydropoint.com 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux

with the Intel compilers (the latest versions of 9.1). This is the error:

libtool: link: icc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic -pthread ../../../opal/.libs/libopen-pal.a -lnsl -lutil
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x1d): In function `munmap':
: undefined reference to `__munmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x52): In function `opal_mem_free_ptmalloc2_munmap':
: undefined reference to `__munmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x66): In function `mmap':
: undefined reference to `__mmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x8d): In function `opal_mem_free_ptmalloc2_mmap':
: undefined reference to `__mmap'
make[2]: *** [opal_wrapper] Error 1
make[2]: Leaving directory `/home/dad/model/openmpi-1.2/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/dad/model/openmpi-1.2/opal'
make: *** [all-recursive] Error 1

The configure command was

./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared --enable-static --prefix=/model/OPENMP_I

and it executed with no errors. I have attached both the config.log and the compile.log. Any help or direction would greatly be appreciated.
Re: [OMPI users] MPI processes swapping out
So far the described behavior seems as normal as expected. As Open MPI never goes into blocking mode, the processes will always spin between active and sleep mode. More processes on the same node lead to more time in system mode (because of the empty polls).

There is a trick in the trunk version of Open MPI which will trigger the blocking mode if and only if TCP is the only device used. Please try adding "--mca btl tcp,self" to your mpirun command line, and check the output of vmstat.

Thanks,
george.

On Mar 23, 2007, at 3:32 PM, Heywood, Todd wrote:

> Rolf,
>
>> Is it possible that everything is working just as it should?
>
> That's what I'm afraid of :-). But I did not expect to see such
> communication overhead due to blocking from mpiBLAST, which is very
> coarse-grained. I then tried HPL, which is computation-heavy, and found
> the same thing.
>
> Also, the system time seemed to correspond to the MPI processes cycling
> between run and sleep (as seen via top), and I thought that setting the
> mpi_yield_when_idle parameter to 0 would keep the processes from entering
> sleep state when blocking. But it doesn't.
>
> Todd
Re: [OMPI users] Cell EIB support for OpenMPI
The main problem with MPI is the huge number of functions in the API. Even if we implement only the 1.0 standard we still have several hundred functions around. Moreover, an MPI library is far from being a simple self-sufficient library: it requires a way to start and monitor processes, interact with the operating system, and so on. All in all we end up with a multi-hundred-KB library which in most applications will be used at only 10%.

We investigated this possibility a few months ago, but in front of the task of removing all unnecessary functions from Open MPI in order to get something that can fit in the 256KB of memory on the SPU (and of course still leave some empty room for the user) ...

Moreover, most of the Cell users we talked with are not interested in having MPI between the SPUs. There is only one thing they're looking for: removing the last unused SPU cycle from the pipeline!!! There is no room for anything MPI-like at that level.

george.

On Mar 22, 2007, at 12:30 PM, Marcus G. Daniels wrote:

> Hi,
>
> Has anyone investigated adding intra-chip Cell EIB messaging to OpenMPI?
> It seems like it ought to work. This paper seems pretty convincing:
> http://www.cs.fsu.edu/research/reports/TR-061215.pdf
Re: [OMPI users] Cell EIB support for OpenMPI
George Bosilca wrote:
> All in all we end up with a multi-hundred-KB library which in most applications will be used at only 10%.

Seems like it ought to be possible to do some coverage analysis for a particular application and figure out what parts of the library (and user code) to make adjacent in memory. Then the 10% could be put in the same overlay. Seems like the EIB is quite fast and can take some abuse in terms of swapping.

> Moreover, most of the Cell users we talked with are not interested in having MPI between the SPUs. There is only one thing they're looking for: removing the last unused SPU cycle from the pipeline!!! There is no room for anything MPI-like at that level.

I imagine that OpenMP might be a good option for the Cell, and it even sounds like maybe there will be a GCC option: http://gcc.gnu.org/ml/gcc-patches/2006-05/msg00987.html

...but even so, there are more existing scientific codes for MPI than OpenMP. Even if the thing was a dog initially, and yielded a 2x speedup instead of 10x compared to typical CPUs, it would still be useful for installations with large Cell deployments that could well be risking underutilization or hogging due to poor tools support.

I have not investigated how much of the SPU C library stuff is missing to make OpenMPI compile, but that's at least a fixable and independently useful thing to have for Cell users.

Marcus