[OMPI users] jobs with more than 2,500 processes will not even start
About 9 months ago we had a new installation with a system of 1800 cores and at the time we found that jobs with more than 1028 cores would not start. At the time a colleague found that setting OMPI_MCA_plm_rsh_num_concurrent=256 helped with the problem. We have now increased our processor count to more than 2700 cores and a job with 2,500 processes does not start. Is there any advice? Best wishes, Lydia Heck -- Dr E L Heck Senior Computer Manager University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] jobs with more than 2,500 processes will not even start
I have experimented a bit more and found that if I set OMPI_MCA_plm_rsh_num_concurrent=1024, a job with more than 2,500 processes will start and run. However, when I searched the open-mpi web site for the variable I could not find any mention of it. Best wishes, Lydia Heck -- Message: 15 Date: Tue, 14 Dec 2010 16:10:01 + (GMT) From: Lydia Heck Subject: [OMPI users] jobs with more than 2,500 processes will not even start To: us...@open-mpi.org About 9 months ago we had a new installation with a system of 1800 cores and at the time we found that jobs with more than 1028 cores would not start. At the time a colleague found that setting OMPI_MCA_plm_rsh_num_concurrent=256 helped with the problem. We have now increased our processor count to more than 2700 cores and a job with 2,500 processes does not start. Is there any advice? Best wishes, Lydia Heck
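For reference, the plm_rsh_num_concurrent parameter (the number of concurrent rsh/ssh daemon launches mpirun keeps in flight) can be set either through the environment, as in the posts above, or directly on the mpirun command line. A minimal sketch, where the hostfile "hosts" and the binary "a.out" are only placeholders:

   # via the environment, as in the posts above
   export OMPI_MCA_plm_rsh_num_concurrent=1024
   mpirun --hostfile hosts -np 2500 ./a.out

   # equivalently, on the mpirun command line
   mpirun --mca plm_rsh_num_concurrent 1024 --hostfile hosts -np 2500 ./a.out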
[OMPI users] errno=131 ?
One of our programs has got stuck - it has not terminated - with the error message: mca_btl_tcp_frag_send: writev failed with errno=131. Searching the openmpi web site did not result in a positive hit. What does it mean? I am running 1.2.1r14096. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
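The errno can be decoded on the machine where the job ran by looking it up in the system headers; a sketch for Solaris, which is what these clusters are running. On Solaris, errno 131 should correspond to ECONNRESET, i.e. the remote peer closed the TCP connection, but it is worth confirming against the local headers:

   # look up errno 131 in the Solaris system headers
   grep 131 /usr/include/sys/errno.h
   # expected to show ECONNRESET ("Connection reset by peer") among the matches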
[OMPI users] how to select a specific network
I have a setup which contains one set of machines with one nge and one e1000g network, and another set of machines with two e1000g networks configured. I am planning a large run where all these computers will be occupied with one job, and the mpi communication should only go over one specific network, which is configured over e1000g0 on the first set of machines and over e1000g1 on the second set. For obvious reasons I cannot simply include all of e1000g, or exclude part of e1000g - if that is even possible. So I have to include or exclude on the IP number range. Is there an obvious flag - which I have not yet found - to tell mpirun to use one specific network? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] how to select a specific network
I should have added that the two networks are not routable, and that they are private class B. On Fri, 11 Jan 2008, Lydia Heck wrote: > > I have a setup which contains one set of machines > with one nge and one e1000g network and of machines > with two e1000g networks configured. I am planning a > large run where all these computers will be occupied > with one job and the mpi communication should only go > over one specific network which is configured over > e1000g0 on the first set of machines and on e1000g1 on the > second set. I cannot use - for obvious reasons to either > include all of e1000g or to exclude part of e1000g - if that is > possible. > So I have to exclude or include on the internet number range. > > Is there an obvious flag - which I have not yet found - to tell > mpirun to use one specific network? > > Lydia > > -- > Dr E L Heck > > University of Durham > Institute for Computational Cosmology > Ogden Centre > Department of Physics > South Road > > DURHAM, DH1 3LE > United Kingdom > > e-mail: lydia.h...@durham.ac.uk > > Tel.: + 44 191 - 334 3628 > Fax.: + 44 191 - 334 3645 > ___ > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] users Digest, Vol 787, Issue 1
Hi Adrian, you guessed right: it is solaris. The user file space is shared and so ~/.openmpi is the same on all machines. I cannot disable the "unwanted" interface because it is carrying all the other services such as NIS, NFS etc. So the only way is to address the network by its IP number range. The question therefore is: can that be done? I have looked through the description of the MCAs and I fear that I could not find an indication for that. Lydia > > Date: Fri, 11 Jan 2008 13:34:16 +0100 > From: a...@drcomp.erfurt.thur.de (Adrian Knoth) > Subject: Re: [OMPI users] how to select a specific network > To: Open MPI Users > Message-ID: <2008023416.gq11...@ltw.loris.tv> > Content-Type: text/plain; charset=iso-8859-1 > > On Fri, Jan 11, 2008 at 11:36:23AM +, Lydia Heck wrote: > > > I have a setup which contains one set of machines > > with one nge and one e1000g network and of machines > > with two e1000g networks configured. I am planning a > > Are we talking about shared filesystems or can you place different > ~/.openmpi/mca-params.confs across different machines? If so, just > specify the interfaces you want to exclude/include on each machine. > > If nothing helps, either shutdown the unnecessary interfaces or use > interface renaming. > > nge sounds like Solaris, unfortunately I'm not common with it. Under > Linux, one would rename either the required or the unwanted interfaces, > depending if you include or exclude. > > We have something like this: > > adi@amun:~$ ip r s > 192.168.4.0/24 dev ethmp proto kernel scope link src 192.168.4.130 > 192.168.3.0/24 dev ethmp proto kernel scope link src 192.168.3.130 > 192.168.1.0/24 dev ethsvc proto kernel scope link src 192.168.1.130 > default via 192.168.1.12 dev ethsvc > > The "ethmp" is "ethernet message passing", "ethsvc" is "ethernet service > network". That's more or less the same you want: a dedicated network for > message passing. > > So you would obviously include ethmp in your mca-params.conf file. > > > Under Linux, the tool to rename interfaces is called "nameif", but I > guess it cannot be used for Solaris (interface names are kernel space, > and Linux kernel != Solaris kernel). > > > HTH > > -- > Cluster and Metacomputing Working Group > Friedrich-Schiller-Universität Jena, Germany > > private: http://adi.thur.de > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
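For the record, the knob involved here is the TCP BTL interface selection. A minimal sketch of the two usual forms - selection by interface name (per host, e.g. in ~/.openmpi/mca-params.conf) or by address range on the mpirun command line. Address/CIDR selection for btl_tcp_if_include is only available in some Open MPI releases, so treat the second form as something to verify against the installed version, and 172.16.0.0/16 is just a placeholder for the private class B network mentioned above:

   # per-host mca-params.conf entry, selecting by interface name
   btl_tcp_if_include = e1000g0

   # or on the mpirun command line, selecting by IP range (newer releases only)
   mpirun --mca btl_tcp_if_include 172.16.0.0/16 --hostfile hosts -np 64 ./a.out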
[OMPI users] mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error
In one of our big runs (512 cpus) the code fails and produces the following type of error on a list of nodes. I have searched the FAQs but could not find an answer there. There are difficulties getting the code to run because of its sheer size but there is no other indication of the problem. Does the following error message mean that some of the nodes have given up? mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error ([361eca8[m2234][0,1,283][m2317, 16][0,) 1Bad address,422(3) ][[ /ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c:114:mca_btl_tcp _frag_send] /ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c[m22 41][0,1,430][m2140[m2152][0,1,150][mca_btl_tcp_frag_send: writev error (3c759a8, 16) Bad address(3) Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] gadget-3 locks up using openmpi and infiniband (or myrinet)
One of the big cosmology codes is Gadget-3 (Springel et al). The code uses MPI for interprocess communications. At the ICC in Durham we use OpenMPI and have been using it for ~3 years. At the ICC Gadget-3 is one of the major research codes and we have been running it since it was written, and we have observed something which is very worrying: When running over gigabit using -mca btl tcp,self,sm the code runs alright, which is good as the largest part of our cluster is over gigabit, and as Gadget-3 scales rather well, the penalty for running over gigabit is not prohibitive. We also have a myrinet cluster and on there larger runs freeze. However as the gigabit cluster was available we have not really investigated this until just now. We currently have access to an infiniband cluster and we found the following: in a specific set of blocked sendrecv sections it seems to communicate in pairs until in the end there is only one pair of processes left, where it deadlocks. For that pair the processes have set up communications, they know about each other's IDs, they know what datatype to communicate, but they never communicate that data. The precise point in the run where this happens cannot be pinned down, i.e. in consecutive runs it does not freeze at the same point in the run. This is using openmpi and the behaviour has persisted over different versions of openmpi (judging from our myrinet experience). I should mention that the communication on either the myrinet cluster or the infiniband cluster does work properly, as runs of other codes (castep, b_eff) show. So my question(s) is (are): has anybody had similar experiences and/or would anybody have an idea why this could happen and/or what we could do about it? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31
We are running Sun's build of Open MPI 1.3.3r21324-ct8.2-b09b-r31 (HPC8.2) and one code that runs perfectly fine under HPC8.1 (Open MPI) 1.3r19845-ct8.1-b06b-r21 and before fails with [oberon:08454] *** Process received signal *** [oberon:08454] Signal: Segmentation Fault (11) [oberon:08454] Signal code: Address not mapped (1) [oberon:08454] Failing at address: 0 /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libopen-pal.so.0.0.0:0x4b89e /lib/amd64/libc.so.1:0xd0f36 /lib/amd64/libc.so.1:0xc5a72 0x0 [ Signal 11 (SEGV)] /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Alloc_mem+0x7f /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Sendrecv_replace+0x31e /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi_f77.so.0.0.0:PMPI_SENDRECV_REPLACE+0x94 /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:mpi_cyclic_transfer_+0xd9 /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:cycle_particles_and_interpolate_+0x94b /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:interpolate_field_+0xc30 /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:MAIN_+0xe68 /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:main+0x3d /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:0x62ac [oberon:08454] *** End of error message *** -- mpirun noticed that process rank 0 with PID 8454 on node oberon exited on signal 11 (Segmentation Fault). I have not tried to get and build a newer Open MPI, so I do not know if the problem propagates into the more recent versions. If the developers are interested, I could ask the user to prepare the code for you to have a look at the problem, which looks to be in MPI_Alloc_mem. Best wishes, Lydia Heck -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] using the carto facility
I was advised for a benchmark to use the OPAL carto option to assign specific cores to a job. I searched the web for an example but have only found one set of man pages, which is rather cryptic and assumes the knowledge of a programmer rather than an end user. Has anybody out there used this option, and if so, would you be prepared to share an example which could be adapted for a shared memory system with a large number of cores? Thanks. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
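As a related approach (not the carto facility itself), recent Open MPI releases can pin ranks to specific cores with a rankfile. A minimal sketch under the assumption of two placeholder hosts, hostA and hostB, with at least two cores each; the exact syntax should be checked against the mpirun man page of the installed version:

   # rankfile: rank N=<host> slot=<socket>:<core>
   rank 0=hostA slot=0:0
   rank 1=hostA slot=0:1
   rank 2=hostB slot=0:0
   rank 3=hostB slot=0:1

   mpirun -np 4 --hostfile hosts --rankfile my_rankfile ./a.out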
[OMPI users] sed: command garbled:
I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11 compilers and including the mx drivers. I have gone past some hurdles. However when the configure script nears its end, where Makefiles are prepared, I get error messages of the form: config.status: creating ompi/mca/osc/rdma/Makefile sed: command garbled: s,@OMPI_CXX_ABSOLUTE@,/opt/studio11/SUNWspro/bin/CC sed: command garbled: s,@OMPI_F90_ABSOLUTE@,/opt/studio11/SUNWspro/bin/f95 sed: command garbled: s,@OMPI_CC_ABSOLUTE@,/opt/studio11/SUNWspro/bin/cc config.status: creating ompi/mca/pml/cm/Makefile This is with the system's sed command. I have tried to use the gnu sed command and get instead sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command config.status: creating orte/Makefile sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command config.status: creating orte/include/Makefile sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command config.status: creating orte/etc/Makefile sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command config.status: creating orte/tools/orted/Makefile sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command Is there anything I am overlooking? Lydia
Re: [OMPI users] sed: command garbled:
My apologies I forgot to attach the config.log file. On Thu, 21 Sep 2006, Lydia Heck wrote: > > I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11 > compilers and including the mx drivers. I have gone past some hurdles. > However when the configure script nears its end where Makefiles are prepared > I get error messages of the form: > > config.status: creating ompi/mca/osc/rdma/Makefile > sed: command garbled: s,@OMPI_CXX_ABSOLUTE@,/opt/studio11/SUNWspro/bin/CC > sed: command garbled: s,@OMPI_F90_ABSOLUTE@,/opt/studio11/SUNWspro/bin/f95 > sed: command garbled: s,@OMPI_CC_ABSOLUTE@,/opt/studio11/SUNWspro/bin/cc > config.status: creating ompi/mca/pml/cm/Makefile > > > This is with the system's sed command. > > I have tried to use the > gnu sed command and get instead > > > sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command > sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command > config.status: creating orte/Makefile > sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command > sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command > config.status: creating orte/include/Makefile > sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command > sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command > config.status: creating orte/etc/Makefile > sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command > sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command > config.status: creating orte/tools/orted/Makefile > sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command > sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command > > > > Is there anything I do overlook? > > Lydia > > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___ openmpi-config.log.gz Description: config.log of attempted configuration
[OMPI users] openmpi 1.3a1r12121 ...
I know that with 1.3a1 I am looking at a development release. However I do need the SGE (GridEngine) support and I could not find a download for a stable (or any other) 1.2 release. So I downloaded 1.3a1r12121 and tried to configure it. In my configuration I use --with-mx=/opt/mx (where the MX software is installed); also --with-mx-libdir=/opt/mx/lib64, because I build for 64 bit only. Then I use the Sun Studio11 compilers and the configuration fails with --- MCA component btl:mx (m4 configuration macro) checking for MCA component btl:mx compile mode... dso checking myriexpress.h usability... no checking myriexpress.h presence... no checking for myriexpress.h... no configure: error: MX support requested but not found. Aborting I have tried everything, entering under CFLAGS etc -I/opt/mx/include and modifying --with-mx=/opt/mx/include; each time the configure fails with the same error. Yes, mx is definitely installed, and yes the path to mx is definitely /opt/mx ... Any ideas? Lydia Heck -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] openmpi 1.3a1r12121 ...
I have attached the config.log file. Here are also the instructions which I have included in the configuration. In previous configuration attempts I had --with-mx=/opt/mx where /opt/mx is the toplevel directory under which mx is installed. The result of the configuration attempt was the same, with the same error messages.

#!/bin/ksh
CC="/opt/studio11/SUNWspro/bin/cc"
CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
LDFLAGS="-xarch=amd64a -I/opt/mx/include -L/opt/mx/lib64 \
 -L/opt/SUNWsge/lib/sol-amd64 -R/opt/mx/lib64 -R/opt/SUNWsge/lib/sol-amd64"
CXX="/opt/studio11/SUNWspro/bin/CC"
CXXFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
F77="/opt/studio11/SUNWspro/bin/f95"
FFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
FC="/opt/studio11/SUNWspro/bin/f95"
FCFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
PATH=/opt/studio11/SUNWspro/bin:/opt/csw/bin:/opt/sfw/bin:/usr/sfw/bin:"$PATH":/usr/ucb
export CC CFLAGS LDFLAGS CXX CXXFLAGS F77 FFLAGS FC FCFLAGS PATH

./configure --prefix=/opt/openMPI --with-mx=/opt/mx/lib64 \
 --with-mx-libdir=/opt/mx/lib64 \
 --with-wrapper-cflags=-xarch=amd64a \
 --with-wrapper-cxxflags=-xarch=amd64a \
 --with-wrapper-fflags=-xarch=amd64a \
 --with-wrapper-fcflags=-xarch=amd64a \
 --with-wrapper-ldflags=-xarch=amd64a \
 --enable-mpirun-prefix-by-default \
 --enable-dependency-tracking \
 --enable-cxx-exceptions \
 --enable-smp-locks \
 --enable-mpi-threads \
 --enable-progress-threads \
 --with-threads=solaris

On Tue, 17 Oct 2006, Lydia Heck wrote: > > I know that with 1.3a1 I a looking at a development release. > HOwever I do need the SGE (GridEngine) support and I could not find > a download for a stable (or any other) 1.2 release. > > So I downloaded 1.3a1r12121 and tried to configure it. > > In my configuration I use > > --with-mx=/opt/mx (where the MX software is installed); also > --with-mx-libdir=/opt/mx/lib64, because I build for 64 bit only. > > Then I use the Sun Studio11 compilers and the configuration fails > > with > > > --- MCA component btl:mx (m4 configuration macro) > checking for MCA component btl:mx compile mode... dso > checking myriexpress.h usability... no > checking myriexpress.h presence... no > checking for myriexpress.h... no > > > configure: error: MX support requested but not found. Aborting > > > I have tried everything, entering under CFLAGS etc > -I/opt/mx/include > > modifying > > --with-mx=/opt/mx/include > > each the configure fails with the same error. > > Yes, mx is definitely installed, and yes the path to mx is definitely > /opt/mx ... > > Any ideas > > Lydia Heck > > -- > Dr E L Heck > > University of Durham > Institute for Computational Cosmology > Ogden Centre > Department of Physics > South Road > > DURHAM, DH1 3LE > United Kingdom > > e-mail: lydia.h...@durham.ac.uk > > Tel.: + 44 191 - 334 3628 > Fax.: + 44 191 - 334 3645 > ___ > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___ openmpi-config.log.gz Description: config.log of the configuration with mx
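One detail that stands out in the script above: --with-mx points at the library directory (/opt/mx/lib64) rather than at the installation prefix. The MX configure test generally expects to find myriexpress.h under the directory given to --with-mx, so the more conventional form - sketched here under the assumption that the headers live in /opt/mx/include - keeps the prefix and the 64-bit library directory separate:

   ./configure --prefix=/opt/openMPI \
    --with-mx=/opt/mx \
    --with-mx-libdir=/opt/mx/lib64 \
    ...   # remaining options as in the script above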
[OMPI users] job fails to terminate
I have recently installed openmpi 1.3r1212a over tcp and gigabit on a Solaris 10 x86/64 system. Some test codes - monte (a Monte Carlo estimate of pi), connectivity (which tests connectivity between processes and nodes) and prime (which calculates prime numbers); these test codes are examples bundled with Sun HPC - compile fine using the openmpi versions of mpicc, mpif95 and mpic++. And sometimes the jobs work fine, but most of the time the jobs freeze, leaving zombies behind. My run time command is mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 monte and I get as output oberon(209) > mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 monte Monte-Carlo estimate of pi by 14 processes is 3.141503. with the cursor hanging. The process table shows oberon# ps -eaf | grep dph0elh dph0elh 9583 7445 7 17:45:01 pts/26 9:22 mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 mon dph0elh 9595 9588 0- ? 0:02 dph0elh 9588 1 7 17:45:01 ?? 9:03 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename oberon dph0elh 7445 6924 0 17:01:38 pts/26 0:00 -tcsh root 9656 4151 0 18:01:31 pts/36 0:00 grep dph0elh dph0elh 9593 9588 0- ? 0:02 One of the nodes offers 8 cpus, the other nodes in the hostfile offer 2. There are a total of 14 cpus available, and as you can see from the command line I use --mca btl tcp,self. There are no other interconnects. I could not find any entry in the FAQs, except for the advice on using --mca btl tcp,self. -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] job fails to terminate
In answer to Ralph's request and question. Indeed the version number was incorrect; it should have been openmpi-1.3a1r12121. My configure command is

#!/bin/ksh
CC="/opt/studio11/SUNWspro/bin/cc"
CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
LDFLAGS="-xarch=amd64a -I/opt/mx/include -L/opt/SUNWsge/lib/sol-amd64 -R/opt/mx/lib64 -R/opt/SUNWsge/lib/sol-amd64"
CXX="/opt/studio11/SUNWspro/bin/CC"
CXXFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"
F77="/opt/studio11/SUNWspro/bin/f95"
FFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"
FC="/opt/studio11/SUNWspro/bin/f95"
FCFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"
PATH=/opt/studio11/SUNWspro/bin:/opt/csw/bin:/opt/sfw/bin:/usr/sfw/bin:"$PATH":/usr/ucb
export CC CFLAGS LDFLAGS CXX CXXFLAGS F77 FFLAGS FC FCFLAGS PATH

./configure --prefix=/opt/openMPI-GB \
 --with-wrapper-cflags=-xarch=amd64a \
 --with-wrapper-cxxflags=-xarch=amd64a \
 --with-wrapper-fflags=-xarch=amd64a \
 --with-wrapper-fcflags=-xarch=amd64a \
 --with-wrapper-ldflags=-xarch=amd64a \
 --enable-mpirun-prefix-by-default \
 --enable-dependency-tracking \
 --enable-cxx-exceptions \
 --enable-smp-locks \
 --enable-mpi-threads \
 --enable-progress-threads \
 --with-threads=solaris

Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] users Digest, Vol 411, Issue 2
Hi Ralph, which of the thread options should I remove: > > --enable-mpi-threads \ > > --enable-progress-threads \ > > --with-threads=solaris all of them? Lydia > > -- > > Message: 1 > Date: Fri, 20 Oct 2006 06:30:36 -0600 > From: Ralph H Castain > Subject: Re: [OMPI users] job fails to terminate > To: "Open MPI Users " > Message-ID: > Content-Type: text/plain; charset="US-ASCII" > > Hi Lydia > > Thanks - that does help! > > Could you try this without threads? We have tried to make the system work > with threads, but our testing has been limited. First thing I would try is > to make sure that we aren't hitting a thread-lock. > > Thanks > Ralph > > > > On 10/20/06 2:11 AM, "Lydia Heck" wrote: > > > > > In answer to Ralph's request and question. > > > > Indeed the version number was incorrect it should have been > > > > openmpi-1.3a1r12121 > > > > my configure command is > > > > #!/bin/ksh > > CC="/opt/studio11/SUNWspro/bin/cc" > > CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include" > > LDFLAGS="-xarch=amd64a -I/opt/mx/include -L/opt/SUNWsge/lib/sol-amd64 > > -R/opt/mx/lib64 -R/opt/SUNWsge/lib/sol-amd64" > > CXX="/opt/studio11/SUNWspro/bin/CC" > > CXXFLAGS="-xarch=amd64a -I/opt/SUNWsge/include" > > F77="/opt/studio11/SUNWspro/bin/f95" > > FFLAGS="-xarch=amd64a -I/opt/SUNWsge/include" > > FC="/opt/studio11/SUNWspro/bin/f95" > > FCFLAGS="-xarch=amd64a -I/opt/SUNWsge/include" > > > > PATH=/opt/studio11/SUNWspro/bin:/opt/csw/bin:/opt/sfw/bin:/usr/sfw/bin:"$PATH" > > :/usr/ucb > > export CC CFLAGS LDFLAGS CXX CXXFLAGS F77 FFLAGS FC FCFLAGS PATH > > > > ./configure --prefix=/opt/openMPI-GB \ > > --with-wrapper-cflags=-xarch=amd64a \ > > --with-wrapper-cxxflags=-xarch=amd64a \ > > --with-wrapper-fflags=-xarch=amd64a \ > > --with-wrapper-fcflags=-xarch=amd64a \ > > --with-wrapper-ldflags=-xarch=amd64a \ > > --enable-mpirun-prefix-by-default \ > > --enable-dependency-tracking \ > > --enable-cxx-exceptions \ > > --enable-smp-locks \ > > --enable-mpi-threads \ > > --enable-progress-threads \ > > --with-threads=solaris > > > > > > Lydia > > > > -- > > Dr E L Heck > > > > University of Durham > > Institute for Computational Cosmology > > Ogden Centre > > Department of Physics > > South Road > > > > DURHAM, DH1 3LE > > United Kingdom > > > > e-mail: lydia.h...@durham.ac.uk > > > > Tel.: + 44 191 - 334 3628 > > Fax.: + 44 191 - 334 3645 > > ___ > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > -- > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 411, Issue 2 > * > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
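For reference, a sketch of the same configure invocation with all three thread-related options removed, which appears to be what Ralph is suggesting to test:

   ./configure --prefix=/opt/openMPI-GB \
    --with-wrapper-cflags=-xarch=amd64a \
    --with-wrapper-cxxflags=-xarch=amd64a \
    --with-wrapper-fflags=-xarch=amd64a \
    --with-wrapper-fcflags=-xarch=amd64a \
    --with-wrapper-ldflags=-xarch=amd64a \
    --enable-mpirun-prefix-by-default \
    --enable-dependency-tracking \
    --enable-cxx-exceptions \
    --enable-smp-locks
   # --enable-mpi-threads, --enable-progress-threads and --with-threads=solaris dropped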
[OMPI users] btl mx : file not found
I have myricom mx installed and configured and its communications work (using mx commands such as mx_info to check). Then I configured openmpi-1.3a1r12408 with mx and the configuration gave no errors. The build of openmpi went without problems and it installed properly. I can build and link a program - and ldd shows the openmpi libraries linked accordingly. In order to run applications I set the LD_LIBRARY_PATH and the PATH correctly, but the command ompi_info | grep mx gives [m2001:12844] mca: base: component_find: unable to open mtl mx: file not found (ignored) [m2001:12844] mca: base: component_find: unable to open btl mx: file not found (ignored) [m2001:12844] mca: base: component_find: unable to open mtl mx: file not found (ignored) and indeed the job does not run if I give the instruction -mca btl mx. Any idea why this should happen? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___ openmpi-1.3a1r12408-config.log.gz Description: config.log of openmpi-1.3a1r1240 with myrinet mx
Re: [OMPI users] btl mx : file not found
I have solved this problem myself. The mx drivers are built using the gcc compilers, both in 64 and 32 bit. I was trying to build 64-bit openmpi on the sun and I am afraid I overlooked that I had to give the path to the 64-bit gcc libs EXPLICITLY in the build of the openmpi. These libraries were required for the mx libraries to be linked in correctly; otherwise the environment would only find the 32-bit versions, which led to the mx environment not being fully configured. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
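A sketch of what supplying that path explicitly can look like. The gcc library directory shown here (/usr/sfw/lib/amd64, where Solaris 10 typically keeps the bundled gcc's 64-bit runtime) is an assumption and should be replaced by the lib directory of whichever gcc built the mx drivers:

   # make the 64-bit gcc runtime visible at link time and at run time (-R)
   LDFLAGS="-xarch=amd64a -L/usr/sfw/lib/amd64 -R/usr/sfw/lib/amd64 -L/opt/mx/lib64 -R/opt/mx/lib64"
   export LDFLAGS
   ./configure --with-mx=/opt/mx --with-mx-libdir=/opt/mx/lib64 ...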
[OMPI users] myrinet mx and openmpi using solaris, sun compilers
I have built the myrinet drivers with gcc or the studio 11 compilers from sun. The following problem appears for both installations. I have tested the myrinet installations using myricom's own test programs. Then I built open-mpi using the studio11 compilers, enabling myrinet. All the library paths are correctly set and I can run my test program, which is written in C, successfully if I choose the number of CPUs to be equal to the number of nodes, which means one instance of the process per node! Each node has 4 CPUs. If I now request the number of CPUs for the run to be larger than the number of nodes I get an error message, which clearly indicates that the openmpi cannot communicate over more than one channel on the myrinet card. However I should be able to communicate over 4 channels at least - colleagues of mine are doing that using mpich and the same type of myrinet card. Any idea why this should happen? The hostfile looks like: m2009 slots=4 m2010 slots=4 but it will provide the same error if the hosts file is m2009 m2010 ompi_info | grep mx 2001(128) > ompi_info | grep mx MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2) MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2) m2009(160) > /opt/mx/bin/mx_endpoint_info 1 Myrinet board installed. The MX driver is configured to support up to 4 endpoints on 4 boards. === Board #0: Endpoint PID Command Info 15039 0 15544 There are currently 1 regular endpoint open m2001(120) > mpirun -np 6 -hostfile hostsfile -mca btl mx,self b_eff -- Process 0.1.0 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- Process 0.1.2 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- Process 0.1.4 is unable to reach 0.1.4 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- Process 0.1.1 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- Process 0.1.5 is unable to reach 0.1.4 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- Process 0.1.3 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. 
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): PML add procs failed --> Returned "Unreachable" (-12) instead of "Success" (0) -- PML add procs failed --> Returned "Unreachable" (-12) instead of "Success" (0) -- --
Re: [OMPI users] myrinet mx and openmpi using solaris, sun compilers
Thank you very much. I tried mpirun -np 6 -machinefile ./myh -mca pml cm ./b_eff and to amuse you mpirun -np 6 -machinefile ./myh -mca btl mx,sm,self ./b_eff with myh containing two host names and both commands went swimmingly. To make absolutely sure, I checked the usage of the myrinet ports and on each system 3 myrinet ports were open. Lydia On Mon, 20 Nov 2006 users-requ...@open-mpi.org wrote: > > -- > > Message: 2 > Date: Mon, 20 Nov 2006 20:05:22 + (GMT) > From: Lydia Heck > Subject: [OMPI users] myrinet mx and openmpi using solaris, sun > compilers > To: us...@open-mpi.org > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII > > > I have built the myrinet drivers with gcc or the studio 11 compilers from sun. > The following problem appears for both installations. > > I have tested the myrinet installations using myricoms own test programs. > > Then I build open-mpi using the studio11 compilers enabling myrinet. > > All the library paths are correctly set and I can run my test program > which is written in C, successfully, if I choose the number of CPUs to be > equal > the number of nodes, which means one instance of process per node! > > Each node has 4 CPUs. > > If I now request the number of CPUs for the run to be larger than the > number of nodes I get an error message, which clearly indicates > that the openmpi cannot communicate over more than one channel > on the myrinet card. However I should be able to communicate over > 4 channels at least - colleagues of mine are doing that using > mpich and the same type of myrinet card. > > Any idead why this should happen? > > the hostfile looks like: > > m2009 slots=4 > m2010 slots=4 > > > but it will provide the same error if the hosts file is > > m2009 > m2010 > > ompi_info | grep mx > 2001(128) > ompi_info | grep mx > MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2) > MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2) > m2009(160) > /opt/mx/bin/mx_endpoint_info > 1 Myrinet board installed. > The MX driver is configured to support up to 4 endpoints on 4 boards. > === > Board #0: > Endpoint PID Command Info >15039 > 0 15544 > There are currently 1 regular endpoint open > > > > > m2001(120) > mpirun -np 6 -hostfile hostsfile -mca btl mx,self b_eff > -- > Process 0.1.0 is unable to reach 0.1.0 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > Process 0.1.2 is unable to reach 0.1.0 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > Process 0.1.4 is unable to reach 0.1.4 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > Process 0.1.1 is unable to reach 0.1.0 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > Process 0.1.5 is unable to reach 0.1.4 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > Process 0.1.3 is unable to reach 0.1.0 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -
[OMPI users] openmpi, mx
I have - again - successfully built and installed mx and openmpi and I can run 64- and 128-CPU jobs on a 256-CPU cluster. The version of openmpi is 1.2b1; the compiler used is studio11. The code is the benchmark b_eff, which usually runs fine - I have used it extensively for benchmarking. When I try 192 CPUs I get m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24. ... .. .. The myrinet ports have been opened and the job is running, as one of the nodes shows: ps -eaf | grep dph0elh dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 - root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff Any idea? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
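Errno 24 on Solaris (as on Linux) is EMFILE, "too many open files", so the accept() failures point at the per-process file descriptor limit on the node running mpirun rather than at Myrinet. A sketch of checking and raising the limit in the launching shell before starting the job; the value 4096 is only an illustration:

   # show the current descriptor limit, then raise it for this shell
   # (ksh/bash syntax; under tcsh the equivalent is "limit descriptors 4096")
   ulimit -n
   ulimit -n 4096
   # then launch the 192-CPU b_eff run as before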
[OMPI users] openmpi - mx - solaris and Gadget2
Gadget2 - I cannot attach it because it is not publicly available, runs perfectly fine on any number of processes on systems such as Solaris 10 - Sun CT6 gigabit, SUN CT5 and myrinet gm, IBM regatta .. Sorry to be so expansive ... When I run the code on 32 CPUs on openmpi, mx using the studio11 compilers on a solaris x64 system the code works fine, until about the end, when it fails to write all the restart files. When I run the code on 64 CPUs it fails with an error message which is Topnodes=218193 costlimit=0.0890015 countlimit=428.229 Before=44417 After=46281 NTopleaves= 40496 NTopnodes=46281 (space for 347252) desired memory imbalance=2.83425 (limit=100719, needed=114185) Note: the domain decomposition is suboptimum because the ceiling for memory-imbalance is reached work-load balance=1.28529 memory-balance=1.01948 exchange of 0002589387 particles Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR) Failing at addr:5192cbd0 /opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10 /opt/ompi/lib/libopal.so.0.0.0:0x99df5 /lib/amd64/libc.so.1:0xcb276 /lib/amd64/libc.so.1:0xc0642 /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)] /opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174 /opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116 /opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b /opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc *** End of error message *** 63 additional processes aborted (not shown) m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm ./Gadget2 param.txt As this is one of our predominant production codes, I need to make sure that it is running on any system which I install. Any idea would be welcome. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] openmpi - mx - solaris and Gadget2
The same run on 32 CPUs almost completes, starting to write 32 re-start files and fails with the same problem: Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR) Failing at addr:33 /opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10 /opt/ompi/lib/libopal.so.0.0.0:0x99df5 /lib/amd64/libc.so.1:0xcb276 /lib/amd64/libc.so.1:0xc0642 /opt/mx/lib/amd64/libmyriexpress.so:0x102c7 [ Signal 11 (SEGV)] /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0x3d /opt/mx/lib/amd64/libmyriexpress.so:mx__test_common+0x22 /opt/mx/lib/amd64/libmyriexpress.so:mx_test+0x37 /opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_send+0x288 /opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_send+0x3fc /opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_sendrecv_actual_localcompleted+0x85 /opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_recursivedoubling+0x1a3 /opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_dec_fixed+0x44 /opt/ompi/lib/libmpi.so.0.0.0:MPI_Barrier+0x9d /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:restart+0x9a0 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x219 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc *** End of error message *** mv: cannot access ./restart.20 31 additional processes aborted (not shown) m2001(27) > On Thu, 23 Nov 2006, Lydia Heck wrote: > > Gadget2 - I cannot attach it because it is not publicly available, > runs perfectly fine on any number of processes on systems such > as Solaris 10 - Sun CT6 gigabit, SUN CT5 and myrinet gm, IBM regatta .. > > Sorry to be so expansive ... > > When I run the code on 32 CPUs on openmpi, mx using the studio11 compilers > on a solaris x64 system the code works fine, until about the end, when > it fails to write all the restart files. 
> > When I run the code on 64 CPUs it fails with an error message which is > > Topnodes=218193 costlimit=0.0890015 countlimit=428.229 > Before=44417 > After=46281 > NTopleaves= 40496 NTopnodes=46281 (space for 347252) > desired memory imbalance=2.83425 (limit=100719, needed=114185) > Note: the domain decomposition is suboptimum because the ceiling for > memory-imbalance is reached > work-load balance=1.28529 memory-balance=1.01948 > exchange of 0002589387 particles > Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR) > Failing at addr:5192cbd0 > /opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10 > /opt/ompi/lib/libopal.so.0.0.0:0x99df5 > /lib/amd64/libc.so.1:0xcb276 > /lib/amd64/libc.so.1:0xc0642 > /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)] > /opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174 > /opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116 > /opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b > /opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7 > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467 > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191 > /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc > *** End of error message *** > 63 additional processes aborted (not shown) > m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm > ./Gadget2 param.txt > > As this is one of our predominant production codes, I need to make sure > that it is running on any system which I install. Any idea would be welcome. > > Lydia > > > > -- > Dr E L Heck > > University of Durham > Institute for Computational Cosmology > Ogden Centre > Department of Physics > South Road > > DURHAM, DH1 3LE > United Kingdom > > e-mail: lydia.h...@durham.ac.uk > > Tel.: + 44 191 - 334 3628 > Fax.: + 44 191 - 334 3645 > ___ > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] openmpi - mx - solaris and Gadget2 - add on
I saved two cores, which might be of interest. However they are so large that I cannot attach them to any email. But I am very willing to submit them if requested. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] problem building openmpi-1.2b1r12657
The configuration of openmpi-1.2b1r12657 goes fine. When I try to build, I get somewhere into the build the following error message. DEPDIR=.deps depmode=none /bin/bash ../../../../config/depcomp \ /bin/bash ../../../../libtool --tag=CC --mode=compile /opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I. ./../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../ompi/include -I.. /../../..-DNDEBUG -g -O -xtarget=opteron -xarch=amd64 -mt -c -o common_sm_mmap.lo common_sm_mmap.c libtool: compile: /opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I../../../../opal/include -I../../../ ../orte/include -I../../../../ompi/include -I../../../../ompi/include -I../../../.. -DNDEBUG -g -O -xtarget=opt eron -xarch=amd64 -mt -c common_sm_mmap.c -KPIC -DPIC -o .libs/common_sm_mmap.o Assembler: common_sm_mmap.c "/tmp/IAAztaqgp", line 11799 : Trouble closing elf file cc: ube failed for common_sm_mmap.c gmake[2]: *** [common_sm_mmap.lo] Error 1 gmake[2]: Leaving directory `/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi/mca/common/sm' gmake[1]: *** [all-recursive] Error 1 gmake[1]: Leaving directory `/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi' gmake: *** [all-recursive] Error 1 I know that this is in development, but openmpi-1.2b1 fails to run one of our major codes. So I hoped that with the more recent version I would be more successful. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] problem building openmpi-1.2b1r12657
My apologies - this was a red herring. It turned out that I had filled the disk. It so happened that the same error was repeated several times, even after reconfiguring. Lydia On Sat, 25 Nov 2006, Lydia Heck wrote: > > The configuration of openmpi-1.2b1r12657 goes fine. > When I try to build I get somewhere are into the buid the following > error message. > > > DEPDIR=.deps depmode=none /bin/bash ../../../../config/depcomp \ > /bin/bash ../../../../libtool --tag=CC --mode=compile > /opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I. > ./../../../opal/include -I../../../../orte/include -I../../../../ompi/include > -I../../../../ompi/include -I.. > /../../..-DNDEBUG -g -O -xtarget=opteron -xarch=amd64 -mt -c -o > common_sm_mmap.lo common_sm_mmap.c > libtool: compile: /opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. > -I../../../../opal/include -I../../../ > ../orte/include -I../../../../ompi/include -I../../../../ompi/include > -I../../../.. -DNDEBUG -g -O -xtarget=opt > eron -xarch=amd64 -mt -c common_sm_mmap.c -KPIC -DPIC -o > .libs/common_sm_mmap.o > Assembler: common_sm_mmap.c > "/tmp/IAAztaqgp", line 11799 : Trouble closing elf file > cc: ube failed for common_sm_mmap.c > gmake[2]: *** [common_sm_mmap.lo] Error 1 > gmake[2]: Leaving directory > `/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi/mca/common/sm' > gmake[1]: *** [all-recursive] Error 1 > gmake[1]: Leaving directory `/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi' > gmake: *** [all-recursive] Error 1 > > > I know that this is in development, but the openmpi-1.2b1 > fails to run one our major codes. So I hoped that with the more > recent version I would be more successful > > Lydia > > > -- > Dr E L Heck > > University of Durham > Institute for Computational Cosmology > Ogden Centre > Department of Physics > South Road > > DURHAM, DH1 3LE > United Kingdom > > e-mail: lydia.h...@durham.ac.uk > > Tel.: + 44 191 - 334 3628 > Fax.: + 44 191 - 334 3645 > ___ > -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
Re: [OMPI users] users Digest, Vol 443, Issue 1
You have to make sure that the path to the gm libraries is fully set at runtime of your code: LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/xx/gm/lib and of course xx stands for the path to where the gm directory is located. Also for better performance you might want to use the sun compilers for f77 as well. export F77=/opt/SUNWspro/bin/f95 export FC=/opt/SUNWspro/bin/f95 Lydia > > Message: 3 > Date: Sat, 25 Nov 2006 22:15:07 -0400 > From: brem...@unb.ca > Subject: [OMPI users] Myrinet/GM can't find any NICs > To: us...@open-mpi.org > Message-ID: <0tk61juf6c.wl%brem...@pivot.cs.unb.ca> > Content-Type: text/plain; charset=US-ASCII > > > Dear experts; > > I built openmpi-1.2b1 on solaris x86, enabling GM. Test jobs seem to > run OK, but I assume it is falling back on TCP over ethernet. > On of the following messages for each node. > (The output from ompi_info follows; config.log and the full output can > be found at http://www.cs.unb.ca/~bremner/openmpi) > > [cl023:14729] [0,1,1] gm_port 0828CBA8, board 0, global 3712481415 node 1 > port 4 > [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14729] [btl_gm_component.c:409] unable to initialze gm port > [cl023:14727] [0,1,0] gm_port 0828CBA8, board 0, global 3712481415 node 1 > port 5 > [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory > [cl023:14727] [btl_gm_component.c:409] unable to initialze gm port > -- > [0,1,0]: Myrinet/GM on host cl023 was unable to find any NICs. > Another transport will be used instead, although this may result in > lower performance. 
> -- > > > > Open MPI: 1.2b1 >Open MPI SVN revision: r12562 > Open RTE: 1.2b1 >Open RTE SVN revision: r12562 > OPAL: 1.2b1 >OPAL SVN revision: r12562 > Prefix: /home/dbremner/pkg/openmpi-1.2b1-gm > Configured architecture: i386-pc-solaris2.10 >Configured by: >Configured on: Sat Nov 25 16:56:01 AST 2006 > Configure host: clhead > Built by: dbremner > Built on: Saturday November 25 17:16:33 AST 2006 > Built host: clhead > C bindings: yes > C++ bindings: yes > Fortran77 bindings: yes (all) > Fortran90 bindings: no > Fortran90 bindings size: na > C compiler: gcc > C compiler absolute: /home/dbremner/bin/gcc > C++ compiler: g++ >C++ compiler absolute: /home/dbremner/bin/g++ > Fortran77 compiler: g77 > Fortran77 compiler abs: /opt/sfw/gcc-2/bin/g77 > Fortran90 compiler: f95 > Fortran90 compiler abs: /opt/SUNWspro/bin/f95 > C profiling: yes >C++ profiling: yes > Fortran77 profiling: yes > Fortran90 profiling: no > C++ exceptions: no > Thread support: solaris (mpi: no, progress: no) > Internal debug support: no > MPI parameter check: runtime > Memory profiling support: no > Memory debugging support: no > libltdl support: yes > mpirun default --prefix: no >MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2) >MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2) >MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2) >MCA timer: solaris (MCA v1.0, API v1.0, Component v1.2) >MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0) >MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0) > MCA coll: basic (MCA v1.0, API v1.0, Component v1.2) > MCA coll: self (MCA v1.0, API v1.0, Component v1.2) > MCA coll: sm (MCA v1.0, API v1.0, Component v1.2) > MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2) > MCA io: romio (MCA v1.0, API v1.0, Component v1.2) >MCA mpool: gm (MCA v1.0, API v1.0, Component v1.2) >MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2) >MCA mpool: udapl (MCA v1.0, API v1.0, Component v1.2) > MCA pml: cm (MCA v1.0, API v1.0, Component v1.2) > MCA pml: dr (MCA v1.0, API v1.0, Component v1.2) > MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2) > MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2) > MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2) > MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2) > MCA btl: gm (MCA v1.0, API v1.0.1, Component v1.2) >
[OMPI users] openmpi 1.2b1(r12657)
I am running the benchmark b_eff on a multiprocessor opteron-based system. The benchmark measures throughput, and it runs fine over tcp/ip and myrinet on clusters of 2 and 4 cores. When I run the application on an 8-core system over 2 cpus the run is fine. When I run it over, say, 4 or more with /opt/ompi/bin/mpirun -np 4 -machinefile myh -mca btl tcp,self b_eff I sometimes get an error such as ERROR - invalid message content after MPI_Alltoallv - myrank=1 i_rep=0 i_msg=15 i_pat=14 i_sr=1 i_loop=5 Msglng=71468 buf=(16 0)!=(16 31) But not always. I searched the FAQs but could not find an entry with a similar error. Any idea? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] crashed openmpi job fails to clean up ....
A job which crashes with a floating point underflow (or any IEEE floating point exception) fails to clean up after itself using openmpi-1.3a1r12695. Nodes with copies of slaves are sitting there ... I also noticed that orted processes are left behind by other crashed jobs. Should I expect this? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] SEGV in ompi_coll_tuned_reduce_generic (1.2b4r13488)
When running either over myrinet or over gigabit, one of our codes (Gadget2) fails predictably with the following error message. From the back trace it looks as if the SEGV is in ompi_coll_tuned_reduce_generic. Have there been similar reports and/or is there a fix for this? Lydia Heck [m2042:08002] *** Process received signal *** [m2042:08002] Signal: Segmentation Fault (11) [m2042:08002] Signal code: Address not mapped (1) [m2042:08002] Failing at address: 92 /opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:opal_backtrace_print+0x26 /opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:0xc3874 /lib/amd64/libc.so.1:0xcb686 /lib/amd64/libc.so.1:0xc0a52 /opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_generic+0x11b [ Signal 11 (SEGV)] /opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_binary+0x162 /opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_dec_fixed+0x28d /opt/OMPI/ompi-1.2b4r13488/lib/libmpi.so.0.0.0:PMPI_Reduce+0x3f6 /data/4/nil/tak_gadget/gadget2/P-Gadget2:gravity_tree+0x146c /data/4/nil/tak_gadget/gadget2/P-Gadget2:compute_accelerations+0x7e /data/4/nil/tak_gadget/gadget2/P-Gadget2:run+0xa5 /data/4/nil/tak_gadget/gadget2/P-Gadget2:main+0x22f /data/4/nil/tak_gadget/gadget2/P-Gadget2:0x7c3c [m2042:08002] *** End of error message *** [m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275 [m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 793 [m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 mpirun noticed that job rank 2 with PID 0 on node m2043 exited on signal 11 (Segmentation Fault). [m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188 [m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 828 -- mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS. -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
[OMPI users] MPI reduce ...
I was asked by a user whether MPI allreduce recognizes when process ids are situated on the same node, so that the communication can then proceed over shared memory rather than over the slower network communication channels. Would any of the openmpi developers be able to comment on that question and on the situation in openmpi? Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM, DH1 3LE United Kingdom e-mail: lydia.h...@durham.ac.uk Tel.: + 44 191 - 334 3628 Fax.: + 44 191 - 334 3645 ___
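For context on the mechanics: Open MPI's collective operations, including allreduce, are implemented on top of point-to-point transfers that use whatever BTLs are active, so traffic between ranks on the same node already goes over the sm (shared memory) BTL when it is enabled; hierarchical, topology-aware collective components also exist but are not selected by default in the versions discussed here. A sketch of how to check what is available and make sure shared memory is among the selected transports (component lists vary by version, and a.out is a placeholder):

   # list the collective and byte-transfer components ompi_info knows about
   ompi_info | grep " coll:"
   ompi_info | grep " btl:"

   # include sm explicitly alongside the network transport
   mpirun --mca btl tcp,sm,self -np 8 ./a.out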