Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Lydia Heck
I have experimented a bit more and found that if I set OMPI_MCA_plm_rsh_num_concurrent=1024 a job with more than 2,500 processes will start and run. However, when I searched the open-mpi web site for the variable I could not find any mention of it. Best wishes, Lydia Heck 15. jobs
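
For reference, the setting described above can be applied through the environment or on the mpirun command line. This is only a sketch: the process count (2600) and the program name (./my_app) are placeholders that do not come from the original message, so the launch and query commands are left commented out.

```shell
# Open MPI reads MCA parameters from environment variables prefixed with OMPI_MCA_.
export OMPI_MCA_plm_rsh_num_concurrent=1024
echo "plm_rsh_num_concurrent=${OMPI_MCA_plm_rsh_num_concurrent}"

# Equivalent command-line form (placeholder process count and binary):
# mpirun --mca plm_rsh_num_concurrent 1024 -np 2600 ./my_app

# The parameters of the rsh launcher can be listed on an installed system with:
# ompi_info --param plm rsh
```

The environment form is convenient inside batch scripts; the command-line form keeps the setting visible next to the job it affects.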

[OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Lydia Heck
count to more than 2700 cores and a job with 2,500 processes does not start. Is there any advice? Best wishes, Lydia Heck -- Dr E L Heck Senior Computer Manager University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics

[OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31

2010-07-15 Thread Lydia Heck
o the more recent versions. If the developers are interested, I could ask the user to prepare the code for you to have a look at the problem, which appears to be in MPI_Alloc_mem. Best wishes, Lydia Heck -- Dr E L Heck University of Durham Institut

[OMPI users] gadget-3 locks up using openmpi and infiniband (or myrinet)

2010-05-16 Thread Lydia Heck
One of the big cosmology codes is Gadget-3 (Springel et al). The code uses MPI for interprocess communications. At the ICC in Durham we use OpenMPI and have been using it for ~3 years. At the ICC Gadget-3 is one of the major research codes and we have been running it since it was written and

[OMPI users] using the carto facility

2009-01-05 Thread Lydia Heck
I was advised for a benchmark to use the OPAL carto option to assign specific cores to a job. I searched the web for an example but have only found one set of man pages, which is rather cryptic and needs the knowledge of the programmer rather than an end user. Has anybody out there used this opt

[OMPI users] mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error

2008-02-02 Thread Lydia Heck
In one of our big runs (512 cpus) the code fails and produces on a list of nodes the following type of error: I have searched the FAQs but could not find an answer there. There are difficulties getting the code to run because of its sheer size but there is no other indication of the problem. Doe

Re: [OMPI users] users Digest, Vol 787, Issue 1

2008-01-11 Thread Lydia Heck
I users] how to select a specific network > To: Open MPI Users > Message-ID: <2008023416.gq11...@ltw.loris.tv> > Content-Type: text/plain; charset=iso-8859-1 > > On Fri, Jan 11, 2008 at 11:36:23AM +, Lydia Heck wrote: > > > I have a setup which contains one set

Re: [OMPI users] how to select a specific network

2008-01-11 Thread Lydia Heck
I should have added that the two networks are not routable, and that they are private class B. On Fri, 11 Jan 2008, Lydia Heck wrote: > > I have a setup which contains one set of machines > with one nge and one e1000g network and another set of machines > with two e1000g networks configured. I

[OMPI users] how to select a specific network

2008-01-11 Thread Lydia Heck
I have a setup which contains one set of machines with one nge and one e1000g network and another set of machines with two e1000g networks configured. I am planning a large run where all these computers will be occupied with one job and the mpi communication should only go over one specific network which is c

[OMPI users] errno=131 ?

2007-11-18 Thread Lydia Heck
One of our programs has got stuck - it has not terminated - with the error messages: mca_btl_tcp_frag_send: writev failed with errno=131. Searching the openmpi web site did not result in a positive hit. What does it mean? I am running 1.2.1r14096 Lydia

[OMPI users] MPI reduce ...

2007-02-23 Thread Lydia Heck
I was asked by a user if the MPI allreduce recognizes when process ids are situated on the same node so that the communication can then proceed over shared memory rather than over the slower networking communication channels. Would any of the openmpi developers be able to comment on that question

[OMPI users] SEGV in ompi_coll_tuned_reduce_generic (1.2b4r13488)

2007-02-14 Thread Lydia Heck
When run over either myrinet or gigabit, one of our codes (Gadget2) fails predictably with the following error message. From the back trace it looks as if the SEGV is in ompi_coll_tuned_reduce_generic. Have there been similar reports and/or is there a fix for this? Lydia H

[OMPI users] crashed openmpi job fails to clean up ....

2006-12-19 Thread Lydia Heck
A job which crashes with a floating point underflow (or any IEEE floating point exception) fails to clean up after itself using openmpi-1.3a1r12695 .. Nodes with copies of slaves are sitting there ... I also noticed that orted are left behind on other crashed jobs .. Should I have to expect t

[OMPI users] openmpi 1.2b1(r12657)

2006-12-10 Thread Lydia Heck
I am running the benchmark b_eff on a multiprocessor opteron based system. The benchmark measures throughput. And the benchmark runs fine over tcp/ip and myrinet on cluster of 2 a 4 cores. When I run the application on an 8-core system over 2 cpus the run is fine. When I run it over say 4 or more I

Re: [OMPI users] users Digest, Vol 443, Issue 1

2006-11-26 Thread Lydia Heck
You have to make sure that the path to the gm libraries is fully set at runtime of your code: LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/xx/gm/lib and of course xx stands for the location of the directory where gm is installed. Also for better performance you might want to use the sun compilers fo
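
A minimal sketch of the runtime library-path setting described above, assuming the existing LD_LIBRARY_PATH value should be kept and the gm directory appended to it. GM_PREFIX and its default of /opt/gm are hypothetical stand-ins for the "xx" placeholder in the message; substitute the real install location.

```shell
# GM_PREFIX is a hypothetical install prefix; replace it with the real location of gm.
GM_PREFIX="${GM_PREFIX:-/opt/gm}"

# Append the gm library directory to any existing search path.
# The ${VAR:+...} form avoids a stray leading colon when LD_LIBRARY_PATH is empty.
LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${GM_PREFIX}/lib"
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
```

The variable has to be set in the environment that the MPI processes actually inherit on every node, not only in the shell that launches the job.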

Re: [OMPI users] problem building openmpi-1.2b1r12657

2006-11-25 Thread Lydia Heck
My apologies. This was a red herring. It turned out that I had filled the disk. It so happened that the same error was repeated several times, even after reconfiguring. Lydia On Sat, 25 Nov 2006, Lydia Heck wrote: > > The configuration of openmpi-1.2b1r12657 goes fine. > When I try

[OMPI users] problem building openmpi-1.2b1r12657

2006-11-25 Thread Lydia Heck
The configuration of openmpi-1.2b1r12657 goes fine. When I try to build I get, somewhere into the build, the following error message. DEPDIR=.deps depmode=none /bin/bash ../../../../config/depcomp \ /bin/bash ../../../../libtool --tag=CC --mode=compile /opt/studio11/SUNWspro/bin/cc -DHAVE_CONF

Re: [OMPI users] openmpi - mx - solaris and Gadget2 - add on

2006-11-24 Thread Lydia Heck
I saved two cores, which might be of interest. However they are so large, that I cannot attach them to any email. But I am very willing to submit them, if requested. Lydia -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Ce

Re: [OMPI users] openmpi - mx - solaris and Gadget2

2006-11-23 Thread Lydia Heck
/Gadget2-multidomain/Gadget2:main+0x191 /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc *** End of error message *** mv: cannot access ./restart.20 31 additional processes aborted (not shown) m2001(27) > On Thu, 23 Nov 2006, Lydia Heck wrote: > > Gadget2 - I cannot

[OMPI users] openmpi - mx - solaris and Gadget2

2006-11-23 Thread Lydia Heck
Gadget2 - I cannot attach it because it is not publicly available, runs perfectly fine on any number of processes on systems such as Solaris 10 - Sun CT6 gigabit, SUN CT5 and myrinet gm, IBM regatta .. Sorry to be so expansive ... When I run the code on 32 CPUs on openmpi, mx using the studio11

[OMPI users] openmpi, mx

2006-11-22 Thread Lydia Heck
I have - again - successfully built and installed mx and openmpi and I can run 64 and 128 CPU jobs on a 256 CPU cluster. Version of openmpi is 1.2b1; compiler used: studio11. The code is a benchmark, b_eff, which usually runs fine - I have used it extensively for benchmarking. When I try 192 CPUs I

Re: [OMPI users] myrinet mx and openmpi using solaris, sun compilers

2006-11-21 Thread Lydia Heck
ports and on each system 3 myrinet ports were open. Lydia On Mon, 20 Nov 2006 users-requ...@open-mpi.org wrote: > > -- > > Message: 2 > Date: Mon, 20 Nov 2006 20:05:22 + (GMT) > From: Lydia Heck > Subject: [OMPI users] myrinet mx and openm

[OMPI users] myrinet mx and openmpi using solaris, sun compilers

2006-11-20 Thread Lydia Heck
I have built the myrinet drivers with gcc or the studio 11 compilers from sun. The following problem appears for both installations. I have tested the myrinet installations using Myricom's own test programs. Then I built open-mpi using the studio11 compilers enabling myrinet. All the library pat

Re: [OMPI users] btl mx : file not found

2006-11-20 Thread Lydia Heck
I have solved this problem myself. The mx drivers are built using the gcc compilers both in 64 and 32 bit. I was trying to build 64-bit openmpi on the sun and I am afraid I overlooked that I had to give the path to the 64-bit gcc libs EXPLICITLY in the build of the openmpi. These libraries were

[OMPI users] btl mx : file not found

2006-11-18 Thread Lydia Heck
I have myricom mx installed and configured and its communications work (using mx commands such as mx_info to check). Then I configured openmpi-1.3a1r12408 with mx and the configuration gave no errors. The build of openmpi was without problems and it installed properly. I can build and link

Re: [OMPI users] users Digest, Vol 411, Issue 2

2006-10-20 Thread Lydia Heck
> Could you try this without threads? We have tried to make the system work > with threads, but our testing has been limited. First thing I would try is > to make sure that we aren't hitting a thread-lock. > > Thanks > Ralph > > > > On 10/20/06 2:11 AM, "

Re: [OMPI users] job fails to terminate

2006-10-20 Thread Lydia Heck
In answer to Ralph's request and question. Indeed the version number was incorrect it should have been openmpi-1.3a1r12121 my configure command is #!/bin/ksh CC="/opt/studio11/SUNWspro/bin/cc" CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include" LDFLAGS="-xarch=amd64a -I/opt/m

[OMPI users] job fails to terminate

2006-10-18 Thread Lydia Heck
I have recently installed openmpi 1.3r1212a over tcp and gigabit on a Solaris 10 x86/64 system. The compilation of some test codes: monte (a monte carlo estimate of pi), connectivity, which tests connectivity between processes and nodes, prime, which calculates prime numbers (these test codes are exam

Re: [OMPI users] openmpi 1.3a1r12121 ...

2006-10-18 Thread Lydia Heck
ependency-tracking \ --enable-cxx-exceptions \ --enable-smp-locks \ --enable-mpi-threads \ --enable-progress-threads \ --with-threads=solaris On Tue, 17 Oct 2006, Lydia Heck wrote: > > I know that with 1.3a1 I am looking at a development release. > However I do need t

[OMPI users] openmpi 1.3a1r12121 ...

2006-10-17 Thread Lydia Heck
the same error. Yes, mx is definitely installed, and yes the path to mx is definitely /opt/mx ... Any ideas? Lydia Heck -- Dr E L Heck University of Durham Institute for Computational Cosmology Ogden Centre Department of Physics South Road DURHAM

Re: [OMPI users] sed: command garbled:

2006-09-21 Thread Lydia Heck
My apologies, I forgot to attach the config.log file. On Thu, 21 Sep 2006, Lydia Heck wrote: > > I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11 > compilers and including the mx drivers. I have gone past some hurdles. > However when the configure script n

[OMPI users] sed: command garbled:

2006-09-21 Thread Lydia Heck
I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11 compilers and including the mx drivers. I have gone past some hurdles. However when the configure script nears its end where Makefiles are prepared I get error messages of the form: config.status: creating ompi/mca/osc/rdma/M