[OMPI users] jobs with more than 2,500 processes will not even start

2010-12-14 Thread Lydia Heck


About 9 months ago we had a new installation with a system of 1800 cores, and at 
the time we found that jobs with more than 1028 cores would not start. At the 
time a colleague found that setting


OMPI_MCA_plm_rsh_num_concurrent=256

helped with the problem.

We have now increased our processor count to more than 2700 cores, and a job with 
2,500 processes does not start.


Is there any advice?

Best wishes,

Lydia Heck
--
Dr E L Heck
Senior Computer Manager

University of Durham 
Institute for Computational Cosmology

Ogden Centre
Department of Physics 
South Road


DURHAM, DH1 3LE 
United Kingdom


e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] jobs with more than 2,500 processes will not even start

2010-12-14 Thread Lydia Heck


I have experimented a bit more and found that if I set

OMPI_MCA_plm_rsh_num_concurrent=1024

a job with more than 2,500 processes will start and run.

However, when I searched the open-mpi.org web site for this variable, I could 
not find any documentation for it.
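For anyone finding this later: plm_rsh_num_concurrent caps how many ssh/rsh daemon launches mpirun keeps in flight at once, and the environment variable above is just its OMPI_MCA_ form. A sketch of the usual ways to set it (the mpirun line is illustrative, and 1024 is simply the value that worked here, not a documented threshold):

```shell
# Per-session: the environment form of the MCA parameter.
export OMPI_MCA_plm_rsh_num_concurrent=1024
# Per-run (sketch only):
#   mpirun --mca plm_rsh_num_concurrent 1024 -np 2500 ./a.out
# Persistent form: the same line in ~/.openmpi/mca-params.conf
# (written to ./mca-params.conf here so the example is self-contained):
echo "plm_rsh_num_concurrent = 1024" > ./mca-params.conf
cat ./mca-params.conf
```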


Best wishes,
Lydia Heck







[OMPI users] errno=131 ?

2007-11-18 Thread Lydia Heck

One of our programs has got stuck - it has not terminated -
with the error message:
mca_btl_tcp_frag_send: writev failed with errno=131.

Searching the Open MPI web site did not turn up anything.
What does it mean?

I am running 1.2.1r14096
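A note for the archive: errno numbers are platform-specific, which may be why the search found nothing. On Solaris, 131 is, as far as I recall, ECONNRESET ("Connection reset by peer", i.e. the remote end dropped the TCP connection); the same number means something else entirely on Linux. A quick way to decode it on the machine that produced the message:

```shell
# Decode errno 131 via the local C library's string table; the text
# printed depends on the platform this runs on (ECONNRESET only on
# Solaris, by my reading of its errno.h).
python3 -c 'import os; print("errno 131:", os.strerror(131))'
```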

Lydia


--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] how to select a specific network

2008-01-11 Thread Lydia Heck

I have a setup with one set of machines that have one nge and one e1000g
interface, and another set of machines with two e1000g interfaces configured.
I am planning a large run in which all these computers will be occupied by
one job, and the MPI communication should go over one specific network only,
which runs over e1000g0 on the first set of machines and over e1000g1 on the
second. For obvious reasons I can neither include all of e1000g nor exclude
part of e1000g - if that were even possible.
So I have to include or exclude by IP number range.

Is there an obvious flag - which I have not yet found - to tell
mpirun to use one specific network?

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] how to select a specific network

2008-01-11 Thread Lydia Heck

I should have added that the two networks are not routable,
and that they are private class B.





Re: [OMPI users] users Digest, Vol 787, Issue 1

2008-01-11 Thread Lydia Heck

Hi Adrian,

you guessed right: it is Solaris.

. The user file space is shared, so ~/.openmpi is the same on
all machines.
. I cannot disable the "unwanted" interface because it is carrying
all the other services, such as NIS, NFS etc.

So the only way is to address the network by its IP number range.

The question therefore is: can that be done?
I have looked through the descriptions of the MCA parameters, and I fear I
could not find anything to that effect.

Lydia
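For the record, later Open MPI releases grew exactly this capability; a hedged sketch (the version cutoff is from memory and the subnet below is a stand-in for the real private class-B range, so check what your build accepts):

```shell
# Newer Open MPI (around the 1.5 series onward, from memory) accepts
# CIDR subnets in btl_tcp_if_include/_exclude, so one shared
# mca-params.conf can select the message-passing network on every host,
# whichever interface (e1000g0 or e1000g1) carries it:
#   mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include 172.16.0.0/16 ./a.out
# Persistent form (written to ./mca-params.conf here so the example is
# self-contained; the real location is ~/.openmpi/mca-params.conf):
echo "btl_tcp_if_include = 172.16.0.0/16" > ./mca-params.conf
cat ./mca-params.conf
```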

>
> Date: Fri, 11 Jan 2008 13:34:16 +0100
> From: a...@drcomp.erfurt.thur.de (Adrian Knoth)
> Subject: Re: [OMPI users] how to select a specific network
> To: Open MPI Users 
> Message-ID: <2008023416.gq11...@ltw.loris.tv>
> Content-Type: text/plain; charset=iso-8859-1
>
> On Fri, Jan 11, 2008 at 11:36:23AM +, Lydia Heck wrote:
>
> > I have a setup which contains one set of machines
> > with one nge and one e1000g network and of machines
> > with two e1000g networks configured. I am planning a
>
> Are we talking about shared filesystems or can you place different
> ~/.openmpi/mca-params.confs across different machines? If so, just
> specify the interfaces you want to exclude/include on each machine.
>
> If nothing helps, either shutdown the unnecessary interfaces or use
> interface renaming.
>
> nge sounds like Solaris; unfortunately I'm not familiar with it. Under
> Linux, one would rename either the required or the unwanted interfaces,
> depending on whether you include or exclude.
>
> We have something like this:
>
> adi@amun:~$ ip r s
> 192.168.4.0/24 dev ethmp  proto kernel  scope link  src 192.168.4.130
> 192.168.3.0/24 dev ethmp  proto kernel  scope link  src 192.168.3.130
> 192.168.1.0/24 dev ethsvc  proto kernel  scope link  src 192.168.1.130
> default via 192.168.1.12 dev ethsvc
>
> The "ethmp" is "ethernet message passing", "ethsvc" is "ethernet service
> network". That's more or less the same you want: a dedicated network for
> message passing.
>
> So you would obviously include ethmp in your mca-params.conf file.
>
>
> Under Linux, the tool to rename interfaces is called "nameif", but I
> guess it cannot be used for Solaris (interface names are kernel space,
> and Linux kernel != Solaris kernel).
>
>
> HTH
>
> --
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universit?t Jena, Germany
>
> private: http://adi.thur.de
>
>



[OMPI users] mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error

2008-02-02 Thread Lydia Heck

In one of our big runs (512 CPUs) the code fails and produces the following
type of error on a list of nodes:

I have searched the FAQs but could not find an answer there.
There are difficulties getting the code to run because of its sheer size,
but there is no other indication of the problem.

Does the following error message mean that some of the nodes have given up?


mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error
([361eca8[m2234][0,1,283][m2317, 16][0,)
1Bad address,422(3)
][[
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c:114:mca_btl_tcp
_frag_send]
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c[m22
41][0,1,430][m2140[m2152][0,1,150][mca_btl_tcp_frag_send: writev error (3c759a8,
16)
Bad address(3)


Lydia



[OMPI users] gadget-3 locks up using openmpi and infiniband (or myrinet)

2010-05-16 Thread Lydia Heck




One of the big cosmology codes is Gadget-3 (Springel et al.).
The code uses MPI for interprocess communication. At the ICC in Durham we have 
been using Open MPI for about three years.
Gadget-3 is one of the ICC's major research codes; we have been running it 
since it was written, and we have observed something very worrying:


When running over gigabit using -mca btl tcp,self,sm  the code runs alright, 
which is good as the largest part of our cluster is over gigabit, and as 
Gadget-3 scales rather well, the penalty for running over gigabit is not 
prohibitive.
We also have a myrinet cluster, and there larger runs freeze. However, as 
the gigabit cluster was available, we have not really investigated this until 
just now.


We currently have access to an infiniband cluster, and we found the following: 
in a specific set of blocked sendrecv sections the code seems to communicate in 
pairs until in the end only one pair of processes is left, and there it 
deadlocks. For that pair the processes have set up communications: they know 
each other's IDs and they know what datatype to communicate, but they never 
communicate the data. The precise point of failure cannot be pinned down, i.e. 
in consecutive runs the code does not freeze at the same point in the run. This 
is using Open MPI, and it has propagated over different versions of Open MPI 
(judging from our myrinet experience).
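A hypothetical sketch of the kind of pairing schedule described (this is my reading of the symptom, not Gadget-3's actual code): in stage s each rank exchanges with partner rank XOR s, so every rank sits in exactly one pair per stage, and a hang with "one pair left" means a single posted exchange never completed:

```shell
# Print the pairwise exchange schedule for 8 ranks: in stage s, rank r
# talks to partner r^s (bitwise XOR), so each stage is a perfect pairing
# and completes only when every pair's exchange completes.
nranks=8
stage=1
while [ "$stage" -lt "$nranks" ]; do
  line=""
  rank=0
  while [ "$rank" -lt "$nranks" ]; do
    partner=$(( rank ^ stage ))
    line="$line ${rank}<->${partner}"
    rank=$(( rank + 1 ))
  done
  echo "stage $stage:$line"
  stage=$(( stage + 1 ))
done
```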


I should mention that the communication on both the myrinet cluster and the 
infiniband cluster does work properly, as runs of other codes (castep, b_eff) 
show.


So my question(s) is (are): has anybody had similar experiences and/or would 
anybody have an idea why this could happen and/or what we could do about it?


Lydia




[OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31

2010-07-15 Thread Lydia Heck


We are running Sun's build of Open MPI 1.3.3r21324-ct8.2-b09b-r31
(HPC8.2), and one code that runs perfectly fine under
HPC8.1 (Open MPI 1.3r19845-ct8.1-b06b-r21) and earlier fails with



[oberon:08454] *** Process received signal ***
[oberon:08454] Signal: Segmentation Fault (11)
[oberon:08454] Signal code: Address not mapped (1)
[oberon:08454] Failing at address: 0
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libopen-pal.so.0.0.0:0x4b89e
/lib/amd64/libc.so.1:0xd0f36
/lib/amd64/libc.so.1:0xc5a72
0x0 [ Signal 11 (SEGV)]
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Alloc_mem+0x7f
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Sendrecv_replace+0x31e
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi_f77.so.0.0.0:PMPI_SENDRECV_REPLACE+0x94
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:mpi_cyclic_transfer_+0xd9
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:cycle_particles_and_interpolate_+0x94b
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:interpolate_field_+0xc30
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:MAIN_+0xe68
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:main+0x3d
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:0x62ac
[oberon:08454] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 8454 on node oberon exited on 
signal 11 (Segmentation Fault).




I have not tried to build a newer Open MPI, so I do not know whether the 
problem propagates into more recent versions.



If the developers are interested, I could ask the user to prepare the code for 
you to have a look at the problem, which looks to be in MPI_Alloc_mem.


Best wishes,
Lydia Heck




[OMPI users] using the carto facility

2009-01-05 Thread Lydia Heck


I was advised for a benchmark to use the OPAL carto option to
assign specific cores to a job. I searched the web for an example
but have only found one set of man pages, which is rather cryptic
and assumes the knowledge of a programmer rather than an end user.

Has anybody out there used this option, and if so, would you be prepared
to share an example which could be adapted for a shared-memory system
with zillions of cores?

Thanks.

Lydia




[OMPI users] sed: command garbled:

2006-09-21 Thread Lydia Heck

I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11
compilers and including the mx drivers. I have gone past some hurdles.
However when the configure script nears its end where Makefiles are prepared
I get error messages of the form:

config.status: creating ompi/mca/osc/rdma/Makefile
sed: command garbled: s,@OMPI_CXX_ABSOLUTE@,/opt/studio11/SUNWspro/bin/CC
sed: command garbled: s,@OMPI_F90_ABSOLUTE@,/opt/studio11/SUNWspro/bin/f95
sed: command garbled: s,@OMPI_CC_ABSOLUTE@,/opt/studio11/SUNWspro/bin/cc
config.status: creating ompi/mca/pml/cm/Makefile


This is with the system's sed command.

When I use the GNU sed command instead, I get:

sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command
sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command
config.status: creating orte/Makefile
sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command
sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command
config.status: creating orte/include/Makefile
sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command
sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command
config.status: creating orte/etc/Makefile
sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command
sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command
config.status: creating orte/tools/orted/Makefile
sed: file ./confstatlmaOTV/subs-3.sed line 31: unterminated `s' command
sed: file ./confstatlmaOTV/subs-4.sed line 4: unterminated `s' command



Is there anything I have overlooked?

Lydia
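For what it's worth, the garbled commands quoted above all stop right after the replacement text, with no closing delimiter on the s command, which is precisely what both seds are complaining about. A minimal illustration (using the same path that appears in the configure output):

```shell
# A well-formed sed substitution using ',' as the delimiter needs the
# trailing ',' to terminate the s command:
echo '@OMPI_CC_ABSOLUTE@' | sed 's,@OMPI_CC_ABSOLUTE@,/opt/studio11/SUNWspro/bin/cc,'
# The commands in the error output lack that final ',' - hence
# "command garbled" from Solaris sed and "unterminated s command"
# from GNU sed.
```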



Re: [OMPI users] sed: command garbled:

2006-09-21 Thread Lydia Heck


My apologies I forgot to attach the config.log file.




openmpi-config.log.gz
Description: config.log of attempted configuration


[OMPI users] openmpi 1.3a1r12121 ...

2006-10-17 Thread Lydia Heck

I know that with 1.3a1 I am looking at a development release.
However, I do need the SGE (GridEngine) support, and I could not find
a download for a stable (or any other) 1.2 release.

In my configuration I use

--with-mx=/opt/mx (where the MX software is installed); also
--with-mx-libdir=/opt/mx/lib64, because I build for 64 bit only.

Then I use the Sun Studio11 compilers and the configuration fails

with


--- MCA component btl:mx (m4 configuration macro)
checking for MCA component btl:mx compile mode... dso
checking myriexpress.h usability... no
checking myriexpress.h presence... no
checking for myriexpress.h... no


configure: error: MX support requested but not found.  Aborting


I have tried everything: entering -I/opt/mx/include under CFLAGS etc.,
and modifying

--with-mx=/opt/mx/include 

Each time the configure fails with the same error.

Yes, mx is definitely installed, and yes, the path to mx is definitely
/opt/mx ...

Any ideas?

Lydia Heck




Re: [OMPI users] openmpi 1.3a1r12121 ...

2006-10-18 Thread Lydia Heck

I have attached the config.log file.

Here are also the instructions which I included in the configuration.

In previous configuration attempts I had --with-mx=/opt/mx, where
/opt/mx is the top-level directory under which mx is installed.

The result of the configuration attempt was the same, with the same error
messages.

#!/bin/ksh
  CC="/opt/studio11/SUNWspro/bin/cc"
  CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
  LDFLAGS="-xarch=amd64a -I/opt/mx/include -L/opt/mx/lib64 \
-L/opt/SUNWsge/lib/sol-amd64 -R/opt/mx/lib64 -R/opt/SUNWsge/lib/sol-amd64"
  CXX="/opt/studio11/SUNWspro/bin/CC"
  CXXFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
  F77="/opt/studio11/SUNWspro/bin/f95"
  FFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
  FC="/opt/studio11/SUNWspro/bin/f95"
  FCFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"

PATH=/opt/studio11/SUNWspro/bin:/opt/csw/bin:/opt/sfw/bin:/usr/sfw/bin:"$PATH":/usr/ucb
export CC CFLAGS LDFLAGS CXX CXXFLAGS F77 FFLAGS FC FCFLAGS PATH

./configure --prefix=/opt/openMPI --with-mx=/opt/mx/lib64  \
  --with-mx-libdir=/opt/mx/lib64 \
 --with-wrapper-cflags=-xarch=amd64a \
  --with-wrapper-cxxflags=-xarch=amd64a \
  --with-wrapper-fflags=-xarch=amd64a \
  --with-wrapper-fcflags=-xarch=amd64a \
  --with-wrapper-ldflags=-xarch=amd64a \
  --enable-mpirun-prefix-by-default \
  --enable-dependency-tracking \
  --enable-cxx-exceptions  \
  --enable-smp-locks  \
  --enable-mpi-threads   \
  --enable-progress-threads \
  --with-threads=solaris




openmpi-config.log.gz
Description: config.log of the configuration with mx 


[OMPI users] job fails to terminate

2006-10-18 Thread Lydia Heck

I have recently installed openmpi 1.3r1212a over tcp and gigabit
on a Solaris 10 x86/64 system.

Some test codes -
monte (a Monte Carlo estimate of pi),
connectivity (which tests connectivity between processes and nodes), and
prime (which calculates prime numbers), all examples bundled with Sun HPC -

compile fine using the Open MPI versions of mpicc, mpif95 and mpic++.

Sometimes the jobs work fine, but most of the time they freeze,
leaving zombies behind.

my run time command is

mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 \
monte

and I get as output
oberon(209) > mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl
tcp,self -np 14 monte
Monte-Carlo estimate of pi by   14 processes is 3.141503.

with the cursor hanging.

The process table shows

oberon# ps -eaf | grep dph0elh
 dph0elh  9583  7445   7 17:45:01 pts/26  9:22 mpirun --hostfile my-hosts
-mca pls_rsh_agent rsh --mca btl tcp,self -np 14 mon
 dph0elh  9595  9588   0- ?   0:02 
 dph0elh  9588 1   7 17:45:01 ??  9:03 orted --bootproxy 1 --name
0.0.1 --num_procs 5 --vpid_start 0 --nodename oberon
 dph0elh  7445  6924   0 17:01:38 pts/26  0:00 -tcsh
root  9656  4151   0 18:01:31 pts/36  0:00 grep dph0elh
 dph0elh  9593  9588   0- ?   0:02 


One of the nodes offers 8 CPUs; the other nodes in the hostfile offer 2,
for a total of 14 CPUs. As you can see from the command line,
I use --mca btl tcp,self.

There are no other interconnects.

I could not find any entry in the FAQs, except for the advice on using
--mca btl tcp,self.
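Not a fix, but when a run hangs like this, the surviving launch daemons and zombies need clearing out before the next attempt; a blunt sketch (the kill line is commented out because it is destructive):

```shell
# List any Open MPI launch daemons still alive for the current user:
pgrep -l orted || echo "no orted daemons running"
# If any survive a hung run, kill them before retrying (destructive,
# hence left commented out):
#   pkill -u "$USER" orted
```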






Re: [OMPI users] job fails to terminate

2006-10-20 Thread Lydia Heck

In answer to Ralph's request and question:

Indeed the version number was incorrect; it should have been

openmpi-1.3a1r12121

my configure command is

#!/bin/ksh
  CC="/opt/studio11/SUNWspro/bin/cc"
  CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
  LDFLAGS="-xarch=amd64a -I/opt/mx/include -L/opt/SUNWsge/lib/sol-amd64
-R/opt/mx/lib64 -R/opt/SUNWsge/lib/sol-amd64"
  CXX="/opt/studio11/SUNWspro/bin/CC"
  CXXFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"
  F77="/opt/studio11/SUNWspro/bin/f95"
  FFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"
  FC="/opt/studio11/SUNWspro/bin/f95"
  FCFLAGS="-xarch=amd64a -I/opt/SUNWsge/include"

PATH=/opt/studio11/SUNWspro/bin:/opt/csw/bin:/opt/sfw/bin:/usr/sfw/bin:"$PATH":/usr/ucb
export CC CFLAGS LDFLAGS CXX CXXFLAGS F77 FFLAGS FC FCFLAGS PATH

./configure --prefix=/opt/openMPI-GB \
 --with-wrapper-cflags=-xarch=amd64a \
  --with-wrapper-cxxflags=-xarch=amd64a \
  --with-wrapper-fflags=-xarch=amd64a \
  --with-wrapper-fcflags=-xarch=amd64a \
  --with-wrapper-ldflags=-xarch=amd64a \
  --enable-mpirun-prefix-by-default \
  --enable-dependency-tracking \
  --enable-cxx-exceptions  \
  --enable-smp-locks  \
  --enable-mpi-threads   \
  --enable-progress-threads \
  --with-threads=solaris


Lydia



Re: [OMPI users] users Digest, Vol 411, Issue 2

2006-10-20 Thread Lydia Heck

Hi Ralph,

which of the thread options should I remove:

> >   --enable-mpi-threads   \
> >   --enable-progress-threads \
> >   --with-threads=solaris

all of them?

Lydia

>
> --
>
> Message: 1
> Date: Fri, 20 Oct 2006 06:30:36 -0600
> From: Ralph H Castain 
> Subject: Re: [OMPI users] job fails to terminate
> To: "Open MPI Users " 
> Message-ID: 
> Content-Type: text/plain; charset="US-ASCII"
>
> Hi Lydia
>
> Thanks - that does help!
>
> Could you try this without threads? We have tried to make the system work
> with threads, but our testing has been limited. First thing I would try is
> to make sure that we aren't hitting a thread-lock.
>
> Thanks
> Ralph
>
>
>



[OMPI users] btl mx : file not found

2006-11-18 Thread Lydia Heck


I have Myricom MX installed and configured, and its communications work
(checked with mx commands such as mx_info).
Then I configured openmpi-1.3a1r12408 with mx, and the configuration
gave no errors. The build of Open MPI went without problems and it
installed properly. I can build and link a program - and ldd shows the
Open MPI libraries linked accordingly.

To run applications I set the LD_LIBRARY_PATH and the PATH correctly,
but the command


ompi_info | grep mx
[m2001:12844] mca: base: component_find: unable to open mtl mx: file not found
(ignored)
[m2001:12844] mca: base: component_find: unable to open btl mx: file not found
(ignored)
[m2001:12844] mca: base: component_find: unable to open mtl mx: file not found
(ignored)

And indeed the job does not run if I give the option

-mca btl mx

Any idea why this should happen?

Lydia
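A note on the symptom (the resolution appears in the follow-up below this message is archived with): "file not found" from component_find usually means the dlopen() of the component library failed, and in my experience the missing file is often one of the component's own dependencies rather than the component itself. ldd shows which; the component path below is illustrative for this kind of install:

```shell
# Inspect the MX BTL component's dynamic dependencies; any line saying
# "not found" names a library the runtime linker cannot resolve:
#   ldd $PREFIX/lib/openmpi/mca_btl_mx.so | grep 'not found'
# The same check works on any dynamic binary; /bin/sh stands in here so
# the sketch is runnable (it should resolve everything, count 0):
ldd /bin/sh | grep -c 'not found' || true
```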




openmpi-1.3a1r12408-config.log.gz
Description: config.log of openmpi-1.3a1r1240 with myrinet mx


Re: [OMPI users] btl mx : file not found

2006-11-20 Thread Lydia Heck

I have solved this problem myself.

The mx drivers are built using the gcc compilers, in both 64 and 32 bit.

I was trying to build 64-bit Open MPI on the Sun, and I am afraid I overlooked
that I had to give the path to the 64-bit gcc libraries EXPLICITLY in the
build of Open MPI. These libraries were required for the mx libraries to be
linked in correctly; otherwise the environment would only find the 32-bit
ones, which led to the mx environment not being fully configured.

Lydia




[OMPI users] myrinet mx and openmpi using solaris, sun compilers

2006-11-20 Thread Lydia Heck

I have built the myrinet drivers with gcc or the studio 11 compilers from sun.
The following problem appears for both installations.

I have tested the myrinet installations using myricoms own test programs.

Then I build open-mpi using the studio11 compilers enabling myrinet.

All the library paths are correctly set, and I can run my C test program
successfully if I choose the number of CPUs to be equal to the number of
nodes, i.e. one process instance per node!

Each node has 4 CPUs.

If I now request more CPUs for the run than there are nodes, I get an error
message which clearly indicates that Open MPI cannot communicate over more
than one channel on the myrinet card. However, I should be able to
communicate over at least 4 channels - colleagues of mine are doing that
using MPICH and the same type of myrinet card.

Any idea why this should happen?

the hostfile looks like:

m2009 slots=4
m2010 slots=4


but it will provide the same error if the hosts file is

m2009
m2010

ompi_info | grep mx
2001(128) > ompi_info | grep mx
 MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2)
 MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2)
m2009(160) > /opt/mx/bin/mx_endpoint_info
1 Myrinet board installed.
The MX driver is configured to support up to 4 endpoints on 4 boards.
===
Board #0:
Endpoint PID Command Info
   15039
0   15544
There are currently 1 regular endpoint open




m2001(120) > mpirun -np 6 -hostfile hostsfile -mca btl mx,self  b_eff
--
Process 0.1.0 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.2 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.4 is unable to reach 0.1.4 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.5 is unable to reach 0.1.4 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.3 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  --
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
--

Re: [OMPI users] myrinet mx and openmpi using solaris, sun compilers

2006-11-21 Thread Lydia Heck

Thank you very much.

I tried

mpirun -np 6 -machinefile ./myh -mca pml cm ./b_eff

and to amuse you

 mpirun -np 6 -machinefile ./myh -mca btl mx,sm,self ./b_eff

with myh containing two host names

and both commands went swimmingly.

To make absolutely sure, I checked the usage of the myrinet ports
and on each system 3 myrinet ports were open.

Lydia

On Mon, 20 Nov 2006 users-requ...@open-mpi.org wrote:
>
> --
>
> Message: 2
> Date: Mon, 20 Nov 2006 20:05:22 + (GMT)
> From: Lydia Heck 
> Subject: [OMPI users] myrinet mx and openmpi using solaris, sun
>   compilers
> To: us...@open-mpi.org
>
> [...]

[OMPI users] openmpi, mx

2006-11-22 Thread Lydia Heck

I have - again - successfully built and installed
mx and openmpi, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
The version of openmpi is 1.2b1.

compiler used: studio11

The code is the benchmark b_eff, which usually runs fine - I have used it
extensively for benchmarking.

When I try 192 CPUs I get
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
 ...
..
..

The myrinet ports have been opened and the job is running,
as one of the nodes shows:

 ps -eaf | grep dph0elh
 dph0elh  1068 1   0 20:40:00 ??  0:00 /opt/ompi/bin/orted
--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
root  1110  1106   0 20:43:46 pts/4   0:00 grep dph0elh
 dph0elh  1070  1068   0 20:40:02 ??  0:00 ../b_eff
 dph0elh  1074  1068   0 20:40:02 ??  0:00 ../b_eff
 dph0elh  1072  1068   0 20:40:02 ??  0:00 ../b_eff
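A side note on the error code: errno 24 on Solaris is EMFILE, i.e. the
mpirun process has run out of file descriptors, which is plausible once
roughly 192 TCP out-of-band connections are being accepted on the head node.
A minimal check of the per-process limit (the value 4096 below is only an
illustrative target, not a recommendation for this cluster):

```shell
# Show the current soft limit on open file descriptors; mpirun needs roughly
# one socket per remote daemon/process for its out-of-band TCP channel.
soft_limit=$(ulimit -n)
echo "current fd soft limit: $soft_limit"
# Raising it (up to the hard limit) before launching mpirun may help, e.g.:
#   ulimit -n 4096
```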

Any idea?

Lydia


--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] openmpi - mx - solaris and Gadget2

2006-11-23 Thread Lydia Heck

Gadget2 - I cannot attach it because it is not publicly available - runs
perfectly fine on any number of processes on systems such as Solaris 10 -
Sun CT6 gigabit, Sun CT5 with myrinet gm, IBM Regatta ..

Sorry to be so expansive ...

When I run the code on 32 CPUs on openmpi, mx using the studio11 compilers
on a solaris x64 system the code works fine, until about the end, when
it fails to write all the restart files.

When I run the code on 64 CPUs it fails with an error message which is

Topnodes=218193 costlimit=0.0890015 countlimit=428.229
Before=44417
After=46281
NTopleaves= 40496  NTopnodes=46281 (space for 347252)
desired memory imbalance=2.83425  (limit=100719, needed=114185)
Note: the domain decomposition is suboptimum because the ceiling for
memory-imbalance is reached
work-load balance=1.28529   memory-balance=1.01948
exchange of 0002589387 particles
Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:5192cbd0
/opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
/opt/ompi/lib/libopal.so.0.0.0:0x99df5
/lib/amd64/libc.so.1:0xcb276
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116
/opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b
/opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
63 additional processes aborted (not shown)
m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm
./Gadget2 param.txt

As this is one of our predominant production codes, I need to make sure
that it is running on any system which I install. Any idea would be welcome.

Lydia



--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] openmpi - mx - solaris and Gadget2

2006-11-23 Thread Lydia Heck

The same run on 32 CPUs almost completes: it starts to write the 32 restart
files and then fails with the same problem:

Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:33
/opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
/opt/ompi/lib/libopal.so.0.0.0:0x99df5
/lib/amd64/libc.so.1:0xcb276
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:0x102c7 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0x3d
/opt/mx/lib/amd64/libmyriexpress.so:mx__test_common+0x22
/opt/mx/lib/amd64/libmyriexpress.so:mx_test+0x37
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_send+0x288
/opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_send+0x3fc
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_sendrecv_actual_localcompleted+0x85
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_recursivedoubling+0x1a3
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_dec_fixed+0x44
/opt/ompi/lib/libmpi.so.0.0.0:MPI_Barrier+0x9d
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:restart+0x9a0
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x219
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
mv: cannot access ./restart.20
31 additional processes aborted (not shown)
m2001(27) >



On Thu, 23 Nov 2006, Lydia Heck wrote:

>
> [...]
>

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] openmpi - mx - solaris and Gadget2 - add on

2006-11-24 Thread Lydia Heck

I saved two cores, which might be of interest. However, they
are so large that I cannot attach them to any email. But
I am very willing to submit them if requested.

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] problem building openmpi-1.2b1r12657

2006-11-25 Thread Lydia Heck

The configuration of openmpi-1.2b1r12657 goes fine.
When I try to build, I get, somewhere into the build, the following
error message.


DEPDIR=.deps depmode=none /bin/bash ../../../../config/depcomp \
/bin/bash ../../../../libtool --tag=CC --mode=compile
/opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.
./../../../opal/include -I../../../../orte/include -I../../../../ompi/include
-I../../../../ompi/include   -I..
/../../..-DNDEBUG -g -O -xtarget=opteron -xarch=amd64 -mt -c -o
common_sm_mmap.lo common_sm_mmap.c
libtool: compile:  /opt/studio11/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I.
-I../../../../opal/include -I../../../
../orte/include -I../../../../ompi/include -I../../../../ompi/include
-I../../../.. -DNDEBUG -g -O -xtarget=opt
eron -xarch=amd64 -mt -c common_sm_mmap.c  -KPIC -DPIC -o .libs/common_sm_mmap.o
Assembler: common_sm_mmap.c
"/tmp/IAAztaqgp", line 11799 : Trouble closing elf file
cc: ube failed for common_sm_mmap.c
gmake[2]: *** [common_sm_mmap.lo] Error 1
gmake[2]: Leaving directory
`/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi/mca/common/sm'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/hpcconsole-1/SOFTWARE/openmpi-1.2b1r12657/ompi'
gmake: *** [all-recursive] Error 1


I know that this is in development, but openmpi-1.2b1
fails to run one of our major codes, so I hoped that with the more
recent version I would be more successful.

Lydia


--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] problem building openmpi-1.2b1r12657

2006-11-25 Thread Lydia Heck

My apologies ...
This was a red herring. It turned out that I had filled the disk.
It so happened that the same error was repeated several times, even after
reconfiguring.

Lydia

On Sat, 25 Nov 2006, Lydia Heck wrote:

>
> [...]
>

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] users Digest, Vol 443, Issue 1

2006-11-26 Thread Lydia Heck

You have to make sure that the path to the gm libraries is fully
set at runtime of your code:

LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/xx/gm/lib

and of course xx stands for the path to where the gm
directory is located.

Also for better performance you might want to use the sun compilers for
f77 as well.


export F77=/opt/SUNWspro/bin/f95
export FC=/opt/SUNWspro/bin/f95

Lydia

>
> Message: 3
> Date: Sat, 25 Nov 2006 22:15:07 -0400
> From: brem...@unb.ca
> Subject: [OMPI users] Myrinet/GM can't find any NICs
> To: us...@open-mpi.org
> Message-ID: <0tk61juf6c.wl%brem...@pivot.cs.unb.ca>
> Content-Type: text/plain; charset=US-ASCII
>
>
> Dear experts;
>
> I built openmpi-1.2b1 on solaris x86, enabling GM. Test jobs seem to
> run OK, but I assume it is falling back on TCP over ethernet.
> On of the following messages for each node.
> (The output from ompi_info follows; config.log and the full output can
> be found at http://www.cs.unb.ca/~bremner/openmpi)
>
> [cl023:14729] [0,1,1] gm_port 0828CBA8, board 0, global 3712481415 node 1 
> port 4
> [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14729] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14729] [btl_gm_component.c:409] unable to initialze gm port
> [cl023:14727] [0,1,0] gm_port 0828CBA8, board 0, global 3712481415 node 1 
> port 5
> [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14727] [mpool_gm_module.c:100] error(32) registering gm memory
> [cl023:14727] [btl_gm_component.c:409] unable to initialze gm port
> --
> [0,1,0]: Myrinet/GM on host cl023 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --
>
>
>
> Open MPI: 1.2b1
>Open MPI SVN revision: r12562
> Open RTE: 1.2b1
>Open RTE SVN revision: r12562
> OPAL: 1.2b1
>OPAL SVN revision: r12562
>   Prefix: /home/dbremner/pkg/openmpi-1.2b1-gm
>  Configured architecture: i386-pc-solaris2.10
>Configured by:
>Configured on: Sat Nov 25 16:56:01 AST 2006
>   Configure host: clhead
> Built by: dbremner
> Built on: Saturday November 25 17:16:33 AST 2006
>   Built host: clhead
>   C bindings: yes
> C++ bindings: yes
>   Fortran77 bindings: yes (all)
>   Fortran90 bindings: no
>  Fortran90 bindings size: na
>   C compiler: gcc
>  C compiler absolute: /home/dbremner/bin/gcc
> C++ compiler: g++
>C++ compiler absolute: /home/dbremner/bin/g++
>   Fortran77 compiler: g77
>   Fortran77 compiler abs: /opt/sfw/gcc-2/bin/g77
>   Fortran90 compiler: f95
>   Fortran90 compiler abs: /opt/SUNWspro/bin/f95
>  C profiling: yes
>C++ profiling: yes
>  Fortran77 profiling: yes
>  Fortran90 profiling: no
>   C++ exceptions: no
>   Thread support: solaris (mpi: no, progress: no)
>   Internal debug support: no
>  MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>  libltdl support: yes
>  mpirun default --prefix: no
>MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2)
>MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2)
>MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2)
>MCA timer: solaris (MCA v1.0, API v1.0, Component v1.2)
>MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2)
>   MCA io: romio (MCA v1.0, API v1.0, Component v1.2)
>MCA mpool: gm (MCA v1.0, API v1.0, Component v1.2)
>MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2)
>MCA mpool: udapl (MCA v1.0, API v1.0, Component v1.2)
>  MCA pml: cm (MCA v1.0, API v1.0, Component v1.2)
>  MCA pml: dr (MCA v1.0, API v1.0, Component v1.2)
>  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2)
>  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2)
>   MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2)
>   MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2)
>  MCA btl: gm (MCA v1.0, API v1.0.1, Component v1.2)
>  

[OMPI users] openmpi 1.2b1(r12657)

2006-12-10 Thread Lydia Heck

I am running the benchmark b_eff on a multiprocessor Opteron-based system.
The benchmark measures throughput, and it runs fine over
tcp/ip and myrinet on a cluster of 2 x 4 cores. When I run the
application on an 8-core system over 2 CPUs the run is fine. When I run it
over, say, 4 or more I get the error:


/opt/ompi/bin/mpirun -np 4 -machinefile myh -mca btl tcp,self b_eff

I sometimes get an error such as

ERROR - invalid message content after MPI_Alltoallv - myrank=1 i_rep=0 i_msg=15
i_pat=14 i_sr=1 i_loop=5 Msglng=71468 buf=(16 0)!=(16 31)

But not always. I searched the FAQs but could not find an entry with a similar
error.
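One way to narrow intermittent corruption like this down (a sketch, not a
known fix): since the failing run already uses tcp,self, forcing the traffic
onto shared memory isolates whether the TCP path is at fault:

```shell
# All four ranks on the one 8-core node, shared memory only:
/opt/ompi/bin/mpirun -np 4 -machinefile myh -mca btl sm,self b_eff
# Same ranks over loopback TCP for comparison (as in the failing run):
/opt/ompi/bin/mpirun -np 4 -machinefile myh -mca btl tcp,self b_eff
```

If the sm-only run is clean and the tcp-only run is not, that points at the
TCP BTL rather than at b_eff itself.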

Any idea?

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] crashed openmpi job fails to clean up ....

2006-12-19 Thread Lydia Heck

A job which crashes with a floating point underflow (or any IEEE floating point
exception) fails to clean up after itself when using

openmpi-1.3a1r12695 ..

Nodes with copies of slaves are sitting there ...

I also noticed that orted processes are left behind by other crashed jobs ..

Should I expect this?
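Until the cleanup works, stray orted daemons and slave processes can be
removed by hand. A dry-run sketch - the host list is a placeholder for this
cluster, and the echo keeps it from actually killing anything; drop it to
execute for real (assumes passwordless ssh and pkill on the nodes):

```shell
# Placeholder node names; in practice this would come from the hostfile.
hosts="m2009 m2010"
for h in $hosts; do
    # Prints the command only; remove 'echo' to really kill stray daemons.
    echo ssh "$h" 'pkill -u $USER orted'
done
```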

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] SEGV in ompi_coll_tuned_reduce_generic (1.2b4r13488)

2007-02-14 Thread Lydia Heck

When running one of our codes (Gadget2) either over myrinet or over gigabit,
it fails predictably with the following error message.
From the back trace it looks as if the SEGV is in
ompi_coll_tuned_reduce_generic.

Have there been similar reports, and/or is there a fix for this?

Lydia Heck


[m2042:08002] *** Process received signal ***
[m2042:08002] Signal: Segmentation Fault (11)
[m2042:08002] Signal code: Address not mapped (1)
[m2042:08002] Failing at address: 92
/opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:opal_backtrace_print+0x26
/opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:0xc3874
/lib/amd64/libc.so.1:0xcb686
/lib/amd64/libc.so.1:0xc0a52
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_generic+0x11b
[ Signal 11 (SEGV)]
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_binary+0x162
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_dec_fixed+0x28d
/opt/OMPI/ompi-1.2b4r13488/lib/libmpi.so.0.0.0:PMPI_Reduce+0x3f6
/data/4/nil/tak_gadget/gadget2/P-Gadget2:gravity_tree+0x146c
/data/4/nil/tak_gadget/gadget2/P-Gadget2:compute_accelerations+0x7e
/data/4/nil/tak_gadget/gadget2/P-Gadget2:run+0xa5
/data/4/nil/tak_gadget/gadget2/P-Gadget2:main+0x22f
/data/4/nil/tak_gadget/gadget2/P-Gadget2:0x7c3c
[m2042:08002] *** End of error message ***
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c
at line 275
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at
line 793
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 2 with PID 0 on node m2043 exited on signal 11
(Segmentation Fault).
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c
at line 188
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at
line 828
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value
Timeout instead of ORTE_SUCCESS.




--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


[OMPI users] MPI reduce ...

2007-02-23 Thread Lydia Heck


I was asked by a user whether MPI allreduce recognizes when
process ids are situated on the same node, so that the communication
can then proceed over shared memory rather than over the slower network
communication channels.

Would any of the openmpi developers be able to comment on
that question and on how openmpi handles this?
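For context, Open MPI chooses a transport per peer pair, so ranks of a
collective that share a node do talk over the shared-memory (sm) BTL when it
is available. A hedged way to convince oneself, by restricting the transport
list (the binary name is a placeholder):

```shell
# If this runs with all ranks on one node, on-node traffic is going over
# shared memory, since no network BTL is allowed:
mpirun -np 4 -machinefile ./myh -mca btl sm,self ./a.out
# Adding tcp lets off-node pairs communicate while on-node pairs keep sm:
mpirun -np 8 -machinefile ./myh -mca btl sm,tcp,self ./a.out
```

Whether the reduction algorithm itself is topology-aware is a separate
question for the developers; the BTL selection above only concerns the
point-to-point messages underneath it.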

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___