I have experimented a bit more and found that if I set
OMPI_MCA_plm_rsh_num_concurrent=1024
a job with more than 2,500 processes will start and run.
However, when I searched the open-mpi web site for that variable, I could not
find any documentation of it.
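For anyone searching the archive later: the same parameter can also be passed
on the mpirun command line rather than through the environment. A sketch (the
application name is a placeholder):

  mpirun --mca plm_rsh_num_concurrent 1024 -np 2500 ./my_app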
Best wishes,
Lydia Heck
Our cluster counts more than 2,700 cores, and a job with
2,500 processes does not start.
Is there any advice?
Best wishes,
Lydia Heck
--
Dr E L Heck
Senior Computer Manager
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
… to the more recent versions.
If the developers are interested, I could ask the user to prepare the code for
you to have a look at the problem, which looks to be in MPI_Alloc_mem.
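For context, the call in question in minimal, self-contained form; this is a
sketch with an arbitrary buffer size, not the user's actual code:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      void *buf = NULL;
      MPI_Init(&argc, &argv);
      /* Request 1 MB through MPI rather than malloc; this is the
         routine that appears to fail in the user's code. */
      int rc = MPI_Alloc_mem(1 << 20, MPI_INFO_NULL, &buf);
      if (rc != MPI_SUCCESS)
          fprintf(stderr, "MPI_Alloc_mem failed with code %d\n", rc);
      else
          MPI_Free_mem(buf);
      MPI_Finalize();
      return 0;
  }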
Best wishes,
Lydia Heck
--
Dr E L Heck
University of Durham
Institute for Computational Cosmology
One of the big cosmology codes is Gadget-3 (Springel et al.).
The code uses MPI for interprocess communications. At the ICC in Durham we use
Open MPI and have been using it for ~3 years.
At the ICC Gadget-3 is one of the major research codes, and we have been running
it since it was written and …
I was advised, for a benchmark, to use the OPAL carto option to
assign specific cores to a job. I searched the web for an example
but have only found one set of man pages, which is rather cryptic
and assumes the knowledge of a developer rather than an end user.
Has anybody out there used this option?
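Not the carto option itself, but for anyone with the same underlying goal:
pinning ranks to specific cores can also be done with a rankfile, which later
Open MPI releases document. A sketch with hypothetical host names:

  rank 0=node001 slot=0
  rank 1=node001 slot=1
  rank 2=node002 slot=0
  rank 3=node002 slot=1

and then:

  mpirun -np 4 -rf myrankfile ./my_app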
In one of our big runs (512 CPUs) the code fails and produces, on a number
of nodes, the following type of error:
I have searched the FAQs but could not find an answer there.
There are difficulties getting the code to run because of its sheer size,
but there is no other indication of the problem.
Does anyone have any suggestions?
Subject: [OMPI users] how to select a specific network
> To: Open MPI Users
>
> On Fri, 11 Jan 2008 at 11:36:23 +0000, Lydia Heck wrote:
>
> > I have a setup which contains one set …
I should have added that the two networks are not routable,
and that they are private class B.
On Fri, 11 Jan 2008, Lydia Heck wrote:
>
> I have a setup which contains one set of machines
> with one nge and one e1000g network and of machines
> with two e1000g networks configured. I …
I have a setup which contains one set of machines
with one nge and one e1000g network, and one set of machines
with two e1000g networks configured. I am planning a
large run where all these computers will be occupied
with one job, and the MPI communication should only go
over one specific network, which is c…
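For the archive: the usual way to restrict Open MPI's TCP traffic to one
interface is the btl_tcp_if_include MCA parameter. A sketch, with the
interface name to be adjusted to whichever network is common to all machines:

  mpirun --mca btl_tcp_if_include e1000g0 -np 512 ./my_app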
One of our programs has got stuck - it has not terminated -
with the error messages:
mca_btl_tcp_frag_send: writev failed with errno=131.
Searching the Open MPI web site did not turn up a match.
What does it mean?
I am running 1.2.1r14096
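(An aside for anyone who hits this: errno values are OS-specific, so a tiny
C program run on the affected node shows what 131 means there; a minimal
sketch:)

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* Print this OS's description of the errno from the writev failure. */
      printf("errno 131: %s\n", strerror(131));
      return 0;
  }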
Lydia
I was asked by a user whether MPI allreduce recognizes when
processes are situated on the same node, so that the communication
can then proceed over shared memory rather than over the slower
network communication channels.
Would any of the Open MPI developers be able to comment on
that question?
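For reference, the operation being asked about, in minimal self-contained
form (a sketch, not the user's code):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, sum;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* Every rank contributes one value; the question is whether the
         on-node part of this reduction travels over shared memory. */
      MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0)
          printf("sum of ranks: %d\n", sum);
      MPI_Finalize();
      return 0;
  }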
When running over either Myrinet or gigabit, one of our codes (Gadget2)
fails predictably with the following error message.
From the back trace it looks as if the SEGV is in
ompi_coll_tuned_reduce_generic.
Have there been similar reports and/or is there a fix for this?
Lydia Heck
A job which crashes with a floating-point underflow (or any IEEE floating-point
exception) fails to clean up after itself using
openmpi-1.3a1r12695 ..
Nodes with copies of slaves are sitting there ...
I also noticed that orteds are left behind by other crashed jobs ..
Should I have to expect this?
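(A possible mitigation, assuming your build installs it: recent Open MPI
trees ship an orte-clean utility intended to kill stray orted daemons and
remove leftover session directories on a node. A sketch with hypothetical
host names, to be verified against your installation:

  for h in node01 node02; do ssh $h orte-clean; done )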
I am running the benchmark b_eff on a multiprocessor Opteron-based system.
The benchmark measures throughput, and it runs fine over
tcp/ip and Myrinet on clusters of 2 and 4 cores. When I run the
application on an 8-core system over 2 CPUs the run is fine. When I run it
over, say, 4 or more I …
You have to make sure that the path to the gm libraries is fully
set at runtime of your code:
LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/xx/gm/lib
where xx stands for the path to where the gm
directory is located.
Also, for better performance you might want to use the Sun compilers fo…
My apologies.
This was a red herring: it turned out that I had filled the disk.
It so happened that the same error was repeated several times, even after
reconfiguring.
Lydia
On Sat, 25 Nov 2006, Lydia Heck wrote:
>
> The configuration of openmpi-1.2b1r12657 goes fine.
> When I try …
The configuration of openmpi-1.2b1r12657 goes fine.
When I try to build, I get, somewhere into the build, the following
error message:
DEPDIR=.deps depmode=none /bin/bash ../../../../config/depcomp \
/bin/bash ../../../../libtool --tag=CC --mode=compile
/opt/studio11/SUNWspro/bin/cc -DHAVE_CONF…
I saved two cores, which might be of interest. However they
are so large that I cannot attach them to any email. But
I am very willing to submit them, if requested.
Lydia
--
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
…/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
mv: cannot access ./restart.20
31 additional processes aborted (not shown)
m2001(27) >
On Thu, 23 Nov 2006, Lydia Heck wrote:
>
> Gadget2 - I cannot …
Gadget2, which I cannot attach because it is not publicly available,
runs perfectly fine on any number of processes on systems such
as Solaris 10 with Sun CT6 gigabit, Sun CT5 and Myrinet gm, and an IBM Regatta ..
Sorry to be so expansive ...
When I run the code on 32 CPUs on Open MPI, mx, using the Studio 11 …
I have, again, successfully built and installed
mx and Open MPI, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
The version of Open MPI is 1.2b1;
the compiler used is Studio 11.
The code is the benchmark b_eff, which usually runs fine; I have used it
extensively for benchmarking.
When I try 192 CPUs I …
… ports, and on each system 3 Myrinet ports were open.
Lydia
On Mon, 20 Nov 2006, users-requ...@open-mpi.org wrote:
>
> Date: Mon, 20 Nov 2006 20:05:22 +0000 (GMT)
> From: Lydia Heck
> Subject: [OMPI users] myrinet mx and openmpi
I have built the Myrinet drivers with gcc or the Studio 11 compilers from Sun;
the following problem appears for both installations.
I have tested the Myrinet installations using Myricom's own test programs.
Then I built Open MPI using the Studio 11 compilers, enabling Myrinet.
All the library pat…
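For the archive, an mx-enabled build is configured roughly like this; a
sketch using this site's paths (the prefix is a placeholder), not the exact
command used:

  ./configure --prefix=/opt/openmpi --with-mx=/opt/mx \
      CC=/opt/studio11/SUNWspro/bin/cc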
I have solved this problem myself.
The mx drivers are built using the gcc compilers, both in 64 and 32 bit.
I was trying to build 64-bit Open MPI on the Sun, and I am afraid I overlooked
that I had to give the path to the 64-bit gcc libs EXPLICITLY in the build
of Open MPI. These libraries were …
I have Myricom mx installed and configured, and its communications work (using
mx commands such as mx_info to check).
Then I configured openmpi-1.3a1r12408 with mx, and the configuration
gave no errors. The build of Open MPI was without problems and it
installed properly. I can build and link …
> Could you try this without threads? We have tried to make the system work
> with threads, but our testing has been limited. First thing I would try is
> to make sure that we aren't hitting a thread-lock.
>
> Thanks
> Ralph
>
>
>
> On 10/20/06 2:11 AM, "Lydia Heck" wrote: …
In answer to Ralph's request and question:
indeed, the version number was incorrect; it should have been
openmpi-1.3a1r12121.
My configure command is:
#!/bin/ksh
CC="/opt/studio11/SUNWspro/bin/cc"
CFLAGS="-xarch=amd64a -I/opt/mx/include -I/opt/SUNWsge/include"
LDFLAGS="-xarch=amd64a -I/opt/m…
I have recently installed openmpi 1.3r1212a over tcp and gigabit
on a Solaris 10 x86/64 system.
The compilation of some test codes:
monte (a Monte Carlo estimate of pi),
connectivity, which tests connectivity between processes and nodes, and
prime, which calculates prime numbers (these test codes are exam…
…dependency-tracking \
--enable-cxx-exceptions \
--enable-smp-locks \
--enable-mpi-threads \
--enable-progress-threads \
--with-threads=solaris
On Tue, 17 Oct 2006, Lydia Heck wrote:
>
> I know that with 1.3a1 I am looking at a development release.
> However I do need t…
… the same error.
Yes, mx is definitely installed, and yes, the path to mx is definitely
/opt/mx ...
Any ideas?
Lydia Heck
--
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM
My apologies: I forgot to attach the config.log file.
On Thu, 21 Sep 2006, Lydia Heck wrote:
>
> I am trying to build openmpi-1.1.2 for Solaris x86/64 with the studio11
> compilers and including the mx drivers. I have gone past some hurdles.
> However when the configure script n…
I am trying to build openmpi-1.1.2 for Solaris x86/64 with the Studio 11
compilers, including the mx drivers. I have got past some hurdles.
However, when the configure script nears its end, where Makefiles are prepared,
I get error messages of the form:
config.status: creating ompi/mca/osc/rdma/M…