Hi Craig, list
Independent of any issues with your GigE switch,
which you may need to address,
you may want to take a look at the performance of the default
OpenMPI MPI_Alltoall algorithm, which you say is a cornerstone of VASP.
You can perhaps try alternative algorithms for different message
sizes.
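For example, the "tuned" collective component lets you pick the alltoall algorithm by hand. The MCA parameter names below are those of the 1.3.x coll tuned component; the algorithm ID and the rest of the command line are only illustrative:

  # list the available alltoall algorithms and their numeric IDs
  ompi_info --param coll tuned | grep alltoall

  # force one of them (e.g. pairwise exchange) instead of the built-in decision logic
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoall_algorithm 2 \
         -np 64 ./vasp

Benchmarking the candidates at the message sizes VASP actually sends is the safest way to choose.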
Patrick Geoffray wrote:
> Hi Oskar,
>
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps memory error or myrinet NIC hardware error. The system
>> hung.
>>
>> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
>
> I would not recommend using that setting. It will affect performance.
Most of that bandwidth is in marketing... Sorry, but it's not a high
performance switch.
Craig Plaisance wrote:
The switch we are using (Dell Powerconnect 6248) has a switching fabric
capacity of 184 Gb/s, which should be more than adequate for the 48
ports. Is this the same as backplane bandwidth?
Thanks, I'll let you know how it works out.
Kenneth
On Tue, 18 Aug 2009, Ralph Castain wrote:
Date: Tue, 18 Aug 2009 15:23:23 -0600
From: Ralph Castain
To: Kenneth Yoshimoto , Open MPI Users
Subject: Re: [OMPI users] orte_launch_agent usage?
I believe I found the problem on this and fixed it. If you can, you
might try installing the nightly tarball and see if that solves the
problem. Anything on or after r21827 will include the fix.
Ralph
On Aug 12, 2009, at 4:04 PM, Kenneth Yoshimoto wrote:
If I use -mca orte_launch_agent /h
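For reference, a typical use of that MCA parameter looks like the following; the wrapper path and its contents are hypothetical, not Kenneth's actual setup:

  # start a wrapper on the remote nodes instead of plain orted
  mpirun --mca orte_launch_agent /home/kenneth/bin/orted_wrapper -np 4 ./a.out

where orted_wrapper would set up the environment and then exec the real orted with the arguments it received.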
Hi,
i get this error when i use --rankfile,
"There are not enough slots available in the system to satisfy the 2 slots"
what could be the problem? I have tried using '*' for the 'slot' param and
many other configs without any luck. Without --rankfile everything
works fine. Will appreciate any help.
ma
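For reference, a minimal working rankfile/hostfile combination looks roughly like this (host names, slot numbers and the binary are made up for illustration, not taken from the poster's setup):

  # hostfile: declare how many slots each node offers
  node01 slots=2
  node02 slots=2

  # rankfile: pin each rank to a node and slot declared above
  rank 0=node01 slot=0
  rank 1=node02 slot=0

  mpirun -np 2 --hostfile my_hostfile --rankfile my_rankfile ./a.out

The "not enough slots" message usually means the hosts or slots named in the rankfile are not covered by the hostfile or the batch allocation.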
Thanks. I will try your suggestions and get back to you as soon as possible.
Julia
--- On Tue, 8/18/09, Eugene Loh wrote:
From: Eugene Loh
Subject: Re: [OMPI users] MPI loop problem
To: "Open MPI Users"
Date: Tuesday, August 18, 2009, 2:38 PM
Is the problem independent of the number of MPI processes?
Craig Plaisance wrote:
So is this a problem with the physical switch (we need a better switch)
or with the configuration of the switch (we need to configure the switch
or configure the os to work with the switch)?
You may want to check whether you are dropping packets somewhere. You can look
at the
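For example, packet loss on the nodes typically shows up in the NIC and TCP counters; the interface name below is only a guess for this setup:

  ifconfig eth0                  # check the RX/TX "errors", "dropped" and "overruns" fields
  netstat -s | grep -i retrans   # growing TCP retransmit counts point to loss in the fabric

The switch's own per-port statistics are worth checking as well.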
So is this a problem with the physical switch (we need a better switch)
or with the configuration of the switch (we need to configure the switch
or configure the os to work with the switch)?
Hi Oskar,
Oskar Enoksson wrote:
The reason in this case was that cl120 had some kind of hardware
problem, perhaps memory error or myrinet NIC hardware error. The system
hung.
I will try MX_ZOMBIE_SEND=0, thanks for the hint!
I would not recommend using that setting. It will affect performance.
Craig Plaisance wrote:
The switch we are using (Dell Powerconnect 6248) has a switching fabric
capacity of 184 Gb/s, which should be more than adequate for the 48
ports. Is this the same as backplane bandwidth?
Yes. If you are getting the behavior you describe, you are not getting
all that bandwidth.
The switch we are using (Dell Powerconnect 6248) has a switching fabric
capacity of 184 Gb/s, which should be more than adequate for the 48
ports. Is this the same as backplane bandwidth?
Is the problem independent of the number of MPI processes? (You
suggest this is the case.)
If so, does this problem show up even with np=1 (a single MPI process)?
If so, does the problem show up even if you turn MPI off?
If so, the problem would seem to be unrelated to the MPI implementation.
The OpenMPI version is
[julia.he@bob bin]$ mpirun --version
mpirun (Open MPI) 1.2.8
Report bugs to http://www.open-mpi.org/community/help/
The platform is
[julia.he@bob bin]$ uname -a
Linux bob.csi.cuny.edu 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux
Craig Plaisance wrote:
mpich2 now and post the results. So, does anyone know what causes the
wild oscillations in the throughput at larger message sizes and higher
network traffic? Thanks!
Your switch can't handle this amount of traffic on its backplane. We
have seen this often in similar
I ran a test of tcp using NetPIPE and got throughput of 850 Mb/s at
message sizes of 128 KB. The latency was 50 us. At message sizes above
1000 KB, the throughput oscillated wildly between 850 Mb/s and values as
low as 200 Mb/s. This test was done with no other network traffic. I
then ran f
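(For anyone who wants to reproduce this: the raw TCP test is run pairwise, with the receiver started first; the host name and output file below are illustrative:

  NPtcp                             # on the receiving node
  NPtcp -h node01 -o np_tcp.out     # on the sending node

The MPI flavour, NPmpi, is started with mpirun across two nodes for comparison.)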
Dear ALL,
I am trying to checkpoint an MPI application using the self
component. I had a look at the Open MPI FT user's guide, Draft 1.4, but am still
unsure.
I have installed openmpi as follows:
jean$ ./configure --prefix=/home/jean/openmpi/ --enable-debug
--enable-mpi-profile -
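In case it helps, the usual checkpoint/restart workflow looks something like the sketch below. This is how I read the FT guide; the exact options may differ between versions, and checkpoint support has to be compiled in (e.g. with --with-ft=cr):

  # run the job with checkpoint/restart enabled and the "self" CRS component
  mpirun -np 4 -am ft-enable-cr --mca crs self ./my_app

  # from another shell, checkpoint it using the PID of mpirun
  ompi-checkpoint <pid_of_mpirun>

  # later, restart from the snapshot reference that ompi-checkpoint printed
  ompi-restart <snapshot_reference>

With the self component the application additionally has to provide its own checkpoint/continue/restart callbacks, as described in the guide.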
On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote:
The question is, however, why is cl120 not acking messages? What
is the application? What MPI calls does this application use?
Scott
The reason in this case was that cl120 had some kind of hardware
problem, perhaps memory error or myrinet NIC hardware error. The system hung.
Scott Atchley wrote:
Long answer:
The messages below indicate that these processes were all trying to
send to cl120. It did not ack their messages after 1000 resend
attempts (each retry is attempted with a 0.5 second interval) which is
about 8.3 minutes (500 seconds).
The messages also
BTW: I did see one issue in your program. In the program that isn't working,
you declare the various input arrays for MPI_Comm_spawn_multiple, but only
the manager rank=0 ever initializes them. Thus, the other manager ranks were
passing random garbage down to the function.
Even though only the roo
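A minimal sketch of the pattern being described (the binary names, counts and the choice of C are illustrative, not taken from the original program):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* These arrays are only read on the root of the spawn (rank 0 below),
       * but initializing them on every rank avoids handing random garbage
       * to the library from the non-root managers. */
      char     *commands[2] = { "./worker_a", "./worker_b" };  /* hypothetical binaries */
      int       maxprocs[2] = { 1, 1 };
      MPI_Info  infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };
      MPI_Comm  intercomm;

      MPI_Comm_spawn_multiple(2, commands, MPI_ARGVS_NULL, maxprocs, infos,
                              0 /* root */, MPI_COMM_WORLD, &intercomm,
                              MPI_ERRCODES_IGNORE);

      MPI_Finalize();
      return 0;
  }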
By any chance did you have the flag on to check MPI parameters? I think we
have a bug in there that might be causing what you saw, but it would only be
active if you had requested that OMPI check parameters.
Thanks
Ralph
2009/8/18 Federico Golfrè Andreasi
> Here is what I've done: the worker pr
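(The parameter checking Ralph mentions corresponds, as far as I know, to the mpi_param_check MCA setting; you can see whether it is enabled and switch it off like this:

  ompi_info --param all all | grep mpi_param_check
  mpirun --mca mpi_param_check 0 -np 4 ./my_app

It only has an effect if the library was built with parameter checking compiled in.)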
Sorry, but there is no way to answer this question with what is given. What
is "my_sub" doing? Which version of OpenMPI are you talking about, and on
what platform?
On Tue, Aug 18, 2009 at 8:28 AM, Julia He wrote:
> Hi,
>
> I found that the subroutine call inside a loop did not return the correct value after certain iterations.
Hi,
I found that the subroutine call inside a loop did not
return the correct value after certain iterations. In order to simplify the
problem, the inputs to the subroutine are chosen to be constant, so the
output should be the same for every iteration on every computing node.
It is a Fortran program,
Here is what I've done: the worker program that gets spawned and the two versions of
the manager that call the spawning.
If you find something wrong please let me know.
Thank you,
Federico
2009/8/18 Ralph Castain
>
>
> Only the root process needs to provide the info keys for spawning anything.
>
Craig Plaisance wrote:
Hi - I have compiled VASP 4.6.34 using the Intel Fortran compiler 11.1
with openmpi 1.3.3 on a cluster of 104 nodes running Rocks 5.2, each node with two
quad-core Opterons, connected by Gbit Ethernet. Running in parallel on
Latency of gigabit is likely your issue. Lower qualit
Gbit Ethernet is well known to perform poorly for fine-grained code like
VASP. The latencies for Gbit Ethernet are much too high.
If you want good scaling in a cluster for VASP, you'll need to run
InfiniBand or some other high speed/ low latency network.
Jim
Only the root process needs to provide the info keys for spawning anything.
If that isn't correct, then we have a bug.
Could you send us a code snippet that shows what you were doing?
Thanks
Ralph
2009/8/18 Federico Golfrè Andreasi
> I think I've solved my problem:
>
> in the previous c
I think I've solved my problem:
in the previous code the arguments of MPI_Comm_spawn_multiple were
filled in only by the "root" process, not by all the processes in the group. Now
all the ranks have all that information and the spawn is done correctly.
But I read on http://www.mpi-forum.org/docs/mp
On Aug 18, 2009, at 5:12 AM, Federico Golfrè Andreasi wrote:
In the info object I only set the "host" key (after creatig the
object with the MPI_Info_create).
I've modified my code to leave out that request and created the
array of Info object as an array of MPI_INFO_NULL but the problem is
In the info object I only set the "host" key (after creatig the object with
the MPI_Info_create).
I've modified my code to leave out that request and created the array of
Info object as an array of MPI_INFO_NULL but the problem is still the same.
The error is thrown only when running with more tha