Intel compiler 11.0.074
OpenMPI 1.4.1
Two different OSes: CentOS 5.4 (2.6.18 kernel) and Fedora 12 (2.6.32 kernel)
Two different CPUs: Opteron 248 and Opteron 8356.
Same binary for OpenMPI. Same binary for user code (VASP compiled for the older
arch).
When I supply a rankfile, then depending on the combo
Hi!
I'm trying to implement checkpointing on our cluster, and I have an obvious
question.
I guess this has been implemented many times by other users, so I would
appreciate it if someone shared their experience with me.
With serial/multithreaded jobs it is fairly clear. But what about parallel jobs?
We have "fat" 16-core nodes,
I will try to prepare a test case.
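For reference, the OpenMPI 1.3/1.4 series drives checkpoint/restart roughly as sketched below, assuming a BLCR-enabled build; the application name ./my_app and the mpirun PID are placeholders, not taken from this thread:

```
# Start the parallel job with the checkpoint/restart framework enabled
mpirun -np 16 -am ft-enable-cr ./my_app

# From another shell, checkpoint the running job by the PID of mpirun
ompi-checkpoint <mpirun_pid>

# Later, restart the job from the global snapshot that was written
ompi-restart ompi_global_snapshot_<mpirun_pid>.ckpt
```

The ompi-checkpoint tool coordinates a checkpoint across all ranks, so the whole parallel job is restarted consistently rather than per-process.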
--
Anton Starikov.
On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote:
hm, so I am out of ideas. I created multiple variants of test
programs which did what you basically described, and they all passed
and did not generate problems. I compiled the MUMPS
fo.txt.gz
Description: GNU Zip compressed data
--
Anton Starikov.
Computational Material Science,
Faculty of Science and Technology,
University of Twente.
Phone: +31 (0)53 489 2986
Fax: +31 (0)53 489 2910
On May 12, 2009, at 12:35 PM, Jeff Squyres wrote:
Can you send all the information listed here
" and "mpirun -np
5" both works, but in both cases there are only 4 tasks. It isn't
crucial, because there is nor real oversubscription, but there is
still some bug which can affect something in future.
--
Anton Starikov.
On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
By the way, this is Fortran code, which uses the F77 bindings.
--
Anton Starikov.
On May 12, 2009, at 3:06 AM, Anton Starikov wrote:
Due to rankfile fixes I switched to SVN r21208; now my code dies
with an error:
[node037:20519] *** An error occurred in MPI_Comm_dup
[node037:20519] *** on
(your MPI job will now abort)
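If it helps while I prepare the real test case, the failing call can be exercised with a minimal F77-bindings program along these lines (the program and variable names here are illustrative, not from the actual code):

```fortran
      program dup_test
      implicit none
      include 'mpif.h'
      integer ierr, rank, newcomm
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     Duplicate the world communicator -- the call that aborts above
      call MPI_COMM_DUP(MPI_COMM_WORLD, newcomm, ierr)
      if (rank .eq. 0) print *, 'MPI_Comm_dup succeeded'
      call MPI_COMM_FREE(newcomm, ierr)
      call MPI_FINALIZE(ierr)
      end
```

Built with mpif77 and run under the same rankfile, this should show whether MPI_Comm_dup alone reproduces the abort.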
--
Anton Starikov.
Although removing this check solves the problem of having more slots in
the rankfile than necessary, there is another problem.
If I set rmaps_base_no_oversubscribe=1, then if, for example:
hostfile:
node01
node01
node02
node02
rankfile:
rank 0=node01 slot=1
rank 1=node01 slot=0
rank 2=node02 slot=1
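With the files above, the invocation that hits the check would look roughly like this (the filenames and application name are placeholders):

```
# hostfile lists two slots on each of node01 and node02;
# the rankfile pins only three ranks, leaving one node02 slot unused
mpirun -np 3 --hostfile myhostfile --rankfile myrankfile \
       -mca rmaps_base_no_oversubscribe 1 ./my_app
```

Since only three of the four listed slots are actually used, no node is oversubscribed, so the no-oversubscribe check should arguably not reject this mapping.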
I can confirm that I have exactly the same problem, also on a Dell
system, even with the latest OpenMPI.
Our system is:
Dell M905
OpenSUSE 11.1
kernel: 2.6.27.21-0.1-default
OFED 1.4-21.12 from the SUSE repositories.
OpenMPI-1.3.2
But what I can also add: it does not affect only OpenMPI. If these messages