Have you folks used a debugger such as TotalView or padb to look at these
stalls?
I ask because we discovered a long time ago that MPI collectives can "hang" in
the scenario you describe. It is caused by one rank falling behind, and then
never catching up due to resource allocations - i.e., on
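If a parallel debugger isn't handy, one low-tech check is to time each
rank's arrival at a barrier placed just before the collective; a laggard
shows up as a large spread in the wait times. A minimal sketch (the
barrier stands in for your real collective; names are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... per-rank work would go here ... */

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);        /* stand-in for the real collective */
    double waited = MPI_Wtime() - t0;   /* smallest wait => that rank arrived last */

    double min_wait, max_wait;
    MPI_Reduce(&waited, &min_wait, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&waited, &max_wait, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("barrier wait: min %.3f s, max %.3f s\n", min_wait, max_wait);

    MPI_Finalize();
    return 0;
}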
Hi,
I found the reason why the program is killed by the operating system when
the problem size is large.
It consumes more memory, which leads to more swapping.
This also degrades the program's performance.
But I cannot determine which function of the worker process causes the
problem.
I
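One way to narrow that down is to bracket the suspect calls with
getrusage() so each worker reports how much its resident set grew. A
minimal sketch, assuming Linux (where ru_maxrss is reported in kB);
suspect_function() is a placeholder for your own routine:

#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

static long maxrss_kb(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;   /* peak resident set size, kB on Linux */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long before = maxrss_kb();
    /* suspect_function();   <- the routine you want to measure */
    long after = maxrss_kb();

    printf("rank %d: peak RSS grew %ld kB (now %ld kB)\n",
           rank, after - before, after);

    MPI_Finalize();
    return 0;
}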
Sudheer,
Locks in MPI aren't mutexes; they mark the beginning and end of a
passive-mode communication epoch. All MPI operations within an epoch
logically occur concurrently and must be non-conflicting. So what
you've written below is incorrect: the get is not guaranteed to complete
unt
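To make the epoch rules concrete, here is a minimal sketch of a
passive-target access epoch; the key point is that the buffer filled by
MPI_Get may only be read after MPI_Win_unlock returns:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int value = rank;      /* each rank exposes its own rank number */
    int fetched = -1;
    MPI_Win win;
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % nprocs;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&fetched, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    /* reading "fetched" HERE would be erroneous - the epoch is still open */
    MPI_Win_unlock(target, win);   /* epoch ends; the get is now complete */

    printf("rank %d fetched %d from rank %d\n", rank, fetched, target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}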
On Wed, Apr 13, 2011 at 2:49 PM, Barrett, Brian W wrote:
> This is mostly an issue of how MPICH2 and Open MPI implement lock/unlock.
> Some might call what I'm about to describe erroneous. I wrote the
> one-sided code in Open MPI and may be among those people.
>
> In both implementations, one-sid
This is mostly an issue of how MPICH2 and Open MPI implement lock/unlock.
Some might call what I'm about to describe erroneous. I wrote the
one-sided code in Open MPI and may be among those people.
In both implementations, one-sided communication is not necessarily truly
asynchronous. That is, t
Hello,
I am trying to better understand the semantics of passive synchronization in
one-sided communication calls. Doesn't MPI_Win_unlock()
block to ensure that all the preceding RMA calls in that epoch have been
synced?
In that case, why is an undefined value returned when trying to load from a
Hello All,
I have been enjoying using Transparent CR in Open MPI for my research!
I have a few questions regarding the working of ompi-restart:
1. Is there a fixed mapping of processes to resources when ompi-restart is
done?
2. Is there a way for the user to control it? If I am correct, ompi-restart
d
On Apr 13, 2011, at 10:19 AM, Jack Bryan wrote:
> Hi, I am using
>
> mpirun (Open MPI) 1.3.4
>
> But, I have these,
>
> orte-clean orted orte-iof orte-ps orterun
>
> Can they do the same thing?
Unfortunately, no
>
> If I use them, will they use a lot of memory on each wo
On Apr 13, 2011, at 10:29 AM, Jack Bryan wrote:
> Hi ,
>
> If I cannot ssh to a worker node, does it mean that my program cannot work
> correctly?
No, that's not true. People thought you were on a cluster using ssh as the
launcher. From prior notes, you were using Torque, so not being allowed
Hi,
I do not have qrsh
I have qrerun qrls qrttoppm qrun
Can they do the same thing?
thanks
> From: re...@staff.uni-marburg.de
> Date: Wed, 13 Apr 2011 16:28:14 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 13.04.2011 at 05:55, Jack Bryan wr
Hi ,
If I cannot ssh to a worker node, does it mean that my program cannot work
correctly?
I can run it with 32 nodes * 4 cores/node parallel processes. But for larger
runs, 128 nodes * 1 CPU/node, it is killed by signal 9.
Is this the reason?
thanks
> Date: Wed, 13 Apr 2011 05:59:1
Hi, I am using
mpirun (Open MPI) 1.3.4
But, I have these,
orte-clean orted orte-iof orte-ps orterun
Can they do the same thing?
If I use them, will they use a lot of memory on each worker node and print out
a lot of things to some log files?
Any help is really appreciated.
T
The 16 cores refer to the x3755-m2s. We have a mix of 3550s and 3755s in
the cluster.
It could be memory, but I think not. The jobs are well within memory
capacity, and the memory is mainly static. If we were out of memory, these
jobs would be the first candidates. Larger jobs run on the 3755s
w
Inline
-Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda,6640
> Sent: 13 April 2011 16:52
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
> Martin,
>
> We have seen simil
On 13.04.2011 at 17:09, Rushton Martin wrote:
> Version 1.3.2
>
> Consider a job that will run with 28 processes. The user submits it
> with:
>
> $ qsub -l nodes=4:ppn=7 ...
>
> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
> x3550x016 and x3550x020. Torque generates a
Martin,
We have seen similar behavior when using certain codes. CodeA can run at ppn=8
with no problem, but CodeB will run much more slowly (or hang) with ppn=8;
instead we use ppn=7 for CodeB.
This becomes complicated when we run CodeA and CodeB together (coupled
simulations). It requires
I'm afraid I can't comment on how OMPI was configured, "as supplied by
the suppliers"! The users experiencing these problems use the Intel
bindings, loaded via the modules command. We are running CentOS 5.3.
Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 0793
Afraid I have no idea - we regularly run on Torque machines with the nodes
fully populated. While most runs are only for a few hours, some runs go for
days.
How was OMPI configured? What OS version?
On Apr 13, 2011, at 9:09 AM, Rushton Martin wrote:
> Version 1.3.2
>
> Consider a job that w
Version 1.3.2
Consider a job that will run with 28 processes. The user submits it
with:
$ qsub -l nodes=4:ppn=7 ...
which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which lists
each node 7 times.
The mpirun co
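For illustration (host names taken from the example above), the generated
PBS_NODEFILE would contain 28 lines, seven per host:

x3550x014
x3550x014
... (seven lines per node, 28 in total)
x3550x020

and an Open MPI built with Torque support reads that allocation itself, so
a plain "mpirun ./app" (app being a placeholder) starts one process per
listed slot.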
On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
> good performance. Attempting to run with more processors brings
> problems, specifically if any one
On 13.04.2011 at 05:55, Jack Bryan wrote:
> I need to monitor the memory usage of each parallel process on a linux Open
> MPI cluster.
>
> But top and ps cannot help here because they only show the head node's
> information.
>
> I need to follow the behavior of each process on each clus
The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
good performance. Attempting to run with more processors brings
problems, specifically if any one node of a group of nodes has all 8
cores in use the job hang
What version are you using? If you are using 1.5.x, there is an "orte-top"
command that will do what you ask. It queries the daemons to get the info.
On Apr 12, 2011, at 9:55 PM, Jack Bryan wrote:
> Hi , All:
>
> I need to monitor the memory usage of each parallel process on a linux Open
> M
amosl...@gmail.com wrote:
Hi,
I am embarrassed! I submitted a note to the users on setting
up openmpi-1.4.3 using SUSE-11.3 under Linux and received several
replies. I wanted to transfer them but they disappeared for no
apparent reason. I hope that those who sent me messages wil
On 4/12/2011 8:55 PM, Jack Bryan wrote:
I need to monitor the memory usage of each parallel process on a linux
Open MPI cluster.
But top and ps cannot help here because they only show the head node's
information.
I need to follow the behavior of each process on each cluster node.
Did you
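A workaround that needs no extra tools: have each rank report its own host
name and resident set size, e.g. by reading /proc/self/status on Linux. A
minimal, Linux-specific sketch (names are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    long vmrss = -1;
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");  /* Linux-specific */
    if (f) {
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "VmRSS: %ld", &vmrss) == 1)
                break;                          /* resident set size, kB */
        fclose(f);
    }

    printf("rank %d on %s: VmRSS = %ld kB\n", rank, host, vmrss);

    MPI_Finalize();
    return 0;
}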
All,
It looks like the issue is solved. Our sysadmin had been working on the issue
too - he noticed a lot of "junk" in my /etc/ld.so.conf.d/ directory. After
"cleaning" it out (I think he ended up wiping everything out, then rebooting
the machine, then re-configuring specific items as needed)
Hi Rainer,
When executing "mpirun blacs_hello_example.exe" (reference:
http://www.netlib.org/blacs/BLACS/Examples.html#HELLO), I am now getting
the following error...
<<
C:\blacs_examples>mpirun blacs_hello_example.exe
forrtl: severe (157): Program Exception - access violation
Image P
Hi Rainer,
Thanks for acknowledgment.
> You may want to port/compile BLACS from netlib yourself; see here:
> http://icl.cs.utk.edu/lapack-for-windows/VisualStudio_install.html
With that I am seeing compilation errors as reported in...
http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=12&t=2354