>>>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2.
>>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
>>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2.
>>>>>>>>
>>>>>>>> Does it mean that the process with rank 0 should be bound
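For context, binding reports of that kind come from requesting explicit binding on the mpirun command line; with the 1.6 series mentioned later in this thread the options look roughly like the sketch below (the host names and program name are placeholders, and --report-bindings makes each node print what it actually bound):

    # ask Open MPI 1.6 to bind each rank to a core and report the result
    mpirun --bind-to-core --report-bindings -np 3 --host aa,bb,cc ./a.out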
On 09/07/2012 08:02 AM, Jeff Squyres wrote:
> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it
>> always the same node that crashes? And so on.
>
> Another thought on hardware errors... I actually have seen bad RAM cause
> spontaneous reboots with no Linux warnings.
On 09/03/2012 04:39 PM, Andrea Negri wrote:
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
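A locked-memory limit of only 32 kB is worth flagging on its own: Open MPI has to pin (register) memory for RDMA transports such as InfiniBand, and a tiny memlock limit is a frequent source of trouble. A minimal sketch of checking and raising it, assuming a RHEL-style /etc/security/limits.conf and that the limit must apply to the user the MPI daemons run as:

    # check the limit in the environment the MPI processes actually inherit
    ulimit -l

    # as root, raise it persistently in /etc/security/limits.conf, e.g.:
    #   *  soft  memlock  unlimited
    #   *  hard  memlock  unlimited
    # then log in again (or restart the launcher daemon) and re-check
    ulimit -l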
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:
> I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but
> the program doesn't crash; a node goes down and the rest of them remain
> waiting for a signal (there is an ALLREDUCE in the code).
>
> Anyway, yesterday some processes died (without
>>>> wonder if there's some kind of bad interaction going on here...
>>>>>
>>>>> The transparent hugepage is "transparent", which means it is
>>>>> automatically applied to all applications unless it is explicitly told
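If transparent hugepages are a suspect, they can be checked and switched off at runtime for a test. A sketch, assuming the usual sysfs location (on RHEL/CentOS 6 kernels the directory is redhat_transparent_hugepage instead):

    # show the current THP mode; the active setting is shown in brackets
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # as root, disable THP until the next reboot and re-run the job
    echo never > /sys/kernel/mm/transparent_hugepage/enabled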
Andrea,

As suggested by the previous answers, I guess the size of your problem is too
large for the memory available on the nodes. I can run ZeusMP without any
issues up to 64 processes, both over Ethernet and InfiniBand. I tried the 1.6
release and the current trunk, and both perform as expected.
>>> ls /var/log returns
>>> acpid            cron.4   messages.3          secure.4
>>> anaconda.log     cups     messages.4          spooler
>>> anaconda.syslog  dmesg    mpi_uninstall.log   spooler.1
--
> There are no nodes allocated to this job.
> --------------------------------------------------------------------------
>
> all the time.
>
> ==
>
> I configured with:
>
> ./configure --prefix=$HOME/local/... --enable-static --disable-shared
> --with-sge
>
> and adjusted my PATHs accordingly (at least: I hope so).
>
> -- Reuti
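Since the build was configured with --with-sge, it is worth confirming that the PATH really resolves to this installation and that the gridengine support was actually compiled in; a quick sketch:

    # make sure the intended mpirun is the one being picked up
    which mpirun
    mpirun --version

    # the gridengine components should appear if --with-sge took effect
    ompi_info | grep -i gridengine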
------------------------------

Message: 1
Date: Sat, 1 Sep 2012 08:48:56 +0100
From: John Hearns
Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
        of servers
To: Open MPI Users
I have tried to run with a single process (i.e. the entire grid is
contained by one process) and the command free -m on the compute
node returns:

             total       used       free     shared    buffers     cached
Mem:          3913       1540       2372          0         49       1234
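So a node has roughly 4 GB of RAM, with about 2.4 GB free in the single-process case. To see whether the multi-node runs actually exhaust that, it can help to watch memory on every compute node while the job is running; a rough sketch (node names are placeholders, and passwordless ssh to the nodes is assumed):

    # print a memory snapshot from each compute node every 30 seconds
    while true; do
        for node in node01 node02; do
            echo "=== $node ==="
            ssh "$node" free -m | grep '^Mem:'
        done
        sleep 30
    done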
Apologies, I have not taken the time to read your comprehensive diagnostics!
As Gus says, this sounds like a memory problem.
My suspicion would be the kernel Out Of Memory (OOM) killer.
Log into those nodes (or ask your systems manager to do this). Look
closely at /var/log/messages, where there will
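A quick way to test the OOM-killer theory is to search the system logs on the node that dropped out (the rotated messages.N files listed earlier hold the relevant history); a sketch:

    # evidence of the OOM killer terminating processes
    grep -iE 'out of memory|oom-killer|killed process' /var/log/messages*

    # hardware-level complaints (bad RAM, machine-check exceptions) often land here too
    grep -iE 'machine check|mce|edac' /var/log/messages*
    dmesg | grep -iE 'oom|error'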
Hi Andrea
I would guess this is a memory problem.
Do you know how much memory each node has?
Do you know how much memory each MPI process in the CFD code requires?
If the program starts swapping/paging to disk because of
low memory, those interesting things that you described can happen.
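One way to answer the per-process question is to look at the resident set size of the ranks while they are running on a node; a sketch (zeusmp is only a guess at the executable name, substitute your binary):

    # resident (RSS) and virtual (VSZ) memory of each MPI rank on this node, in kB
    ps -C zeusmp -o pid,rss,vsz,comm

    # or simply list the most memory-hungry processes
    ps aux --sort=-rss | head -20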
Hi, I have been in trouble for a year.
I run a pure MPI (no OpenMP) Fortran fluid dynamics code on a cluster
of servers, and I obtain strange behaviour when running the code on
multiple nodes.
The cluster is formed by 16 PCs (1 PC is a node), each with a dual-core processor.
Basically, I'm able to run th