-----Original Message-----
If the interface is down, should localhost still allow mpirun to run MPI
processes?
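One common way to handle that case, as a sketch assuming an Open MPI 1.4-era
build: restrict the job to the loopback and shared-memory transports so no
external interface is needed, e.g.

  mpirun --mca btl self,sm --mca oob_tcp_if_include lo -np 2 ./a.out

The MCA parameter names may differ in other releases.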
In the last few years, it has been very simple to
set up high-performance (GbE) multiple back-to-back
connections between three nodes (triangular topology)
or four nodes (tetrahedral topology).
The only things you had to do were:
- use 3 (or 4) cheap compute nodes w/Linux and connect
each of them
Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
interface to request a checkpoint of an application. This API is
defined on the following website:
http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_checkpoint
This will behave the same as if you requested the checkpoint of t
FWIW: I have tracked this problem down. The fix is a little more complicated
than I'd like, so I'm going to have to ping some other folks to ensure we
concur on the approach before doing something.
On Oct 25, 2011, at 8:20 AM, Ralph Castain wrote:
> I still see it failing the test George provided on the trunk.
Hi Mouhamad
The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried to add this line:
* - stack -1
then run wrf again? [Note no "#" hash character]
Also, if you login to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [bash]?
Hi all,
I've checked the "limits.conf", and it contains these lines:
# Jcb 29.06.2007 : pbs wrf (Siji)
#*   hard   stack     100
#*   soft   stack     100
# Dr 14.02.2008 : for voltaire mpi
*    hard   memlock   unlimited
*    soft   memlock   unlimited
Many thanks for your help
Hi Mouhamad, Ralph, Terry
Very often big programs like wrf crash with a segfault because they
can't allocate memory on the stack, and they assume the system doesn't
impose any limit on it. This has nothing to do with MPI.
Mouhamad: Check if your stack size is set to unlimited on all compute
nodes.
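A quick way to verify what the wrf processes themselves inherit (a minimal
sketch, not code from the thread; the file name, compile line, and reuse of
the same machinefile are assumptions) is a tiny MPI program that prints
getrlimit(RLIMIT_STACK) on every rank:

/* stacklimit.c - print the stack limit each MPI rank actually sees. */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    getrlimit(RLIMIT_STACK, &rl);

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("rank %d on %s: stack size unlimited\n", rank, host);
    else
        printf("rank %d on %s: stack size %lu kB\n",
               rank, host, (unsigned long)(rl.rlim_cur / 1024));

    MPI_Finalize();
    return 0;
}

Compile with 'mpicc stacklimit.c -o stacklimit' and launch it exactly like
wrf.exe; if any rank reports a small limit, that node's limits.conf (or the
environment the MPI daemons start under) still needs fixing.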
Looks like you are crashing in wrf - have you asked them for help?
On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:
> Hi again,
>
> This is exactly the error I have:
>
>
> taskid: 0 hostname: part034.u-bourgogne.fr
> [part034:21443] *** Process received signal ***
> [part034:21443
My best guess is that you are seeing differences in scheduling behavior with
respect to memory locale. I notice that you are not binding your processes, and
so they are free to move around the various processors on the node. I would
guess that your thread is winding up on a processor that is non
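For what it's worth, a hedged suggestion: with the 1.4/1.5-series mpirun you
can pin the processes and see where they land with something like

  mpirun --bind-to-core --report-bindings -np 2 ./a.out

so each thread stays near the memory it allocates; the option names changed
in later releases.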
This looks more like a seg fault in wrf and not OMPI.
Sorry, not much I can do here to help you.
--td
On 10/25/2011 9:53 AM, Mouhamad Al-Sayed-Ali wrote:
Hi again,
This is exactly the error I have:
taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
I still see it failing the test George provided on the trunk. I'm unaware of
anyone looking further into it, though, as the prior discussion seemed to just
end.
On Oct 25, 2011, at 7:01 AM, orel wrote:
> Dears,
>
> I have been trying for several days to use advanced MPI-2 features in the following
> scena
Hi again,
This is exactly the error I have:
taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
Hello
can you run wrf successfully on one node?
No, it can't run on one node
Can you run a simple code across your two nodes? I would try
hostname then some simple MPI program like the ring example.
Yes, I can run a simple code
many thanks
Mouhamad
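For reference, a minimal ring-style test of the kind suggested above could
look like this (a sketch, not code from the thread; run it with at least two
processes using the same machinefile):

/* ring.c - each rank passes an integer token around MPI_COMM_WORLD. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;  /* rank 0 starts the token and waits for it to return */
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got the token back: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Compile with 'mpicc ring.c -o ring' and run it with the same machinefile
('mpirun -machinefile <machines> -np 4 ./ring'); if that works across both
nodes but wrf.exe does not, the MPI transport itself is probably fine.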
Can you run wrf successfully on one node?
Can you run a simple code across your two nodes? I would try hostname
then some simple MPI program like the ring example.
--td
On 10/25/2011 9:05 AM, Mouhamad Al-Sayed-Ali wrote:
hello,
-What version of ompi are you using
I am using ompi version
hello,
-What version of ompi are you using
I am using ompi version 1.4.1-1 compiled with gcc 4.5
-What type of machine and os are you running on
I'm using a 64-bit Linux machine.
-What does the machine file look like
part033
part033
part031
part031
-Is there a stack trace left behind by the pid that seg faulted?
Dear all,
I have been trying for several days to use advanced MPI-2 features in the
following scenario:
1) a master code A (of size NPA) spawns (MPI_Comm_spawn()) two slave
codes B (of size NPB) and C (of size NPC), providing intercomms
A-B and A-C;
2) I create intracomms AB and AC by merging the intercomms
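A skeleton of the master (code A) side of that scenario, for reference only
(a sketch, not code from the thread; the executable names "b.exe"/"c.exe" and
the sizes stand in for NPB/NPC, and the spawned codes must call
MPI_Comm_get_parent() plus the matching MPI_Intercomm_merge()):

/* master_a.c - spawn B and C, then merge each intercomm into an intracomm. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter_ab, inter_ac;   /* intercomms A-B and A-C */
    MPI_Comm intra_ab, intra_ac;   /* merged intracomms AB and AC */
    int npb = 2, npc = 2;          /* placeholders for NPB and NPC */

    MPI_Init(&argc, &argv);

    /* 1) spawn the two slave codes, getting an intercomm to each */
    MPI_Comm_spawn("b.exe", MPI_ARGV_NULL, npb, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &inter_ab, MPI_ERRCODES_IGNORE);
    MPI_Comm_spawn("c.exe", MPI_ARGV_NULL, npc, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &inter_ac, MPI_ERRCODES_IGNORE);

    /* 2) merge each intercomm; high=0 here (and high=1 in B and C)
          orders A's ranks first in the resulting intracomms AB and AC */
    MPI_Intercomm_merge(inter_ab, 0, &intra_ab);
    MPI_Intercomm_merge(inter_ac, 0, &intra_ac);

    /* ... use intra_ab / intra_ac ... */

    MPI_Comm_free(&intra_ab);
    MPI_Comm_free(&intra_ac);
    MPI_Comm_disconnect(&inter_ab);
    MPI_Comm_disconnect(&inter_ac);
    MPI_Finalize();
    return 0;
}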
Some more info would be nice like:
-What version of ompi are you using
-What type of machine and os are you running on
-What does the machine file look like
-Is there a stack trace left behind by the pid that seg faulted?
--td
On 10/25/2011 8:07 AM, Mouhamad Al-Sayed-Ali wrote:
Hello,
I have t
Hello,
I have tried to run the executable "wrf.exe", using
mpirun -machinefile /tmp/108388.1.par2/machines -np 4 wrf.exe
but I've got the following error:
--
mpirun noticed that process rank 1 with PID 9942 on node
pa
Hello Meredith,
Yes, I have tried the plugin already. The problem is that the plugin seems to
be forever stuck in "Waiting for job information" stage. I scouted around a bit
on how to solve the problem, and it did not seem straightforward. At least, the
solution to me seemed like a one-time won
Thanks, Ralph. Yes, I have taken that into account. The problem is not to
compare two procs with one proc, but the "multi-threading effect".
Multi-threading is good on the first machine for one and two procs, but on
the second machine, it disappears for two procs.
To narrow down the problem, I reins
Okay - thanks for testing it.
Of course, one obvious difference is that there isn't any communication when
you run only one proc, but there is when you run two or more, assuming your
application has MPI send/recv calls in it (or calls collectives and other
functions that communicate). Communicat