[OMPI users] Infiniband performance Problem and stalling

2012-08-31 Thread Randolph Pullen
(reposted with consolidated information)
I have a test rig comprising 2 i7 systems with 8GB RAM and Mellanox III
HCA 10G cards, running:
CentOS 5.7, kernel 2.6.18-274
Open MPI 1.4.3
MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2)
on a Cisco 24-port switch.
 
Normal performance is:
$ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
results in:
Max rate = 958.388867 MB/sec   Min latency = 4.529953 usec
and:
$ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
Max rate = 653.547293 MB/sec   Min latency = 19.550323 usec
 
NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes, which seems fine.
The log_num_mtt and log_mtts_per_seg module parameters are set to 20 and 2,
respectively.
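(For reference, these are mlx4_core module parameters; on a stock MLNX_OFED
setup they are typically set through a modprobe options file along the lines
of the sketch below. The exact file name and location may differ on your system.)

# example only: /etc/modprobe.d/mlx4_core.conf (file name may vary)
options mlx4_core log_num_mtt=20 log_mtts_per_seg=2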
 
My application exchanges about a gig of data between the processes,
with 2 sender and 2 consumer processes on each node and 1 additional controller
process on the starting node.
The program splits the data into 64K blocks and uses non-blocking
sends and receives with busy/sleep loops to monitor progress until completion.
Each process owns a single buffer for these 64K blocks.
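In outline, the exchange looks something like this (a minimal sketch,
not the actual code):

/* Sketch of the 64K-block non-blocking exchange with a busy/sleep
 * progress loop; hypothetical, not the real application. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 65536

/* Poll a request until it completes, sleeping briefly between tests. */
static void progress_wait(MPI_Request *req)
{
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            usleep(100);            /* back off instead of spinning hard */
    }
}

int main(int argc, char **argv)
{
    int rank;
    char *buf = calloc(1, BLOCK_SIZE); /* single reusable 64K buffer */
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, BLOCK_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
        progress_wait(&req);
    } else if (rank == 1) {
        MPI_Irecv(buf, BLOCK_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
        progress_wait(&req);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}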
 
 
My problem is that I see better performance under IPoIB than I do on
native IB (RDMA_CM).
My understanding is that IPoIB is limited to about 1 GB/s, so I am at
a loss to know why it is faster.
 
These 2 configurations are equivalent (about 8-10 seconds per
cycle):
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl
tcp,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl
tcp,self -H vh2,vh1 -np 9 --bycore prog
 
And this one produces similar run times but seems to degrade with
repeated cycles:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1
--mca btl openib,self -H vh2,vh1 -np 9 --bycore  prog
 
Other btl_openib_flags settings result in much lower
performance.
Changing the first of the above configs to use openib results in a
21 second run time at best.  Sometimes it takes up to 5 minutes.
In all cases, openib runs in twice the time TCP takes, except if
I push the small message max to 64K and force short messages.  Then the
openib times are the same as TCP and no faster.
 
With openib:
- Repeated cycles during a single run seem to slow down with each
cycle
(usually by about 10 seconds).
- On occasions it seems to stall indefinitely, waiting on a single
receive. 
 
I'm still at a loss as to why.  I can’t find any errors logged during the runs.
Any ideas appreciated.
 
Thanks in advance,
Randolph

Re: [OMPI users] Accessing data member of MPI_File struct

2012-08-31 Thread Jeff Squyres
On Aug 30, 2012, at 11:35 PM, Ammar Ahmad Awan wrote:

> My real problem is that I want to access the fields from the MPI_File 
> structure other than the ones provided by the API e.g. the fd_sys.  
> 
> Atomicity was just one example I used to explain my problem. If MPI_File is 
> an opaque structure, is there any other way or any other structure through 
> which I can reach the fields?

Nope.  The whole point is that MPI_File is a handle to an opaque data structure 
on the back end.  Using the API functions is the only portable way to get to 
the data.
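As a concrete illustration, here is a minimal sketch that queries file
properties through the portable MPI-IO accessors ("data.bin" is just a
placeholder file name):

/* Query MPI_File properties through the portable API rather than
 * reaching into the opaque handle. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    MPI_Offset size;
    int atomicity;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "data.bin",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    MPI_File_get_atomicity(fh, &atomicity);  /* current atomicity mode */
    MPI_File_get_size(fh, &size);            /* current file size */
    MPI_File_get_info(fh, &info);            /* hints actually in use */
    printf("atomicity=%d size=%lld\n", atomicity, (long long)size);

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}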

I see that you asked exactly the same question on both the Open MPI and MPICH2 
lists at the same time -- know that our back-end data structures are different. 
 Hence, even if you could access the fields of Open MPI, that wouldn't help you 
with MPICH2.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-31 Thread Brian Budge
Hi Ralph -

This is true, but we may not know until well into the process whether
we need MPI at all.  We have an SMP/NUMA mode that is designed to run
faster on a single machine.  We also may build our application on
machines where there is no MPI, and we simply don't build the code
that runs the MPI functionality in that case.  We have scripts all
over the place that need to start this application, and it would be
much easier to be able to simply run the program than to figure out
when or if mpirun needs to be starting the program.

Before, we went so far as to fork and exec a full mpirun when we run
in clustered mode.  This resulted in an additional process running,
and we had to use sockets to get the data to the new master process.
I very much like the idea of being able to have our process become the
MPI master instead, so I have been very excited about your work around
this singleton fork/exec under the hood.

Once I get my new infrastructure designed to work with mpirun -n 1 +
spawn, I will try some previous openmpi versions to see if I can find
a version with this singleton functionality intact.
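(For reference, the pattern I am after is roughly the sketch below: start the
binary directly, without mpirun, and spawn the workers from inside it.
"./worker" is just a placeholder name, not our actual program.)

/* Minimal singleton-spawn sketch (placeholder worker binary name). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);            /* started directly, no mpirun */

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);

    /* ... talk to the workers over the intercommunicator ... */

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}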

Thanks again,
  Brian

On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain  wrote:
> not off the top of my head. However, as noted earlier, there is absolutely no 
> advantage to a singleton vs mpirun start - all the singleton does is 
> immediately fork/exec "mpirun" to support the rest of the job. In both cases, 
> you have a daemon running the job - only difference is in the number of 
> characters the user types to start it.
>
>
> On Aug 30, 2012, at 8:44 AM, Brian Budge  wrote:
>
>> In the event that I need to get this up-and-running soon (I do need
>> something working within 2 weeks), can you recommend an older version
>> where this is expected to work?
>>
>> Thanks,
>>  Brian
>>
>> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge  wrote:
>>> Thanks!
>>>
>>> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain  wrote:
 Yeah, I'm seeing the hang as well when running across multiple machines. 
 Let me dig a little and get this fixed.

 Thanks
 Ralph

 On Aug 28, 2012, at 4:51 PM, Brian Budge  wrote:

> Hmmm, I went to the build directories of openmpi for my two machines,
> went into the orte/test/mpi directory and made the executables on both
> machines.  I set the hostsfile in the env variable on the "master"
> machine.
>
> Here's the output:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> ./simple_spawn
> Parent [pid 97504] starting up!
> 0 completed MPI_Init
> Parent [pid 97504] about to spawn!
> Parent [pid 97507] starting up!
> Parent [pid 97508] starting up!
> Parent [pid 30626] starting up!
> ^C
> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>
> I had to ^C to kill the hung process.
>
> When I run using mpirun:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> mpirun -np 1 ./simple_spawn
> Parent [pid 97511] starting up!
> 0 completed MPI_Init
> Parent [pid 97511] about to spawn!
> Parent [pid 97513] starting up!
> Parent [pid 30762] starting up!
> Parent [pid 30764] starting up!
> Parent done with spawn
> Parent sending message to child
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
> Child 1 disconnected
> Child 0 received msg: 38
> Child 0 disconnected
> Parent disconnected
> Child 2 disconnected
> 97511: exiting
> 97513: exiting
> 30762: exiting
> 30764: exiting
>
> As you can see, I'm using openmpi v 1.6.1.  I just barely freshly
> installed on both machines using the default configure options.
>
> Thanks for all your help.
>
> Brian
>
> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain  wrote:
>> Looks to me like it didn't find your executable - could be a question of 
>> where it exists relative to where you are running. If you look in your 
>> OMPI source tree at the orte/test/mpi directory, you'll see an example 
>> program "simple_spawn.c" there. Just "make simple_spawn" and execute 
>> that with your default hostfile set - does it work okay?
>>
>> It works fine for me, hence the question.
>>
>> Also, what OMPI version are you using?
>>
>> On Aug 28, 2012, at 4:25 PM, Brian Budge  wrote:
>>
>>> I see.  Okay.  So, I just tried removing the check for universe size,
>>> and set the universe size to 2.  Here's my output:
>>>
>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>> OMPI_MCA_orte_default_host

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-31 Thread Ralph Castain
I see - well, I hope to work on it this weekend and may get it fixed. If I do, 
I can provide you with a patch for the 1.6 series that you can use until the 
actual release is issued, if that helps.


On Aug 31, 2012, at 2:33 PM, Brian Budge  wrote:

> Hi Ralph -
> 
> This is true, but we may not know until well into the process whether
> we need MPI at all.  We have an SMP/NUMA mode that is designed to run
> faster on a single machine.  We also may build our application on
> machines where there is no MPI, and we simply don't build the code
> that runs the MPI functionality in that case.  We have scripts all
> over the place that need to start this application, and it would be
> much easier to be able to simply run the program than to figure out
> when or if mpirun needs to be starting the program.
> 
> Before, we went so far as to fork and exec a full mpirun when we run
> in clustered mode.  This resulted in an additional process running,
> and we had to use sockets to get the data to the new master process.
> I very much like the idea of being able to have our process become the
> MPI master instead, so I have been very excited about your work around
> this singleton fork/exec under the hood.
> 
> Once I get my new infrastructure designed to work with mpirun -n 1 +
> spawn, I will try some previous openmpi versions to see if I can find
> a version with this singleton functionality intact.
> 
> Thanks again,
>  Brian
> 
> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain  wrote:
>> not off the top of my head. However, as noted earlier, there is absolutely 
>> no advantage to a singleton vs mpirun start - all the singleton does is 
>> immediately fork/exec "mpirun" to support the rest of the job. In both 
>> cases, you have a daemon running the job - only difference is in the number 
>> of characters the user types to start it.
>> 
>> 
>> On Aug 30, 2012, at 8:44 AM, Brian Budge  wrote:
>> 
>>> In the event that I need to get this up-and-running soon (I do need
>>> something working within 2 weeks), can you recommend an older version
>>> where this is expected to work?
>>> 
>>> Thanks,
>>> Brian
>>> 
>>> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge  wrote:
 Thanks!
 
 On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain  wrote:
> Yeah, I'm seeing the hang as well when running across multiple machines. 
> Let me dig a little and get this fixed.
> 
> Thanks
> Ralph
> 
> On Aug 28, 2012, at 4:51 PM, Brian Budge  wrote:
> 
>> Hmmm, I went to the build directories of openmpi for my two machines,
>> went into the orte/test/mpi directory and made the executables on both
>> machines.  I set the hostsfile in the env variable on the "master"
>> machine.
>> 
>> Here's the output:
>> 
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>> ./simple_spawn
>> Parent [pid 97504] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97504] about to spawn!
>> Parent [pid 97507] starting up!
>> Parent [pid 97508] starting up!
>> Parent [pid 30626] starting up!
>> ^C
>> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>> 
>> I had to ^C to kill the hung process.
>> 
>> When I run using mpirun:
>> 
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>> mpirun -np 1 ./simple_spawn
>> Parent [pid 97511] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97511] about to spawn!
>> Parent [pid 97513] starting up!
>> Parent [pid 30762] starting up!
>> Parent [pid 30764] starting up!
>> Parent done with spawn
>> Parent sending message to child
>> 1 completed MPI_Init
>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>> 0 completed MPI_Init
>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>> 2 completed MPI_Init
>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>> Child 1 disconnected
>> Child 0 received msg: 38
>> Child 0 disconnected
>> Parent disconnected
>> Child 2 disconnected
>> 97511: exiting
>> 97513: exiting
>> 30762: exiting
>> 30764: exiting
>> 
>> As you can see, I'm using openmpi v 1.6.1.  I just barely freshly
>> installed on both machines using the default configure options.
>> 
>> Thanks for all your help.
>> 
>> Brian
>> 
>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain  wrote:
>>> Looks to me like it didn't find your executable - could be a question 
>>> of where it exists relative to where you are running. If you look in 
>>> your OMPI source tree at the orte/test/mpi directory, you'll see an 
>>> example program "simple_spawn.c" there. Just "make simple_spawn" and 
>>> execute that with your default hostfile set - does it work okay?
>>> 
>>> It works f

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-31 Thread Brian Budge
Thanks, much appreciated.

On Fri, Aug 31, 2012 at 2:37 PM, Ralph Castain  wrote:
> I see - well, I hope to work on it this weekend and may get it fixed. If I 
> do, I can provide you with a patch for the 1.6 series that you can use until 
> the actual release is issued, if that helps.
>
>
> On Aug 31, 2012, at 2:33 PM, Brian Budge  wrote:
>
>> Hi Ralph -
>>
>> This is true, but we may not know until well into the process whether
>> we need MPI at all.  We have an SMP/NUMA mode that is designed to run
>> faster on a single machine.  We also may build our application on
>> machines where there is no MPI, and we simply don't build the code
>> that runs the MPI functionality in that case.  We have scripts all
>> over the place that need to start this application, and it would be
>> much easier to be able to simply run the program than to figure out
>> when or if mpirun needs to be starting the program.
>>
>> Before, we went so far as to fork and exec a full mpirun when we run
>> in clustered mode.  This resulted in an additional process running,
>> and we had to use sockets to get the data to the new master process.
>> I very much like the idea of being able to have our process become the
>> MPI master instead, so I have been very excited about your work around
>> this singleton fork/exec under the hood.
>>
>> Once I get my new infrastructure designed to work with mpirun -n 1 +
>> spawn, I will try some previous openmpi versions to see if I can find
>> a version with this singleton functionality intact.
>>
>> Thanks again,
>>  Brian
>>
>> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain  wrote:
>>> not off the top of my head. However, as noted earlier, there is absolutely 
>>> no advantage to a singleton vs mpirun start - all the singleton does is 
>>> immediately fork/exec "mpirun" to support the rest of the job. In both 
>>> cases, you have a daemon running the job - only difference is in the number 
>>> of characters the user types to start it.
>>>
>>>
>>> On Aug 30, 2012, at 8:44 AM, Brian Budge  wrote:
>>>
 In the event that I need to get this up-and-running soon (I do need
 something working within 2 weeks), can you recommend an older version
 where this is expected to work?

 Thanks,
 Brian

 On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge  wrote:
> Thanks!
>
> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain  wrote:
>> Yeah, I'm seeing the hang as well when running across multiple machines. 
>> Let me dig a little and get this fixed.
>>
>> Thanks
>> Ralph
>>
>> On Aug 28, 2012, at 4:51 PM, Brian Budge  wrote:
>>
>>> Hmmm, I went to the build directories of openmpi for my two machines,
>>> went into the orte/test/mpi directory and made the executables on both
>>> machines.  I set the hostsfile in the env variable on the "master"
>>> machine.
>>>
>>> Here's the output:
>>>
>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>>> ./simple_spawn
>>> Parent [pid 97504] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 97504] about to spawn!
>>> Parent [pid 97507] starting up!
>>> Parent [pid 97508] starting up!
>>> Parent [pid 30626] starting up!
>>> ^C
>>> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>>>
>>> I had to ^C to kill the hung process.
>>>
>>> When I run using mpirun:
>>>
>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>>> mpirun -np 1 ./simple_spawn
>>> Parent [pid 97511] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 97511] about to spawn!
>>> Parent [pid 97513] starting up!
>>> Parent [pid 30762] starting up!
>>> Parent [pid 30764] starting up!
>>> Parent done with spawn
>>> Parent sending message to child
>>> 1 completed MPI_Init
>>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>>> 0 completed MPI_Init
>>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>>> 2 completed MPI_Init
>>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>>> Child 1 disconnected
>>> Child 0 received msg: 38
>>> Child 0 disconnected
>>> Parent disconnected
>>> Child 2 disconnected
>>> 97511: exiting
>>> 97513: exiting
>>> 30762: exiting
>>> 30764: exiting
>>>
>>> As you can see, I'm using openmpi v 1.6.1.  I just barely freshly
>>> installed on both machines using the default configure options.
>>>
>>> Thanks for all your help.
>>>
>>> Brian
>>>
>>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain  
>>> wrote:
 Looks to me like it didn't find your executable - could be a question 
 of where it exists relative to where you are running. If you look in 
 your OMPI source tree at the orte/test/mpi directory, y

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Andrea Negri
Hi, I have been struggling with this problem for a year.

I run a pure MPI (no OpenMP) Fortran fluid dynamics code on a cluster
of servers, and I observe strange behaviour when running the code on
multiple nodes.
The cluster consists of 16 PCs (1 PC is a node), each with a dual-core processor.
Basically, I'm able to run the code from the login node with the command:
mpirun  --mca btl_base_verbose 100 --mca backtrace_base_verbose 100
--mca memory_base_verbose 100 --mca sysinfo_base_verbose 100  -nolocal
-hostfile ./host_file -n 10  ./zeusmp2.x >> zmp_errors 2>&1
by selecting one process per core (i.e. in this case I use 5 nodes)

The code starts and runs correctly for some time.
After that, an entire node (sometimes two) "disappears" and cannot
be reached with ssh, which returns: No route to host.
Sometimes the node is still reachable, but the two processes that were
running on the node have disappeared.
Meanwhile, the processes on the other nodes are still running.

If I look at the output and error file of mpirun, the following
error is present: [btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: No route to host (113)

PS: I'm not the admin of the cluster; I installed gcc and
Open MPI on my own because the compilers available on that machine are 8
years old.


I post some information here; if you need anything else, just
tell me which command to type in the shell and I will
reply immediately.


compiler: gcc 4.7 (which was also used to compile Open MPI)
Open MPI version: 1.6

output of "ompi_info --all" from the node where I launch mpirun (which
is also the login node of the cluster)

  Package: Open MPI and...@cloud.bo.astro.it Distribution
Open MPI: 1.6
   Open MPI SVN revision: r26429
   Open MPI release date: May 10, 2012
Open RTE: 1.6
   Open RTE SVN revision: r26429
   Open RTE release date: May 10, 2012
OPAL: 1.6
   OPAL SVN revision: r26429
   OPAL release date: May 10, 2012
 MPI API: 2.1
Ident string: 1.6
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6)
  MCA memory: linux (MCA v2.0, API v2.0, Component v1.6)
   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.6)
   MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6)
   MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6)
   MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6)
   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.6)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6)
 MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6)
   MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.6)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.6)
MCA coll: self (MCA v2.0, API v2.0, Component v1.6)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.6)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.6)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.6)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.6)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.6)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.6)
 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.6)
 MCA pml: csum (MCA v2.0, API v2.0, Component v1.6)
 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.6)
 MCA pml: v (MCA v2.0, API v2.0, Component v1.6)
 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.6)
  MCA rcache: vma (MCA v2.0, API v2.0, Component v1.6)
 MCA btl: self (MCA v2.0, API v2.0, Component v1.6)
 MCA btl: sm (MCA v2.0, API v2.0, Component v1.6)
 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.6)
 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.6)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.6)
 MCA iof: h

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Gus Correa

Hi Andrea

I would guess this is a memory problem.
Do you know how much memory each node has?
Do you know how much memory
each MPI process in the CFD code requires?
If the program starts swapping/paging to disk because of
low memory, the sort of strange behaviour you described can happen.

I would log in to the compute nodes and monitor the
amount of memory being used with "top" right after the program
starts to run.  If there is a pattern of which node tends to fail,
log in to that fail-prone node and monitor it.
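For example, something along these lines, run from the login node
(node05 is just a placeholder host name), gives a quick snapshot:

ssh node05 'free -m; ps -eo pid,rss,comm --sort=-rss | head'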

Alternatively, if your cluster is running Ganglia,
you can see the memory use graphically
on the Ganglia web page in a browser.

If your cluster
doesn't allow direct user logins to compute nodes,
you can ask the system administrator to do this for you.

It may well be that the code has a memory leak, or that
it has a memory request spike, which may not be caught by
OpenMPI.
[Jeff and Ralph will probably correct me soon for
saying this, and I know the OpenMPI safeguards against
process misbehavior are great, but ...]

Anyway, we had several codes here that would make a node go south
because of either type of memory problem, and subsequently the
program would die, or the other processes in other nodes would
continue "running" [i.e. mostly waiting for MPI calls to the
dead node that would never return] as you described.

If the problem is benign, i.e., if the memory per process
is not large enough when running on 10 processors,
you can get around it by running on, say, 20 processors.

Yet another issue you may want to check is the stack size on the
compute nodes.  Many codes require a large stack size, i.e.,
they create large arrays in subroutines, etc., and
the default stack size in standard Linux distributions
may not be as large as needed.
We use an unlimited stack size on our compute nodes.

You can ask the system administrator to check this for you,
and perhaps change it in /etc/security/limits.conf to make it
unlimited or at least larger than the default.
The Linux shell command "ulimit -a" [bash] or
"limit" [tcsh] will tell you what the limits are.

I hope this helps,
Gus Correa

On 08/31/2012 07:15 PM, Andrea Negri wrote:

Hi, I have been struggling with this problem for a year.

I run a pure MPI (no OpenMP) Fortran fluid dynamics code on a cluster
of servers, and I observe strange behaviour when running the code on
multiple nodes.
The cluster consists of 16 PCs (1 PC is a node), each with a dual-core processor.
Basically, I'm able to run the code from the login node with the command:
mpirun  --mca btl_base_verbose 100 --mca backtrace_base_verbose 100
--mca memory_base_verbose 100 --mca sysinfo_base_verbose 100  -nolocal
-hostfile ./host_file -n 10  ./zeusmp2.x >> zmp_errors 2>&1
by selecting one process per core (i.e. in this case I use 5 nodes)

The code starts and runs correctly for some time.
After that, an entire node (sometimes two) "disappears" and cannot
be reached with ssh, which returns: No route to host.
Sometimes the node is still reachable, but the two processes that were
running on the node have disappeared.
Meanwhile, the processes on the other nodes are still running.

If I look at the output and error file of mpirun, the following
error is present: [btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: No route to host (113)

PS: I'm not the admin of the cluster; I installed gcc and
Open MPI on my own because the compilers available on that machine are 8
years old.


I post some information here; if you need anything else, just
tell me which command to type in the shell and I will
reply immediately.


compiler: gcc 4.7 (which was also used to compile Open MPI)
Open MPI version: 1.6

output of "ompi_info --all" from the node where I launch mpirun (which
is also the login node of the cluster)

   Package: Open MPI and...@cloud.bo.astro.it Distribution
 Open MPI: 1.6
Open MPI SVN revision: r26429
Open MPI release date: May 10, 2012
 Open RTE: 1.6
Open RTE SVN revision: r26429
Open RTE release date: May 10, 2012
 OPAL: 1.6
OPAL SVN revision: r26429
OPAL release date: May 10, 2012
  MPI API: 2.1
 Ident string: 1.6
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6)
   MCA memory: linux (MCA v2.0, API v2.0, Component v1.6)
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6)
MCA carto: file (MCA v2.0, API v2.0, Component v1.6)
MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6)
MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6)
MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)