Terry,


I was under the impression that all connections had already been made because
of the nature of the program OpenMPI is invoking. LS-DYNA is a finite element
solver, and for any given simulation I run, the cores on each node must
constantly communicate with one another to check for various occurrences
(contact between pieces/parts, updating nodal coordinates, etc.).



I've run the program with --mca mpi_preconnect_mpi 1, and the simulation
started up successfully, which I take to mean the preconnect passed, since all
of the child processes came up on each individual node. Thanks for the
suggestion, though; it's a good place to start.
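
For reference, the invocation I'm using looks roughly like the line below (the
host file name, process count, and the LS-DYNA executable and input deck names
are placeholders here, not our exact setup):

mpirun --mca mpi_preconnect_mpi 1 -np 64 -hostfile hosts ./lsdyna_mpp i=input.k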



I've been worried (though I have no hard basis for it) that messages may be
getting queued up and hitting some kind of ceiling or timeout. Because LS-DYNA
is a finite element code, I believe the communication occurs on a large scale:
lots of very small packets going back and forth quickly. A few studies have
been done by the High Performance Computing Advisory Council
(http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they
suggest that LS-DYNA communicates at very, very high rates (I'm not certain,
but page 15 of that document points to hundreds of millions of messages in
only a few hours). Is there any kind of buffer or queue that OpenMPI builds up
if messages are created too quickly? Does it dispatch them immediately, or
does it attempt to apply some kind of traffic flow control?
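
To make the concern concrete, the kind of traffic pattern I have in mind is
sketched below. This is only an illustration of the pattern (we cannot see
LS-DYNA's source), not its actual code: each rank posts a burst of very small
nonblocking sends and receives every cycle and waits for the whole burst
before moving on.

/* burst.c -- illustrative sketch of many small messages per cycle.
 * This is NOT LS-DYNA code, just the pattern I suspect it generates.
 * Build: mpicc burst.c -o burst    Run: mpirun -np 8 ./burst */
#include <mpi.h>
#include <stdio.h>

#define MSGS_PER_CYCLE 1000   /* many small messages per "time step" */
#define CYCLES         100

int main(int argc, char **argv)
{
    int rank, size, right, left, c, m;
    double sbuf[MSGS_PER_CYCLE], rbuf[MSGS_PER_CYCLE];
    MPI_Request reqs[2 * MSGS_PER_CYCLE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;          /* neighbor to send to   */
    left  = (rank - 1 + size) % size;   /* neighbor to recv from */

    for (c = 0; c < CYCLES; c++) {
        /* post a burst of one-double nonblocking messages */
        for (m = 0; m < MSGS_PER_CYCLE; m++) {
            sbuf[m] = rank + c;
            MPI_Irecv(&rbuf[m], 1, MPI_DOUBLE, left,  m, MPI_COMM_WORLD, &reqs[m]);
            MPI_Isend(&sbuf[m], 1, MPI_DOUBLE, right, m, MPI_COMM_WORLD, &reqs[MSGS_PER_CYCLE + m]);
        }
        /* wait for the whole burst before the next "time step" */
        MPI_Waitall(2 * MSGS_PER_CYCLE, reqs, MPI_STATUSES_IGNORE);
    }

    if (rank == 0)
        printf("completed %d cycles of %d small messages each\n", CYCLES, MSGS_PER_CYCLE);
    MPI_Finalize();
    return 0;
}

My question is whether sends like these pile up inside OpenMPI when they are
produced faster than the network can drain them.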



Regards,

Robert Walters



  _____  

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Terry Dontje
Sent: Monday, May 02, 2011 1:45 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused



On 05/02/2011 01:27 PM, Robert Walters wrote: 

Open-MPI Users,



I've been using OpenMPI for a while now and am very pleased with it. I use
OpenMPI across eight Red Hat Linux nodes (8 cores each) on 1 Gbps Ethernet
behind a dedicated switch. After working out kinks in the beginning, we've
been using it periodically on anywhere from 8 to 64 cores. We use finite
element software named LS-DYNA. We do not have source code for this program;
it is compiled to work with OpenMPI 1.4.1 (I use 1.4.2), and we cannot make
changes or request code to see how it performs certain functions.



From time to time, I will be simulating a particular "job" in LS-DYNA and, for
some reason, it will quit, with OpenMPI issuing an MPI_ABORT and stating
"connect to address xx.xxx.xxx.xxx port xxx: Connection refused; trying
normal rsh (/usr/bin/rsh)." This error comes after running for hours, which
means that connections to the node it cites have already been made
previously. The particular node it names is random and changes from
simulation to simulation. We use SSH to communicate, and node-to-node
communication is open on any port.

I am curious what makes you think the connections to the node it's citing
have been made?  Are you sure the connection between the two processes has
been made?





Does any user have experience with this error, where a connection is
established and used for several hours, but after a seemingly random period
of time the program dies stating it can't make a connection?

Have you tried running the code giving mpirun the "-mca mpi_preconnect_mpi
1" option?  This will try (it isn't complete, but close) to establish all
connections at the start of the job.
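
Conceptually, what the option does is roughly equivalent to the sketch below
(my illustration of the idea, not Open MPI's actual implementation): every
rank exchanges a tiny message with every other rank right after MPI_Init, so
the connections that would otherwise be created lazily on first use are
opened up front.

/* warmup.c -- rough manual "preconnect": zero-byte exchange with every peer.
 * A sketch of the idea only, not Open MPI's internal mechanism. */
#include <mpi.h>
#include <stdlib.h>

static void warmup_all_pairs(MPI_Comm comm)
{
    int rank, size, peer, n = 0;
    MPI_Request *reqs;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    reqs = (MPI_Request *) malloc(2 * (size_t) size * sizeof(MPI_Request));

    for (peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;
        /* zero-byte messages are enough to trigger connection setup */
        MPI_Irecv(NULL, 0, MPI_BYTE, peer, 0, comm, &reqs[n++]);
        MPI_Isend(NULL, 0, MPI_BYTE, peer, 0, comm, &reqs[n++]);
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    warmup_all_pairs(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}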

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com




