I've attached the typical error message I've been getting. This is from a
run I initiated this morning. The first few lines are from the LS-DYNA
program itself and are just there to show that it had been running
successfully for about an hour and a half.



What's interesting is that this doesn't happen on every job I run, but it
will recur for the same simulation. For instance, Simulation A will run for
40 hours and complete successfully. Simulation B will run for 6 hours and
die with an error, and any further attempts to run Simulation B will always
end in the same error. This makes me think there is some kind of bad
calculation happening that OpenMPI doesn't know how to handle, or that
LS-DYNA doesn't know how to pass to OpenMPI. On the other hand, this
particular simulation is one of those "benchmarks" that everyone runs, so I
shouldn't be getting errors from the FE code itself. Odd. I think I'll try
this as an SMP job as well as an MPP job over a single node and see if the
issue continues. That way I can figure out whether it's OpenMPI-related or
FE-code-related, though as I mentioned, I don't think it is FE-code-related
since others have successfully run this particular benchmark.
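
For reference, here is a minimal sketch of the mechanism I suspect (my own
illustration, not LS-DYNA's actual code; the NaN check and the error code
are just placeholders). If the solver detects a bad value and calls
MPI_Abort with its own error code, mpirun prints exactly the kind of
"MPI_ABORT was invoked ... with errorcode" block that appears in the log
below:

/* Minimal sketch (not LS-DYNA's actual code): how an application-level
 * abort produces the "MPI_ABORT was invoked on rank 0 ... with errorcode"
 * message that mpirun prints. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double residual = NAN;   /* stand-in for a bad FE calculation */

    if (isnan(residual)) {
        fprintf(stderr, "rank %d: solver detected a bad value, aborting\n", rank);
        /* mpirun reports whatever error code the application passes here; a
         * large negative value like the -1525207032 in my log suggests the
         * code came from the FE solver rather than from Open MPI itself. */
        MPI_Abort(MPI_COMM_WORLD, -1525207032);
    }

    MPI_Finalize();
    return 0;
}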



Error Message:

 Parallel execution with     56 MPP proc

 NLQ used/max               152/   152

 Start time   05/02/2011 10:02:20  

 End time     05/02/2011 11:24:46  

 Elapsed time    4946 seconds(  1 hours 22 min. 26 sec.) for    9293 cycles



 E r r o r   t e r m i n a t i o n

--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 

with errorcode -1525207032.



NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

--------------------------------------------------------------------------

connect to address xx.xxx.xx.xxx port 544: Connection refused

connect to address xx.xxx.xx.xxx port 544: Connection refused

trying normal rsh (/usr/bin/rsh)

--------------------------------------------------------------------------

mpirun has exited due to process rank 0 with PID 24488 on

node allision exiting without calling "finalize". This may

have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).

--------------------------------------------------------------------------



Regards,

Robert Walters

  _____  

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Terry Dontje
Sent: Monday, May 02, 2011 2:50 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused



On 05/02/2011 02:04 PM, Robert Walters wrote: 

Terry,



I was under the impression that all connections are made because of the
nature of the program that OpenMPI is invoking. LS-DYNA is a finite element
solver and for any given simulation I run, the cores on each node must
constantly communicate with one another to check for various occurrences
(contact with various pieces/parts, updating nodal coordinates, etc.).



You might be right; the connections might have been established, but the
error message you cite ("connection refused") seems out of place if the
connection was already established.

Were there more error messages from OMPI other than "connection refused"? If
so, could you possibly provide that output to us? Maybe it will give us a
hint about where in the library things are going wrong.



I've run the program using --mca mpi_preconnect_mpi 1 and the simulation has
started up successfully, which I think means the preconnect passed, since
all of the child processes have started on each individual node. Thanks for
the suggestion, though; it's a good place to start.

Yeah, it could be telling if things do keep working with this setting.
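
For what it's worth, the same kind of warm-up can also be forced by hand.
The sketch below is only a conceptual stand-in for what I'd expect
--mca mpi_preconnect_mpi 1 to accomplish (an empty exchange with every peer
right after MPI_Init), not Open MPI's actual internals; if connection setup
is the problem, the failure should then show up at startup rather than hours
into the run.

/* Conceptual sketch only: force point-to-point connections up front by
 * exchanging an empty message with every other rank.  This is roughly the
 * effect I'd expect from "--mca mpi_preconnect_mpi 1"; it is not Open MPI's
 * internal implementation. */
#include <mpi.h>

static void preconnect_all(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank)
            continue;
        /* A zero-byte send/recv with each peer is enough to trigger the
         * (otherwise lazy) connection setup. */
        MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                     NULL, 0, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    preconnect_all(MPI_COMM_WORLD);   /* any connection failure would then
                                         show up here, at startup */
    MPI_Finalize();
    return 0;
}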





I've been worried (though I have no basis for it) that messages may be
getting queued up and hitting some kind of ceiling or timeout. Since LS-DYNA
is a finite element code, I think the communication occurs on a large scale:
lots of very small packets going back and forth quickly. A few studies have
been done by the High Performance Computing Advisory Council
(http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they
suggest that LS-DYNA communicates at very, very high rates (I'm not certain,
but p. 15 of that document seems to show hundreds of millions of messages in
only a few hours). Is there any kind of buffer or queue that OpenMPI builds
up if messages are created too quickly? Does it dispatch them immediately,
or does it attempt to apply some kind of traffic flow control?

The queuing really depends on what type of calls the application is making.
If it is doing blocking sends then I wouldn't expect too much queuing to
happen using the tcp btl.  As far as traffic flow control is concerned, I
believe the tcp btl doesn't do any for the most part and lets TCP handle
that.  Maybe someone else on the list can chime in if I am wrong here.
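
To illustrate what I mean about the call type, here is a small made-up
two-rank example (it has nothing to do with LS-DYNA's real communication
pattern): with blocking MPI_Send each small message is handed off before the
next one is produced, whereas a burst of MPI_Isend calls leaves all of the
requests in flight at once, with the data sitting in library and
unexpected-message queues until the matching receives and the final
MPI_Waitall complete.

/* Illustrative sketch only (not LS-DYNA's communication pattern): how the
 * choice of MPI call affects how many messages can pile up in the library.
 * Run with two ranks, e.g. "mpirun -np 2 ./queue_sketch". */
#include <mpi.h>
#include <stdlib.h>

#define NMSG 10000   /* many small messages */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *buf = calloc(NMSG, sizeof *buf);

    if (rank == 0) {
        /* Blocking sends: each message is handed off before the next one is
         * produced, so little builds up on the sender side. */
        for (int i = 0; i < NMSG; ++i)
            MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, i, MPI_COMM_WORLD);

        /* Non-blocking sends: all NMSG requests are in flight at once; the
         * data sits in library/receiver queues until the matching receives
         * are posted and MPI_Waitall completes. */
        MPI_Request *req = malloc(NMSG * sizeof *req);
        for (int i = 0; i < NMSG; ++i)
            MPI_Isend(&buf[i], 1, MPI_DOUBLE, 1, i, MPI_COMM_WORLD, &req[i]);
        MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
        free(req);
    } else if (rank == 1) {
        /* Matching receives for both phases (2 * NMSG messages in total). */
        for (int i = 0; i < 2 * NMSG; ++i)
            MPI_Recv(&buf[i % NMSG], 1, MPI_DOUBLE, 0, i % NMSG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}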

In the past I have seen heavy traffic on the network, and to a particular
node, cause some connections not to be established, but I don't know of any
outstanding issues like that right now.

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com




