Grzegorz, sometimes when a parallel application quits, there are
processes left running on the compute nodes. You can usually find
these by running 'pgrep -P 1' and excluding any processes owned by
root.
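
For example, something along these lines should do it ('pgrep -P 1' gives
you just the PIDs; the ps variant below also shows the owner, so the
root-owned ones are easy to filter out; this assumes GNU ps/procps on the
compute nodes):

  # processes re-parented to init (PID 1), excluding those owned by root
  ps -o pid,user,args --ppid 1 | awk 'NR == 1 || $2 != "root"'

Anything in that output that looks like a left-over piece of your MPI job
is a candidate for killing.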

These 'orphan' processes use up memory, so if you are seeing applications
quit unexpectedly like this, it is worth checking all the nodes and making
sure there are no orphan processes left behind.
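
If you want to sweep every node in one go rather than logging in to each,
something like this works, assuming pdsh (or another parallel shell) is
installed and that your compute node names are kept in a file - both of
those are assumptions about your setup:

  # run the orphan check on every node listed in the hostfile
  pdsh -w ^/path/to/hostfile 'ps -o pid,user,args --ppid 1'

Open MPI also ships an 'orte-clean' tool (the one the error text you quoted
refers to) that cleans up left-over daemons and session files on a node.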

But, as you say, it does not happen very often.


On 27/03/2012, Grzegorz Maj <ma...@wp.pl> wrote:
> John, thank you for your reply.
>
> I checked the system logs and there are no signs of oom killer.
>
> What do you mean by cleaning 'orphan' processes? Should I check if
> there are any processes left after each job execution? I have always
> assumed that when mpirun terminates, everything is cleaned up.
> Currently there are no processes left on the nodes. The failure
> happened on Friday, and since then tens of similar jobs have completed
> successfully.
>
> Regards,
> Grzegorz Maj
>
> 2012/3/27 John Hearns <hear...@googlemail.com>:
>> Have you checked the system logs on the machines where this is running?
>> Is it perhaps that the processes use lots of memory and the Out Of
>> Memory (OOM) killer is killing them?
>> Also check all nodes for left-over 'orphan' processes which are still
>> running after a job finishes - these should be killed or the node
>> rebooted.
>>
>> On 27/03/2012, Grzegorz Maj <ma...@wp.pl> wrote:
>>> Hi,
>>> I have an MPI application using ScaLAPACK routines. I'm running it on
>>> Open MPI 1.4.3. I'm using mpirun to launch fewer than 100 processes. I
>>> have been using it quite extensively for almost two years, and it almost
>>> always works fine. However, once every 3-4 months I get the following
>>> error during execution:
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>>
>>> It says that the daemon died while attempting to launch, but my
>>> application (an MPI grid) had been running for about 14 minutes before
>>> it failed. I can tell this from the log messages my application produces
>>> during execution. There is no further information from mpirun. The only
>>> other thing I know is that mpirun's exit status was 1, but I guess that
>>> is not very helpful. There are no core files.
>>>
>>> I would appreciate any suggestions on how to debug this issue.
>>>
>>> Regards,
>>> Grzegorz Maj
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
