[OMPI users] MPI daemon died unexpectedly

2012-03-27 Thread Grzegorz Maj
Hi,
I have an MPI application using ScaLAPACK routines. I'm running it on
OpenMPI 1.4.3, using mpirun to launch fewer than 100 processes. I've been
using it quite extensively for almost two years and it almost always
works fine. However, once every 3-4 months I get the following error
during the execution:

--
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--

It says that the daemon died while attempting to launch, but my
application (MPI grid) had been running for about 14 minutes before it
failed; I can tell that from the log messages my application produces
during execution. There is no more information from mpirun. One more
thing I know is that mpirun's exit status was 1, but I guess that is not
very helpful. There are no core files.
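
(As an aside, the "orte-clean" tool mentioned in the output above can be
run by hand on the affected nodes; a minimal sketch, with hypothetical
hostnames:)

    # Remove leftover Open MPI session files and kill stray job processes
    # on each node that hosted a daemon (node01/node02 are placeholders):
    ssh node01 orte-clean
    ssh node02 orte-clean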

I would appreciate any suggestions on how to debug this issue.

Regards,
Grzegorz Maj


Re: [OMPI users] MPI daemon died unexpectedly

2012-03-27 Thread John Hearns
Have you checked the system logs on the machines where this is running?
Is it perhaps that the processes use lots of memory and the Out Of
Memory (OOM) killer is killing them?
Also check all nodes for left-over 'orphan' processes which are still
running after a job finishes - these should be killed or the node
rebooted.
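
A minimal sketch of that log check, assuming the kernel log is reachable
via dmesg or a syslog file such as /var/log/messages (exact paths vary by
distribution):

    # Look for OOM-killer activity on each compute node around the failure time
    dmesg | grep -i "killed process"
    grep -i "out of memory" /var/log/messages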



Re: [OMPI users] MPI daemon died unexpectedly

2012-03-27 Thread Grzegorz Maj
John, thank you for your reply.

I checked the system logs and there are no signs of the OOM killer.

What do you mean by cleaning up 'orphan' processes? Should I check
whether any processes are left after each job execution? I have always
assumed that when mpirun terminates, everything is cleaned up.
Currently there are no processes left on the nodes. The failure
happened on Friday, and since then tens of similar jobs have completed
successfully.

Regards,
Grzegorz Maj




Re: [OMPI users] MPI daemon died unexpectedly

2012-03-27 Thread John Hearns
Grzegorz, sometimes when a parallel application quits there are
processes left running on the compute nodes. You can usually find
these by running 'pgrep -P 1' and excluding any processes owned by
root.
These 'orphan' processes use up memory, so if you are seeing
applications quit the way yours does, it is worth checking all
nodes to make sure there are no orphan processes (see the sketch
below).

But, as you say, it does not happen very often.
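
A minimal sketch of that orphan-process check, assuming standard procps
tools (long-running user daemons will also match and can be ignored):

    # List the children of init (PID 1), with process names
    pgrep -P 1 -l
    # The same idea, additionally filtering out root-owned system daemons
    ps -eo pid,ppid,user,comm | awk '$2 == 1 && $3 != "root"'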





Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-27 Thread Rodrigo Oliveira
Hi Edgar.

Thanks for the response. I just did not understand why the Barrier works
before I remove one of the client processes.

I tried it with 1 server and 3 clients and it worked properly. After I
removed 1 of the clients, it stopped working. So the removal is affecting
the behavior of the Barrier, I guess.

Does anyone have an idea?

On Mon, Mar 26, 2012 at 12:34 PM, Edgar Gabriel  wrote:

> I do not recall on what the agreement was on how to treat the size=1


Re: [OMPI users] Data distribution on different machines

2012-03-27 Thread Jeffrey Squyres
You might want to take an MPI tutorial or two; there are a few good ones
available on the net.

My favorites are the basic and intermediate level MPI tutorials at NCSA.


On Mar 25, 2012, at 1:13 PM, Rohan Deshpande wrote:

> Hi,
> 
> I want to distribute data across different machines using Open MPI.
> 
> I am a new user. Can someone point me to resources, or at least the
> functions I would have to use, to complete the task?
> 
> I am using red hat linux.
> 
> Thanks,
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] oMPI hang with IB question

2012-03-27 Thread Jeffrey Squyres
Dylan --

Sorry for the delay in replying.

On an offhand guess, does the problem go away if you run with:

  --mca mpi_leave_pinned 0

?
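
For example (a sketch only; the process count and the ./my_app binary
name are placeholders):

    # Re-run the job with the leave-pinned RDMA optimization disabled
    mpirun -np 16 --mca mpi_leave_pinned 0 ./my_app
    # Equivalently, set it through the environment before launching
    export OMPI_MCA_mpi_leave_pinned=0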


On Mar 20, 2012, at 3:35 PM, Dylan Nelson wrote:

> Hello,
> 
> I've been having trouble for a while now running some OpenMPI+IB jobs on
> multiple tasks. The problems are all "hangs" and are not reproducible: the
> same execution started again will in general proceed just fine past the
> point where it previously got stuck, but then gets stuck later. These stuck
> processes are pegged at 100% CPU usage and remain there for days if not killed.
> 
> The same kind of problem exists in oMPI 1.2.5, 1.4.2, and 1.5.3 (for the
> code I am running). This may well be some problem with the
> configuration/cluster; I am not claiming that it is a bug in oMPI, but was
> just hoping that someone might have a guess as to what is going on.
> 
> In ancient 1.2.5 the problem manifests as (I attach gdb to the stalled
> process on one of the child nodes):
> 
> 
> 
> (gdb) bt
> #0  0x2b8135b3f699 in ibv_cmd_create_qp () from
> /usr/lib64/libmlx4-rdmav2.so
> #1  0x2b8135b3faa6 in ibv_cmd_create_qp () from
> /usr/lib64/libmlx4-rdmav2.so
> #2  0x2b813407bff1 in btl_openib_component_progress ()
>   from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> #3  0x2b8133e6f04a in mca_bml_r2_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> #4  0x2b812f52c9ba in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #5  0x2b812f067b05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #6  0x in ?? ()
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no
> line number information.
> 0x2b8135b3f358 in pthread_spin_unlock@plt () from
> /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock@plt, which has
> no line number information.
> 0x0038c860b760 in pthread_spin_unlock () from /lib64/libpthread.so.0
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock, which has no
> line number information.
> 0x2b8135b3fc21 in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no
> line number information.
> 0x2b813407bff1 in btl_openib_component_progress ()
>   from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> (gdb) next
> Single stepping until exit from function btl_openib_component_progress,
> which has no line number information.
> 0x2b8133e6f04a in mca_bml_r2_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> (gdb) next
> Single stepping until exit from function mca_bml_r2_progress, which has no
> line number information.
> 0x2b812f52c9ba in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line
> number information.
> 0x2b812f067b05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no
> line number information.
> 
> ---hang--- (infinite loop?)
> 
> On a different task:
> 
> 0x2ba2383b4982 in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) bt
> #0  0x2ba2383b4982 in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #1  0x2ba237eefb05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #2  0x in ?? ()
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line
> number information.
> 0x2ba237eefb05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no
> line number information.
> 
> ---hang---
> 
> 
> 
> On 1.5.3 a similar "hang" problem happens, but the backtrace goes back to
> the original code call, which is an MPI_Sendrecv():
> 
> 
> 
> 3510OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
> (gdb) bt
> #0  progress_one_device () at btl_openib_component.c:3510
> #1  btl_openib_component_progress () at btl_openib_component.c:3541
> #2  0x2b722f348b35 in opal_progress () at runtime/opal_progress.c:207
> #3  0x2b722f287025 in opal_condition_wait (buf=0x2b636298,
> count=251328, datatype=0x6ef240, dst=12, tag=35,
>sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at
> ../../../../opal/threads/condition.h:99
> #4  ompi_request_wait_completion (buf=0x2b636298, count=251328,
> datatype

[OMPI users] Can not run a parallel job on all the nodes in the cluster

2012-03-27 Thread Hameed Alzahrani

Hi, 

When I run any parallel job I get results only from the submitting node, even
when I try to benchmark the cluster using LINPACK; the job appears to run only
on the submitting node. Is there a way to make Open MPI distribute the job
equally across all the nodes, according to the number of processors on each
node? Even if I specify that the job should use 8 processors, it looks like
Open MPI uses the submitting node's 4 processors instead of the processors on
the other nodes. I also tried --host, but it does not work correctly for
benchmarking the cluster. Does anyone use Open MPI for benchmarking a cluster,
or does anyone know how to make Open MPI divide a parallel job equally among
all the processors in the cluster?

Regards,
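
A minimal sketch of one way to spread ranks across the nodes with a
hostfile (the hostnames, slot counts, and the ./xhpl binary name are
hypothetical):

    # hosts.txt lists one node per line with its slot (core) count, e.g.:
    #   node01 slots=4
    #   node02 slots=4
    # Launch 8 processes across the listed nodes:
    mpirun -np 8 --hostfile hosts.txt ./xhpl
    # Round-robin the ranks across nodes instead of filling node01 first:
    mpirun -np 8 --hostfile hosts.txt --bynode ./xhpl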