Re: [OMPI users] Problem including C MPI code from C++ using C linkage

2010-09-02 Thread Jeff Squyres
On Aug 31, 2010, at 5:39 PM, Patrik Jonsson wrote:

> It seems a bit presumptuous of mpi.h to just include mpicxx.h just
> because __cplusplus is defined, since that makes it impossible to link
> C MPI code from C++.

The MPI standard requires that <mpi.h> work in both C and C++ applications.  It 
also requires that <mpi.h> include all the C++ binding prototypes when 
relevant.  Hence, there's not much we can do here.

> I've had to resort to something like
> 
> #ifdef __cplusplus
> #undef __cplusplus
> #include <mpi.h>
> #define __cplusplus
> #else
> #include <mpi.h>
> #endif

As you noted, that doesn't seem like a good idea.

> in c-code.h, which seems to work but isn't exactly smooth. Is there
> another way around this, or has linking C MPI code with C++ never come
> up before?

Just to be clear: this isn't a linking issue; it's a compiling issue.  

As Lisandro noted, it's probably best to include <mpi.h> separately, outside 
of your <c-code.h> file.

Or, you can make your <c-code.h> file safe for C++ by doing something like 
this in c-code.h:

#include <mpi.h>

#ifdef __cplusplus
extern "C" {
#endif
...all your C declarations...
#ifdef __cplusplus
}
#endif

This is probably preferable because then your <c-code.h> is safe for both C and 
C++, and you keep <mpi.h> contained inside it (presumably preserving some 
abstraction barriers in your code by keeping the MPI prototypes bundled with 
<c-code.h>).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Brock Palen
Ah ok, I put it there just because the user couldn't read that from my home 
space, and never even thought of that.  gahhh.

Thanks,

BTW I tried joining the padb mailing list.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Sep 1, 2010, at 6:11 PM, Ashley Pittman wrote:

> 
> padb as a binary (it's a Perl script) needs to exist on all nodes, as it 
> calls orterun on itself.  Try installing it to a shared directory or copying 
> padb to /tmp on every node.
> 
> To access the message queues, padb needs a compiled helper program, which is 
> installed in $PREFIX/lib, so I would recommend rebuilding padb with a prefix 
> on an NFS-shared directory.  I can help you more with this if required.
> 
> Ashley,
> 
> On 1 Sep 2010, at 23:01, Brock Palen wrote:
> 
>> We have ddt, but we do not have licenses to attach to the number of cores 
>> these jobs run at.
>> 
>> I tried padb, but it fails.
>> 
>> Example:
>> 
>> ssh to root node for running MPI job:
>> /tmp/padb -Q -a
>> 
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
>> Communication retries exceeded.  Can not communicate with peer
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
>> file util/comm/comm.c at line 62
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
>> file orte-ps.c at line 799
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
>> Communication retries exceeded.  Can not communicate with peer
>> einner: 
>> --
>> einner: orterun was unable to launch the specified application as it could 
>> not access
>> einner: or execute an executable:
>> Unexpected EOF from Inner stdout (connecting)
>> Unexpected EOF from Inner stderr (connecting)
>> Unexpected exit from parallel command (state=connecting)
>> Bad exit code from parallel command (exit_code=131)
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Brock Palen
Ashley, I'm still having trouble using padb with openmpi/1.4.2:

[dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
Communication retries exceeded.  Can not communicate with peer
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
file orte-ps.c at line 799
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
Communication retries exceeded.  Can not communicate with peer
No active jobs could be found for user 'dianawon'


The job is running.  I get this error running just orte-ps.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985







Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Ashley Pittman

On 2 Sep 2010, at 15:56, Brock Palen wrote:

> Ashley, I'm still having trouble using padb with openmpi/1.4.2:
> 
> [dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> No active jobs could be found for user 'dianawon'
> 
> The job is running.  I get this error running just orte-ps.

If orte-ps isn't running correctly then there is very little padb can do.  If 
that is the case, try using the "mpirun" resource manager interface rather than 
"orte".  This will cause padb to use the MPIR interface and try to get the 
information directly from the mpirun process before launching itself via pdsh.  
It doesn't scale as well as the orte integration (pdsh eventually runs out of 
file descriptors) but it is more generic and might get you to something that 
works.  If your job spans more than 32 nodes you may need to set the FANOUT 
variable for pdsh to work.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Ashley Pittman
On 1 Sep 2010, at 23:32, Jaison Mulerikkal wrote:

> Hi,
> 
> I am getting interested in this thread.
> 
> I'm looking for a solution where I can redirect a task/message (MPI_Send) 
> that is queued at a particular process (say rank 1) to another process (say 
> rank 2) if the queue at rank 1 is getting long.
> 
> How can I do it?
> 
> First of all, I need to know the queue length at a particular process (rank 
> 1) at a particular instant.  How can I use padb to get that info?
> 
> Then, on the basis of that info, I would 'send' some (queued-up) messages 
> from rank 1 to some other process (say rank 2) that is relatively free.  Is 
> that possible?


The tools being discussed are for querying the state of message queues within a 
parallel job from outside of that job and are not suitable for the type of 
introspection you are talking about.

It sounds like you are looking for some kind of shared receive queue that 
multiple ranks can pull messages off.  I can't think of anything in MPI that 
would allow this kind of functionality short of having an RTS/CTS protocol in 
the application layer.  The easiest approach might be to have a single rank 
receive all messages, keep them in a queue, and then use MPI_Ssend() to forward 
messages to your "consumer" ranks.  Substitute threads for ranks in the above 
text as you feel is appropriate.
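
As a rough illustration of the single-rank approach, here is a minimal sketch 
in C of only the queue that the forwarding rank would keep.  It is an 
assumption-laden outline, not a tested MPI program: the names work_item, 
work_queue, queue_init, queue_push, queue_pop and QUEUE_CAP are all 
hypothetical, the payload is a stand-in integer, and the MPI calls are only 
named in comments at the points where they would go.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64   /* hypothetical fixed capacity for this sketch */

/* One queued message; a real job would store the received buffer here. */
typedef struct {
    int source_rank;   /* rank that produced the message */
    int payload;       /* stand-in for the message body */
} work_item;

/* Fixed-size FIFO (ring buffer) owned by the single forwarding rank. */
typedef struct {
    work_item items[QUEUE_CAP];
    size_t head;       /* index of the oldest queued item */
    size_t count;      /* number of items currently queued */
} work_queue;

/* Reset the queue to empty. */
void queue_init(work_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Append an item; returns 0 on success, -1 if the queue is full.
 * The forwarding rank would call this after each MPI_Recv from a producer. */
int queue_push(work_queue *q, work_item item)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAP)
        return -1;
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    return 0;
}

/* Remove the oldest item into *out; returns 0 on success, -1 if empty.
 * The forwarding rank would then pass *out to a consumer rank,
 * for example with MPI_Ssend. */
int queue_pop(work_queue *q, work_item *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0)
        return -1;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}
```

A synchronous send fits here because MPI_Ssend does not complete until the 
matching receive has started, so the forwarding rank knows a consumer has 
actually picked the message up before it pops the next one.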

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




[OMPI users] spin-wait backoff

2010-09-02 Thread David Singleton


I'm sure this has been discussed before but having watched hundreds of
thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
be keen to know why there isn't some sort of "spin-wait backoff" option.
For example, a way to specify spin-wait for x seconds/cycles/iterations
then backoff to lighter and lighter cpu usage.  At least that way, hung
jobs would become self-evident.

Maybe there is already some way of doing this?
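
To make the idea concrete, here is a minimal application-level sketch in C of 
"spin for a while, then back off".  It is not an Open MPI option or internal 
mechanism: SPIN_ITERS, FIRST_SLEEP_NS, MAX_SLEEP_NS, backoff_ns, 
wait_with_backoff and flag_is_set are all hypothetical names chosen for 
illustration, and the sketch assumes a POSIX system with nanosleep().  In an 
MPI code the predicate could plausibly wrap a call such as MPI_Test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() on strict compilers */
#endif
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical tuning knobs; none of these are Open MPI parameters. */
#define SPIN_ITERS     1000L        /* busy-poll this many times first */
#define FIRST_SLEEP_NS 1000L        /* first sleep: 1 microsecond */
#define MAX_SLEEP_NS   100000000L   /* cap each sleep at 100 milliseconds */

/* Sleep length in nanoseconds to use after failed poll number `iter`
 * (0-based).  Returns 0 during the spin phase, then doubles with each
 * further poll until it reaches the cap. */
long backoff_ns(long iter)
{
    assert(iter >= 0);
    if (iter < SPIN_ITERS)
        return 0;
    long ns = FIRST_SLEEP_NS;
    for (long i = SPIN_ITERS; i < iter && ns < MAX_SLEEP_NS; i++)
        ns *= 2;
    return ns < MAX_SLEEP_NS ? ns : MAX_SLEEP_NS;
}

/* Poll done(arg) until it returns true, sleeping between polls as
 * dictated by backoff_ns().  Returns the number of failed polls. */
long wait_with_backoff(bool (*done)(void *), void *arg)
{
    long iter = 0;
    while (!done(arg)) {
        long ns = backoff_ns(iter++);
        if (ns > 0) {
            struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
            nanosleep(&ts, NULL);
        }
    }
    return iter;
}

/* Example predicate: true once the int pointed to by arg is nonzero. */
bool flag_is_set(void *arg)
{
    return *(volatile int *)arg != 0;
}
```

With these values a process that never makes progress ends up polling about 
ten times per second instead of spinning flat out, so a hung job would show 
up as nearly idle in ordinary CPU monitoring.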

Thanks,
David



Re: [OMPI users] spin-wait backoff

2010-09-02 Thread Douglas Guptill
Hi David:

On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>
> I'm sure this has been discussed before but having watched hundreds of
> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
> be keen to know why there isn't some sort of "spin-wait backoff" option.
> For example, a way to specify spin-wait for x seconds/cycles/iterations
> then backoff to lighter and lighter cpu usage.  At least that way, hung
> jobs would become self-evident.
>
> Maybe there is already some way of doing this?

For my solution to this, see

  http://www.open-mpi.org/community/lists/users/2010/07/13731.php

HTH,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada