Re: [OMPI users] Problem including C MPI code from C++ using C linkage
On Aug 31, 2010, at 5:39 PM, Patrik Jonsson wrote:

> It seems a bit presumptuous of mpi.h to just include mpicxx.h just
> because __cplusplus is defined, since that makes it impossible to link
> C MPI code from C++.

The MPI standard requires that <mpi.h> work in both C and C++ applications. It also requires that <mpi.h> include all the C++ binding prototypes when relevant. Hence, there's not much we can do here.

> I've had to resort to something like
>
> #ifdef __cplusplus
> #undef __cplusplus
> #include <mpi.h>
> #define __cplusplus
> #else
> #include <mpi.h>
> #endif

As you noted, that doesn't seem like a good idea.

> in c-code.h, which seems to work but isn't exactly smooth. Is there
> another way around this, or has linking C MPI code with C++ never come
> up before?

Just to be clear: this isn't a linking issue; it's a compiling issue.

As Lisandro noted, it's probably best to separate <mpi.h> outside of your <c-code.h> file. Or, you can make your <c-code.h> file be safe for C++ by doing something like this in c-code.h:

#include <mpi.h>

#ifdef __cplusplus
extern "C" {
#endif

...all your C declarations...

#ifdef __cplusplus
}
#endif

This is probably preferable because then your <c-code.h> is safe for both C and C++, and you keep <mpi.h> contained inside it (presumably preserving some abstraction barriers in your code by keeping the MPI prototypes bundled with <c-code.h>).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] simplest way to check message queues
Ah ok, I put it there just because the user couldn't read that from my home space, and never even thought of that. gahhh.

Thanks,

BTW I tried joining the padb mailing list.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Sep 1, 2010, at 6:11 PM, Ashley Pittman wrote:

> padb as a binary (it's a perl script) needs to exist on all nodes as it calls
> orterun on itself, try installing it to a shared directory or copying padb to
> /tmp on every node.
>
> To access the message queues padb needs a compiled helper program which is
> installed in $PREFIX/lib so I would recommend re-building padb giving it a
> prefix of a NFS shared directory. I can help you more with this if required.
>
> Ashley,
>
> On 1 Sep 2010, at 23:01, Brock Palen wrote:
>
>> We have ddt, but we do not have licenses to attach to the number of cores
>> these jobs run at.
>>
>> I tried padb, but it fails,
>>
>> Example:
>>
>> ssh to root node for running MPI job:
>> /tmp/padb -Q -a
>>
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
>> Communication retries exceeded. Can not communicate with peer
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in
>> file util/comm/comm.c at line 62
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in
>> file orte-ps.c at line 799
>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
>> Communication retries exceeded. Can not communicate with peer
>> einner:
>> --
>> einner: orterun was unable to launch the specified application as it could
>> not access
>> einner: or execute an executable:
>> Unexpected EOF from Inner stdout (connecting)
>> Unexpected EOF from Inner stderr (connecting)
>> Unexpected exit from parallel command (state=connecting)
>> Bad exit code from parallel command (exit_code=131)
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] simplest way to check message queues
Ashley, I'm still having trouble using padb with openmpi/1.4.2:

[dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
No active jobs could be found for user 'dianawon'

The job is running; I get this error when running just orte-ps.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
Re: [OMPI users] simplest way to check message queues
On 2 Sep 2010, at 15:56, Brock Palen wrote:

> Ashly still having trouble using padb with openmpi/1.4.2
>
> [dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> No active jobs could be found for user 'dianawon'
>
> The job is running, I get this error running just orte-ps,

If orte-ps isn't running correctly then there is very little padb can do. If that is the case, try using the "mpirun" resource manager interface rather than "orte"; this will cause padb to use the MPIR interface and try to get the information directly from the mpirun process before launching itself via pdsh. It doesn't scale as well as the orte integration (pdsh runs out of file descriptors eventually) but it is more generic and might get you to somewhere that works. If your job spans more than 32 nodes you may need to set the FANOUT variable for pdsh to work.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] simplest way to check message queues
On 1 Sep 2010, at 23:32, Jaison Mulerikkal wrote:

> Hi,
>
> I am getting interested in this thread.
>
> I'm looking for some solutions, where I can redirect a task/message
> (MPI_Send) to a particular process (say rank 1), which is in a queue (at rank
> 1) to another process (say rank 2), if the queue is longer at rank 1.
>
> How can I do it?
>
> First of all, I need to know the queue length at a particular process (rank
> 1) at a particular instant. How can I use padb to get that info?
>
> Then on the basis of that info 'send' some (queued up) messages (from rank
> 1) to some other process (say rank 2) which are relatively free. Is that
> possible?

The tools being discussed are for querying the state of message queues within a parallel job from outside of that job and are not suitable for the type of introspection you are talking about.

It sounds like you are looking for some kind of shared receive queue which multiple ranks can pull messages off. I can't think of anything in MPI that would allow this kind of functionality short of having an RTS/CTS protocol in the application layer. The easiest might be to have a single rank receive all messages and keep them in a queue and then use MPI_Ssend() to forward messages to your "consumer" ranks. Substitute ranks for threads in the above text as you feel is appropriate.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
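
The single-receiver suggestion above could be sketched roughly as follows. This is only an illustration in plain C, not code from padb or Open MPI: the names msg_t, msg_queue_t, queue_push, queue_pop and next_consumer, the fixed QUEUE_CAP, and the round-robin choice of consumer are all assumptions made for the example. The MPI calls themselves are left out so the block compiles without an MPI installation; in a real job the dispatcher rank would presumably fill the queue from MPI_Recv() with MPI_ANY_SOURCE and drain it with MPI_Ssend().

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* hypothetical fixed capacity for this sketch */

/* One queued message; "payload" stands in for the real message body. */
typedef struct {
    int source;   /* rank the message originally came from */
    int payload;
} msg_t;

/* Fixed-size FIFO ring buffer kept by the dispatcher rank. */
typedef struct {
    msg_t  items[QUEUE_CAP];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of messages currently queued */
} msg_queue_t;

void queue_init(msg_queue_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Append a message; returns 1 on success, 0 if the queue is full.
 * A dispatcher would call this after each receive (assumed: MPI_Recv). */
int queue_push(msg_queue_t *q, msg_t m)
{
    if (q->count == QUEUE_CAP)
        return 0;
    q->items[(q->head + q->count) % QUEUE_CAP] = m;
    q->count++;
    return 1;
}

/* Remove the oldest message into *out; returns 1, or 0 if the queue is empty.
 * A dispatcher would forward *out to a consumer (assumed: MPI_Ssend). */
int queue_pop(msg_queue_t *q, msg_t *out)
{
    if (q->count == 0)
        return 0;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 1;
}

/* Round-robin choice among consumer ranks
 * first_consumer .. first_consumer + n_consumers - 1.
 * *cursor must start at 0 and is advanced on every call. */
int next_consumer(int *cursor, int first_consumer, int n_consumers)
{
    assert(n_consumers > 0);
    int rank = first_consumer + *cursor;
    *cursor = (*cursor + 1) % n_consumers;
    return rank;
}
```

With this shape, q.count is the queue length the original question asked about, and it is visible to the application itself rather than to an external tool. A synchronous send returns only after the consumer has started to receive the message, so a busy consumer would stall the dispatcher instead of building up a hidden backlog; whether round-robin is the right policy depends on the application.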
[OMPI users] spin-wait backoff
I'm sure this has been discussed before, but having watched hundreds of thousands of CPU hours being wasted by difficult-to-detect hung jobs, I'd be keen to know why there isn't some sort of "spin-wait backoff" option. For example, a way to specify spin-wait for x seconds/cycles/iterations and then back off to lighter and lighter CPU usage. At least that way, hung jobs would become self-evident.

Maybe there is already some way of doing this?

Thanks,
David
Re: [OMPI users] spin-wait backoff
Hi David:

On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:

> I'm sure this has been discussed before but having watched hundreds of
> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
> be keen to know why there isn't some sort of "spin-wait backoff" option.
> For example, a way to specify spin-wait for x seconds/cycles/iterations
> then backoff to lighter and lighter cpu usage. At least that way, hung
> jobs would become self-evident.
>
> Maybe there is already some way of doing this?

For my solution to this, see
http://www.open-mpi.org/community/lists/users/2010/07/13731.php

HTH,
Douglas.
--
  Douglas Guptill                       voice: 902-461-9749
  Research Assistant, LSC 4640          email: douglas.gupt...@dal.ca
  Oceanography Department               fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada
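
The "spin for a while, then back off" idea raised in this thread could look something like the sketch below. It is not an Open MPI option and not necessarily the approach in the linked post: the names backoff_ns, wait_with_backoff and poll_fn, and the three tuning constants, are hypothetical. The completion check is passed in as a callback; in an MPI program it might wrap a non-blocking test such as MPI_Test() or MPI_Iprobe(), which is an assumption here so that the block builds without MPI.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical tuning knobs; none of these are Open MPI parameters. */
#define SPIN_ITERS     1000u        /* polls made at full speed before backing off */
#define FIRST_SLEEP_NS 1000L        /* first sleep: 1 microsecond */
#define MAX_SLEEP_NS   100000000L   /* sleep cap: 100 milliseconds */

/* Sleep length in nanoseconds before the next poll: zero while still
 * spinning, then doubling from FIRST_SLEEP_NS up to MAX_SLEEP_NS. */
long backoff_ns(unsigned attempt)
{
    if (attempt < SPIN_ITERS)
        return 0;
    unsigned steps = attempt - SPIN_ITERS;
    long ns = FIRST_SLEEP_NS;
    for (unsigned i = 0; i < steps && ns < MAX_SLEEP_NS; i++)
        ns *= 2;
    return ns < MAX_SLEEP_NS ? ns : MAX_SLEEP_NS;
}

/* Callback that returns nonzero once the awaited event has happened. */
typedef int (*poll_fn)(void *ctx);

/* Poll done(ctx) until it reports completion, sleeping between polls
 * according to backoff_ns(). Returns the number of unsuccessful polls. */
unsigned wait_with_backoff(poll_fn done, void *ctx)
{
    assert(done != NULL);
    unsigned attempt = 0;
    while (!done(ctx)) {
        long ns = backoff_ns(attempt);
        if (ns > 0) {
            struct timespec ts = {
                .tv_sec  = ns / 1000000000L,
                .tv_nsec = ns % 1000000000L
            };
            nanosleep(&ts, NULL);
        }
        if (attempt < UINT_MAX)
            attempt++;
    }
    return attempt;
}
```

With these constants a waiting process spins for the first 1000 polls and, after a few more, sleeps 100 ms between checks, so a hung job would show up as nearly idle in CPU accounting. The cost is added latency of up to the sleep cap once a long-delayed message finally arrives.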