[OMPI users] OpenMpi-java Examples
Hi, I am new to OpenMPI. I have installed the Java bindings of OpenMPI and am running some samples on the cluster. I am interested in some samples using the THREAD_SERIALIZED and THREAD_FUNNELED fields in OpenMPI. Please provide me some samples. -- Lokah samasta sukhinobhavanthu Thanks, Madhurima
Re: [OMPI users] Question about '--mca btl tcp,self'
To add on to what Ralph said:

1. There are two different message passing paths in OMPI: - "OOB" (out of band): used for control messages - "BTL" (byte transfer layer): used for MPI traffic (there are actually others, but these seem to be the relevant 2 for your setup)

2. If you don't specify which OOB interfaces to use, OMPI will (basically) just pick one. It doesn't really matter which one it uses; the OOB channel doesn't use too much bandwidth, and is mostly just during startup and shutdown. The one exception to this is stdout/stderr routing. If your MPI app writes to stdout/stderr, this also uses the OOB path. So if you output a LOT to stdout, then the OOB interface choice might matter.

3. If you don't specify which MPI interfaces to use, OMPI will basically find the "best" set of interfaces and use those. IP interfaces are always rated less than OS-bypass interfaces (e.g., verbs/IB). Or, as you noticed, you can give a comma-delimited list of BTLs to use. OMPI will then use -- at most -- exactly those BTLs, but definitely no others. Each BTL typically has an additional parameter or parameters that can be used to specify which interfaces to use for the network interface type that that BTL uses. For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use.

Also, note that you seem to have missed a BTL: sm (shared memory). sm is the preferred BTL to use for same-server communication. It is much faster than both the TCP loopback device (which OMPI excludes by default, BTW, which is probably why you got reachability errors when you specified "--mca btl tcp,self") and the verbs (i.e., "openib") BTL for same-server communication.

4. If you don't specify anything, OMPI usually picks the best thing for you. In your case, it'll probably be equivalent to: mpirun --mca btl openib,sm,self ... And the control messages will flow across one of your IP interfaces.

5. If you want to be specific about which one it uses, you can specify oob_tcp_if_include. For example: mpirun --mca oob_tcp_if_include eth0 ...

Make sense?

On Mar 15, 2014, at 1:18 AM, Jianyu Liu wrote:
>> On Mar 14, 2014, at 10:16:34 AM, Jeff Squyres wrote:
>>
>>> On Mar 14, 2014, at 10:11 AM, Ralph Castain wrote:
>>> 1. If specified '--mca btl tcp,self', which interface will the application run on: the GigE adapter OR the OpenFabrics interface in IP over IB mode (just like a high performance GigE adapter)?
>>>
>>> Both - ip over ib looks just like an Ethernet adaptor
>>
>> To be clear: the TCP BTL will use all TCP interfaces (regardless of underlying physical transport). Your GigE adapter and your IB adapter both present IP interfaces to the OS, and both support TCP. So the TCP BTL will use them, because it just sees the TCP/IP interfaces.
>
> Thanks for your kind input.
>
> Please see if I have understood correctly.
>
> Assume there are two networks:
>
> Gigabit Ethernet
>    eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0
>
> InfiniBand network
>    ib0 : 172.20.[1-22].[1-4] / 255.255.0.0
>
> 1. If specified '--mca btl tcp,self'
>
>    The control information (such as setup and teardown) is routed to and passed by Gigabit Ethernet in TCP/IP mode.
>    The MPI messages are routed to and passed by the InfiniBand network in IP over IB mode.
>    On the same machine, the TCP loopback device will be used for passing control and MPI messages.
> 2. If specified '--mca btl tcp,self --mca btl_tcp_if_include ib0'
>
>    Both the control information (such as setup and teardown) and the MPI messages are routed to and passed by the InfiniBand network in IP over IB mode.
>    On the same machine, the TCP loopback device will be used for passing control and MPI messages.
>
> 3. If specified '--mca btl openib,self'
>
>    The control information (such as setup and teardown) is routed to and passed by the InfiniBand network in IP over IB mode.
>    The MPI messages are routed to and passed by the InfiniBand network in RDMA mode.
>    On the same machine, the TCP loopback device will be used for passing control and MPI messages.
>
> 4. If no 'mca btl' parameters are specified
>
>    The control information (such as setup and teardown) is routed to and passed by Gigabit Ethernet in TCP/IP mode.
>    The MPI messages are routed to and passed by the InfiniBand network in RDMA mode.
>    On the same machine, the shared memory (sm) BTL will be used for passing control and MPI messages.
>
> Appreciate your kind input.
>
> Jianyu
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Compiling Open MPI 1.7.4 using PGI 14.2 and Mellanox HCOLL enabled
Ralph -- it seems to be picking up "-pthread" from libslurm.la (i.e., outside of the OMPI tree), which pgcc doesn't seem to like. Another solution might be to (temporarily?) remove the "-pthread" from libslurm.la (which is a text file that you can edit). Then OMPI shouldn't pick up that flag, and building should be ok. On Mar 16, 2014, at 11:50 AM, Ralph Castain wrote: > If you are running on a Slurm-managed cluster, it won't be happy without > configuring --with-slurm - you won't see the allocation, for one. > > Is it just the --with-slurm option that causes the problem? In other words, > if you remove the rest of those options (starting --with-hcoll and going down > that config line) and leave --with-slurm, does it build? > > On Mar 16, 2014, at 8:22 AM, Filippo Spiga wrote: > >> Hi Jeff, Hi Ake, >> >> removing --with-slurm and keeping --with-hcoll seems to work. The error >> disappears at compile time, I have not yet tried to run a job. I can copy >> config.log and the make.log is needed. >> >> Cheers, >> F >> >> On Mar 11, 2014, at 4:48 PM, Jeff Squyres (jsquyres) >> wrote: >>> On Mar 11, 2014, at 11:22 AM, Åke Sandgren >>> wrote: >>> >> ../configure CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 >> --prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.2_cuda-6.0RC >> --enable-mpirun-prefix-by-default --with-hcoll=$HCOLL_DIR >> --with-fca=$FCA_DIR --with-mxm=$MXM_DIR --with-knem=$KNEM_DIR >> --with-slurm=/usr/local/Cluster-Apps/slurm >> --with-cuda=$CUDA_INSTALL_PATH >> >> >> At some point the compile process fails with this error: >> >> make[2]: Leaving directory >> `/home/fs395/archive/openmpi-1.7.4/build/ompi/mca/coll/hierarch' >> Making all in mca/coll/hcoll >> make[2]: Entering directory >> `/home/fs395/archive/openmpi-1.7.4/build/ompi/mca/coll/hcoll' >> CC coll_hcoll_module.lo >> CC coll_hcoll_component.lo >> CC coll_hcoll_rte.lo >> CC coll_hcoll_ops.lo >> CCLD mca_coll_hcoll.la >> pgcc-Error-Unknown switch: -pthread You have to remove the -pthread from inherited_linker_flags= in libpmi.la libslurm.la from your slurm build. >>> >>> With the configure line given above, I don't think he should be linking >>> against libslurm. >>> >>> But I wonder if the underlying issue is actually correct: perhaps the >>> inherited_linker_flags from libhcoll.la has -pthreads in it. >> >> >> -- >> Mr. Filippo SPIGA, M.Sc. >> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga >> >> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert >> >> * >> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL >> and may be privileged or otherwise protected from disclosure. The contents >> are not to be disclosed to anyone other than the addressee. Unauthorized >> recipients are requested to preserve this confidentiality and to advise the >> sender immediately of any error in transmission." >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] efficient strategy with temporary message copy
On Mar 16, 2014, at 10:24 PM, christophe petit wrote:

> I am studying the optimization strategy when the number of communication functions in a code is high.
>
> My courses on MPI say two things for optimization which are contradictory:
>
> 1*) You have to use temporary message copy to allow non-blocking sending and uncouple the sending and receiving

There's a lot of schools of thought here, and the real answer is going to depend on your application.

If the message is "short" (and the exact definition of "short" depends on your platform -- it varies depending on your CPU, your memory, your CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce buffer is typically a good idea. That lets you keep using your "real" buffer and not have to wait until communication is done.

For "long" messages, the equation is a bit different. If "long" isn't "enormous", you might be able to have N buffers available, and simply work on 1 of them at a time in your main application and use the others for ongoing non-blocking communication. This is sometimes called "shadow" copies, or "ghost" copies. Such shadow copies are most useful when you receive something each iteration. For example, something like this:

buffer[0] = malloc(...);
buffer[1] = malloc(...);
current = 0;
while (still_doing_iterations) {
    MPI_Irecv(buffer[current], ..., &req);
    // work on buffer[1 - current]
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    current = 1 - current;
}

You get the idea.

> 2*) Avoid using temporary message copy because the copy will add extra cost on execution time.

It will, if the memcpy cost is significant (especially compared to the network time to send it). If the memcpy is small/insignificant, then don't worry about it. You'll need to determine where this crossover point is, however.

Also keep in mind that MPI and/or the underlying network stack will likely be doing these kinds of things under the covers for you. Indeed, if you send short messages -- even via MPI_SEND -- it may return "immediately", indicating that MPI says it's safe for you to use the send buffer. But that doesn't mean that the message has even actually left the current server and gone out onto the network yet (i.e., some other layer below you may have just done a memcpy because it was a short message, and the processing/sending of that message is still ongoing).

> And then, we are advised to do:
>
> - replace MPI_SEND by MPI_SSEND (synchronous blocking send): it is said that execution time is divided by a factor of 2

This very, very much depends on your application. MPI_SSEND won't return until the receiver has started to receive the message. For some communication patterns, putting in this additional level of synchronization is helpful -- it keeps all MPI processes in tighter synchronization and you might experience less jitter, etc. And therefore overall execution time is faster. But for others, it adds unnecessary delay. I'd say it's an over-generalization that simply replacing MPI_SEND with MPI_SSEND always reduces execution time by 2.

> - use MPI_ISSEND and MPI_IRECV with the MPI_WAIT function to synchronize (synchronous non-blocking send): it is said that execution time is divided by a factor of 3

Again, it depends on the app. Generally, non-blocking communication is better -- *if your app can effectively overlap communication and computation*. If your app doesn't take advantage of this overlap, then you won't see such performance benefits.
For example:

MPI_Isend(buffer, ..., &req);
MPI_Wait(&req, ...);

Technically, the above uses ISEND and WAIT... but it's actually probably going to be *slower* than using MPI_SEND because you've made multiple function calls with no additional work between the two -- so the app didn't effectively overlap the communication with any local computation. Hence: no performance benefit.

> So what's the best optimization? Do we have to use temporary message copy or not, and if yes, in which cases?

As you can probably see from my text above, the answer is: it depends. :-)

-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
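To make the shadow-buffer pattern above concrete, here is a minimal self-contained C sketch (not from the original thread): rank 0 alternates between two receive buffers so that work on one buffer overlaps the outstanding MPI_Irecv on the other. The message length, iteration count, and the work_on() placeholder are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

#define N          1024   /* illustrative message length */
#define ITERATIONS 10     /* illustrative iteration count */

/* Placeholder for the real per-iteration computation. */
static void work_on(double *buf) { (void)buf; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Two "shadow" buffers: one being filled by MPI, one being worked on. */
    double *buffer[2];
    buffer[0] = calloc(N, sizeof(double));
    buffer[1] = calloc(N, sizeof(double));

    if (size >= 2 && rank == 0) {
        MPI_Request req;
        int current = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            /* Post the receive for this iteration into buffer[current]... */
            MPI_Irecv(buffer[current], N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ...and overlap it with work on the previously received buffer. */
            if (i > 0) {
                work_on(buffer[1 - current]);
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            current = 1 - current;
        }
        work_on(buffer[1 - current]);   /* the last message received */
    } else if (size >= 2 && rank == 1) {
        for (int i = 0; i < ITERATIONS; i++) {
            MPI_Send(buffer[0], N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    free(buffer[0]);
    free(buffer[1]);
    MPI_Finalize();
    return 0;
}

The same idea extends to more than two buffers if one outstanding receive is not enough to hide the communication time.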
Re: [OMPI users] Question about '--mca btl tcp,self'
On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote: To add on to what Ralph said: 1. There are two different message passing paths in OMPI: - "OOB" (out of band): used for control messages - "BTL" (byte transfer layer): used for MPI traffic (there are actually others, but these seem to be the relevant 2 for your setup) 2. If you don't specify which OOB interfaces to use OMPI will (basically) just pick one. It doesn't really matter which one it uses; the OOB channel doesn't use too much bandwidth, and is mostly just during startup and shutdown. The one exception to this is stdout/stderr routing. If your MPI app writes to stdout/stderr, this also uses the OOB path. So if you output a LOT to stdout, then the OOB interface choice might matter.

Hi All

Not trying to hijack Jianyu's very interesting and informative questions and thread, I have two questions and one note about it. I promise to shut up after this.

Is the interface that OOB picks and uses somehow related to how the host/node names listed in a "hostfile" (or in the mpiexec command -host option, or in the Torque/SGE/Slurm node file) are resolved into IP addresses (via /etc/hosts, DNS or other mechanism)? In other words, does OOB pick the interface associated to the IP address that resolves the specific node name, or does OOB have its own will and pick whatever interface it wants?

At some early point during startup I suppose mpiexec needs to touch base for the first time with each node, and I would guess the nodes' IP address (and the corresponding interface) plays a role then. Does OOB piggy-back on that same interface to do its job?

3. If you don't specify which MPI interfaces to use, OMPI will basically find the "best" set of interfaces and use those. IP interfaces are always rated less than OS-bypass interfaces (e.g., verbs/IB).

In a node outfitted with more than one InfiniBand interface, can one choose which one OMPI is going to use (say, if one wants to reserve the other IB interface for IO)? In other words, is there verbs/rdma syntax equivalent to --mca btl_tcp_if_include and to --mca oob_tcp_if_include? [Perhaps something like --mca btl_openib_if_include ...?] Forgive me if this question doesn't make sense, for maybe on its guts verbs/rdma already has a greedy policy of using everything available, but I don't know anything about it.

Or, as you noticed, you can give a comma-delimited list of BTLs to use. OMPI will then use -- at most -- exactly those BTLs, but definitely no others. Each BTL typically has an additional parameter or parameters that can be used to specify which interfaces to use for the network interface type that that BTL uses. For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use. Also, note that you seem to have missed a BTL: sm (shared memory). sm is the preferred BTL to use for same-server communication.

This may be because several FAQs skip the sm BTL, even when it would be an appropriate/recommended choice to include in the BTL list. For instance:

http://www.open-mpi.org/faq/?category=all#selecting-components
http://www.open-mpi.org/faq/?category=all#tcp-selection

The command line examples with an ellipsis "..." don't actually exclude the use of "sm", but IMHO are too vague and somewhat misleading. I think this issue was reported/discussed before in the list, but somehow the FAQ were not fixed.
Thank you, Gus Correa It is much faster than both the TCP loopback device (which OMPI excludes by default, BTW, which is probably why you got reachability errors when you specifying "--mca btl tcp,self") and the verbs (i.e., "openib") BTL for same-server communication. 4. If you don't specify anything, OMPI usually picks the best thing for you. In your case, it'll probably be equivalent to: mpirun --mca btl openib,sm,self ... And the control messages will flow across one of your IP interfaces. 5. If you want to be specific about which one it uses, you can specify oob_tcp_if_include. For example: mpirun --mca oob_tcp_if_include eth0 ... Make sense? On Mar 15, 2014, at 1:18 AM, Jianyu Liu wrote: On Mar 14, 2014, at 10:16:34 AM,Jeff Squyres wrote: On Mar 14, 2014, at 10:11 AM, Ralph Castain wrote: 1. If specified '--mca btl tcp,self', which interface application will run on, use GigE adaper OR use the OpenFabrics interface in IP over IB mode (just like a high performance GigE adapter) ? Both - ip over ib looks just like an Ethernet adaptor To be clear: the TCP BTL will use all TCP interfaces (regardless of underlying physical transport). Your GigE adapter and your IP adapter both present IP interfaces to>the OS, and both support TCP. So the TCP BTL will use them, because it just sees the TCP/IP interfaces. Thanks for your kindly input. Please see if I have understood correctly Assume there are two nework Gigabit Ethernet eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0 InfiniBand network i
Re: [OMPI users] Question about '--mca btl tcp,self'
On Mar 17, 2014, at 9:37 AM, Gus Correa wrote: > On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote: >> To add on to what Ralph said: >> >> 1. There are two different message passing paths in OMPI: >>- "OOB" (out of band): used for control messages >>- "BTL" (byte transfer layer): used for MPI traffic >>(there are actually others, but these seem to be the relevant 2 for your >> setup) >> >> 2. If you don't specify which OOB interfaces > to use OMPI will (basically) just pick one. > It doesn't really matter which one it uses; > the OOB channel doesn't use too much bandwidth, > and is mostly just during startup and shutdown. >> >> The one exception to this is stdout/stderr routing. > If your MPI app writes to stdout/stderr, this also uses the OOB path. > So if you output a LOT to stdout, then the OOB interface choice might matter. > > Hi All > > Not trying to hijack Jianyu's very interesting and informative questions and > thread, I have two questions and one note about it. > I promise to shut up after this. > > Is the interface that OOB picks and uses > somehow related to how the hosts/nodes names listed > in a "hostfile" > (or in the mpiexec command -host option, > or in the Torque/SGE/Slurm node file,) > are resolved into IP addresses (via /etc/hosts, DNS or other mechanism)? > > In other words, does OOB pick the interface associated to the IP address > that resolves the specific node name, or does OOB have its own will and > picks whatever interface it wants? The OOB on each node gets the list of available interfaces from the kernel on that node. When it needs to talk to someone on a remote node, it uses the standard mechanisms to resolve that node name to an IP address *if* it already isn't one - i.e., it checks the provided info to see if it is an IP address, and attempts to resolve the name if not. Once it has an IP address for the remote host, it checks its interfaces to see if one is on the same subnet as the remote IP. If so, then it uses that interface to create the connection. If none of the interfaces share the same subnet as the remote IP, then the OOB picks the first kernel-ordered interface and attempts to connect via that one, in the hope that there is a router in the system capable of passing the connection to the remote subnet. The OOB will cycle across all its interfaces in that manner until one indicates that it was indeed able to connect - if not, then we error out. > > At some early point during startup I suppose mpiexec > needs to touch base first time with each node, > and I would guess the nodes' IP address > (and the corresponding interface) plays a role then. > Does OOB piggy-back that same interface to do its job? Yes - once we establish that connection, we use it for whatever OOB communication is required. > >> >> 3. If you don't specify which MPI interfaces to use, OMPI will basically >> find the > "best" set of interfaces and use those. IP interfaces are always rated less > than > OS-bypass interfaces (e.g., verbs/IB). > > > In a node outfitted with more than one Inifinband interface, > can one choose which one OMPI is going to use (say, if one wants to > reserve the other IB interface for IO)? > > In other words, are there verbs/rdma syntax equivalent to > > --mca btl_tcp_if_include > > and to > > --mca oob_tcp_if_include ? > > [Perhaps something like --mca btl_openib_if_include ...?] 
Yes - exactly as you describe > > Forgive me if this question doesn't make sense, > for maybe on its guts verbs/rdma already has a greedy policy of using > everything available, but I don't know anything about it. > >> >> Or, as you noticed, you can give a comma-delimited list of BTLs to use. > OMPI will then use -- at most -- exactly those BTLs, but definitely no others. > Each BTL typically has an additional parameter or parameters that can be used > to specify which interfaces to use for the network interface type that that > BTL uses. > For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use. >> >> Also, note that you seem to have missed a BTL: sm (shared memory). > sm is the preferred BTL to use for same-server communication. > > This may be because several FAQs skip the sm BTL, even when it would > be an appropriate/recommended choice to include in the BTL list. > For instance: > > http://www.open-mpi.org/faq/?category=all#selecting-components > http://www.open-mpi.org/faq/?category=all#tcp-selection > > The command line examples with an ellipsis "..." don't actually e > xclude the use of "sm", but IMHO are too vague and somewhat misleading. > > I think this issue was reported/discussed before in the list, > but somehow the FAQ were not fixed. I can try to do something about it - largely a question of time :-/ > > Thank you, > Gus Correa > > It is much faster than both the TCP loopback device > (which OMPI excludes by default, BTW, which is probably > why you got reachability errors when you speci
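As an illustration of the subnet-matching step Ralph describes above, the following stand-alone C sketch (only an illustration of the idea, not the actual ORTE code; it assumes a POSIX/Linux system with getifaddrs()) walks the kernel's interface list and reports any local IPv4 interface whose masked network address matches that of a given remote IP:

#include <stdio.h>
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>

/* Print local IPv4 interfaces that share a subnet with remote_ip. */
int main(int argc, char **argv)
{
    const char *remote_ip = (argc > 1) ? argv[1] : "192.168.1.10";  /* example address */
    struct in_addr remote;
    if (inet_pton(AF_INET, remote_ip, &remote) != 1) {
        fprintf(stderr, "bad address: %s\n", remote_ip);
        return 1;
    }

    struct ifaddrs *ifap, *ifa;
    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }

    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_netmask == NULL ||
            ifa->ifa_addr->sa_family != AF_INET) {
            continue;
        }
        struct in_addr local = ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
        struct in_addr mask  = ((struct sockaddr_in *)ifa->ifa_netmask)->sin_addr;
        /* Same subnet if the masked network portions are identical. */
        if ((local.s_addr & mask.s_addr) == (remote.s_addr & mask.s_addr)) {
            printf("%s is on the same subnet as %s\n", ifa->ifa_name, remote_ip);
        }
    }

    freeifaddrs(ifap);
    return 0;
}

If no interface matches, the OOB falls back to trying interfaces in kernel order, as Ralph explains above.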
Re: [OMPI users] Question about '--mca btl tcp,self'
On Mar 17, 2014, at 12:37 PM, Gus Correa wrote:

> In other words, does OOB pick the interface associated to the IP address that resolves the specific node name, or does OOB have its own will and picks whatever interface it wants?

I'll let Ralph contribute the detail here, but it's basically the latter: the OOB has its own will and picks whatever interface it wants.

But keep in mind that this is true for ALL OMPI communications (including MPI communications): the hostfile is unrelated to what interfaces are used. Early MPI implementations back in the 90's overloaded the use of the hostfile with which network interfaces were used. Open MPI has never used that approach: we have always used the hostfile (and --host, etc.) as simply a mechanism to specify which servers/compute nodes/whatever on which to run. Selection of interfaces to use for control messages and MPI messages is determined separately.

> In a node outfitted with more than one InfiniBand interface, can one choose which one OMPI is going to use (say, if one wants to reserve the other IB interface for IO)?

Yes. Each BTL typically has its own MCA param for this kind of thing. You might want to troll through ompi_info output to see if there's anything of interest to you. For example:

ompi_info --param btl openib --level 9

(the "--level 9" option is new somewhere during the 1.7.x series; it will cause a syntax error in the 1.6 series) will show you all the MCA params for the openib BTL. The one you want for the openib BTL is:

mpirun --mca btl_openib_if_include <interfaces>

With the usnic BTL, we allow you to specify interfaces via two different kinds of values:

mpirun --mca btl_usnic_if_include <interfaces>

where interfaces can be:

usnic_X (e.g., usnic_0)
CIDR network address (e.g., 192.168.0.0/16)

>> Also, note that you seem to have missed a BTL: sm (shared memory). sm is the preferred BTL to use for same-server communication.
>
> This may be because several FAQs skip the sm BTL, even when it would be an appropriate/recommended choice to include in the BTL list. For instance:
>
> http://www.open-mpi.org/faq/?category=all#selecting-components

This one seems to be ok. I think the item you're referring to in that entry is an example of the ^ negation operator.

> http://www.open-mpi.org/faq/?category=all#tcp-selection

Fixed. Thanks! -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
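As a programmatic complement to ompi_info: Open MPI 1.7.x should also expose its MCA parameters through the MPI-3 "MPI_T" control-variable interface, so a sketch like the one below can list the openib BTL parameters from C (the buffer sizes and the "btl_openib" filter string are just illustrative choices, not anything mandated by the interface):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int provided, num_cvars;

    /* The MPI_T interface can be used independently of MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num_cvars);

    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name);
        int desc_len = sizeof(desc);
        /* NULL is permitted for output arguments we do not need. */
        MPI_T_cvar_get_info(i, name, &name_len, NULL, NULL, NULL,
                            desc, &desc_len, NULL, NULL);
        if (strstr(name, "btl_openib") != NULL) {
            printf("%s : %s\n", name, desc);
        }
    }

    MPI_T_finalize();
    return 0;
}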
Re: [OMPI users] efficient strategy with temporary message copy
Thanks Jeff, I understand better the different cases and how to choose as a function of the situation 2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) : > On Mar 16, 2014, at 10:24 PM, christophe petit < > christophe.peti...@gmail.com> wrote: > > > I am studying the optimization strategy when the number of communication > functions in a codeis high. > > > > My courses on MPI say two things for optimization which are > contradictory : > > > > 1*) You have to use temporary message copy to allow non-blocking sending > and uncouple the sending and receiving > > There's a lot of schools of thought here, and the real answer is going to > depend on your application. > > If the message is "short" (and the exact definition of "short" depends on > your platform -- it varies depending on your CPU, your memory, your > CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce > buffer is typically a good idea. That lets you keep using your "real" > buffer and not have to wait until communication is done. > > For "long" messages, the equation is a bit different. If "long" isn't > "enormous", you might be able to have N buffers available, and simply work > on 1 of them at a time in your main application and use the others for > ongoing non-blocking communication. This is sometimes called "shadow" > copies, or "ghost" copies. > > Such shadow copies are most useful when you receive something each > iteration, for example. For example, something like this: > > buffer[0] = malloc(...); > buffer[1] = malloc(...); > current = 0; > while (still_doing_iterations) { > MPI_Irecv(buffer[current], ..., &req); > /// work on buffer[current - 1] > MPI_Wait(req, MPI_STATUS_IGNORE); > current = 1 - current; > } > > You get the idea. > > > 2*) Avoid using temporary message copy because the copy will add extra > cost on execution time. > > It will, if the memcpy cost is significant (especially compared to the > network time to send it). If the memcpy is small/insignificant, then don't > worry about it. > > You'll need to determine where this crossover point is, however. > > Also keep in mind that MPI and/or the underlying network stack will likely > be doing these kinds of things under the covers for you. Indeed, if you > send short messages -- even via MPI_SEND -- it may return "immediately", > indicating that MPI says it's safe for you to use the send buffer. But > that doesn't mean that the message has even actually left the current > server and gone out onto the network yet (i.e., some other layer below you > may have just done a memcpy because it was a short message, and the > processing/sending of that message is still ongoing). > > > And then, we are adviced to do : > > > > - replace MPI_SEND by MPI_SSEND (synchroneous blocking sending) : it is > said that execution is divided by a factor 2 > > This very, very much depends on your application. > > MPI_SSEND won't return until the receiver has started to receive the > message. > > For some communication patterns, putting in this additional level of > synchronization is helpful -- it keeps all MPI processes in tighter > synchronization and you might experience less jitter, etc. And therefore > overall execution time is faster. > > But for others, it adds unnecessary delay. > > I'd say it's an over-generalization that simply replacing MPI_SEND with > MPI_SSEND always reduces execution time by 2. 
> > > - use MPI_ISSEND and MPI_IRECV with MPI_WAIT function to synchronize > (synchroneous non-blocking sending) : it is said that execution is divided > by a factor 3 > > Again, it depends on the app. Generally, non-blocking communication is > better -- *if your app can effectively overlap communication and > computation*. > > If your app doesn't take advantage of this overlap, then you won't see > such performance benefits. For example: > >MPI_Isend(buffer, ..., req); >MPI_Wait(&req, ...); > > Technically, the above uses ISEND and WAIT... but it's actually probably > going to be *slower* than using MPI_SEND because you've made multiple > function calls with no additional work between the two -- so the app didn't > effectively overlap the communication with any local computation. Hence: > no performance benefit. > > > So what's the best optimization ? Do we have to use temporary message > copy or not and if yes, what's the case for ? > > As you can probably see from my text above, the answer is: it depends. :-) > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] OpenMpi-java Examples
Hi Madhurima,

Currently we only have tests which start MPI and check the provided level of thread support:

int provided = MPI.InitThread(args, MPI.THREAD_FUNNELED);
if (provided < MPI.THREAD_FUNNELED) {
    throw new MPIException("MPI_Init_thread returned less "+
                           "than MPI_THREAD_FUNNELED.\n");
}

Regards, Oscar

Quoting madhurima madhunapanthula: Hi, I am new to OpenMPI. I have installed the Java bindings of OpenMPI and am running some samples on the cluster. I am interested in some samples using the THREAD_SERIALIZED and THREAD_FUNNELED fields in OpenMPI. Please provide me some samples. -- Lokah samasta sukhinobhavanthu Thanks, Madhurima

This message was sent using IMP, the Internet Messaging Program.
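For reference, the Java call above wraps the C MPI_Init_thread interface. A minimal C sketch of the same check, requesting MPI_THREAD_SERIALIZED instead of MPI_THREAD_FUNNELED purely as an illustration (this is not an official Open MPI sample), looks like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for THREAD_SERIALIZED and check what the library actually grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    if (provided < MPI_THREAD_SERIALIZED) {
        fprintf(stderr, "MPI_Init_thread returned less than MPI_THREAD_SERIALIZED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    printf("Provided thread support level: %d\n", provided);

    MPI_Finalize();
    return 0;
}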
[OMPI users] Usage of MPI_Win_create with MPI_Comm_Spawn
Hi, Can comm_spawn be used with win_create? For example:

Master process:
---
MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, world_size-1, info, 0, MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
MPI_Win_create(&testval, sizeof(double), 1, MPI_INFO_NULL, everyone, &nwin);

Worker process:
---
MPI_Comm_get_parent(&parent);
if (parent == MPI_COMM_NULL) error("No parent!");
MPI_Comm_remote_size(parent, &size);
if (size != 1) error("Something's wrong with the parent");
MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL, parent, &nwin);

This one fails currently. Am I doing something wrong? It would be great if someone could help me.

Thanks, Ramesh
Re: [OMPI users] efficient strategy with temporary message copy
Also, this presentation might be useful http://extremecomputingtraining.anl.gov/files/2013/07/tuesday-slides2.pdf Thank you, Saliya On Mar 17, 2014 2:18 PM, "christophe petit" wrote: > Thanks Jeff, I understand better the different cases and how to choose as > a function of the situation > > > 2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) : > >> On Mar 16, 2014, at 10:24 PM, christophe petit < >> christophe.peti...@gmail.com> wrote: >> >> > I am studying the optimization strategy when the number of >> communication functions in a codeis high. >> > >> > My courses on MPI say two things for optimization which are >> contradictory : >> > >> > 1*) You have to use temporary message copy to allow non-blocking >> sending and uncouple the sending and receiving >> >> There's a lot of schools of thought here, and the real answer is going to >> depend on your application. >> >> If the message is "short" (and the exact definition of "short" depends on >> your platform -- it varies depending on your CPU, your memory, your >> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce >> buffer is typically a good idea. That lets you keep using your "real" >> buffer and not have to wait until communication is done. >> >> For "long" messages, the equation is a bit different. If "long" isn't >> "enormous", you might be able to have N buffers available, and simply work >> on 1 of them at a time in your main application and use the others for >> ongoing non-blocking communication. This is sometimes called "shadow" >> copies, or "ghost" copies. >> >> Such shadow copies are most useful when you receive something each >> iteration, for example. For example, something like this: >> >> buffer[0] = malloc(...); >> buffer[1] = malloc(...); >> current = 0; >> while (still_doing_iterations) { >> MPI_Irecv(buffer[current], ..., &req); >> /// work on buffer[current - 1] >> MPI_Wait(req, MPI_STATUS_IGNORE); >> current = 1 - current; >> } >> >> You get the idea. >> >> > 2*) Avoid using temporary message copy because the copy will add extra >> cost on execution time. >> >> It will, if the memcpy cost is significant (especially compared to the >> network time to send it). If the memcpy is small/insignificant, then don't >> worry about it. >> >> You'll need to determine where this crossover point is, however. >> >> Also keep in mind that MPI and/or the underlying network stack will >> likely be doing these kinds of things under the covers for you. Indeed, if >> you send short messages -- even via MPI_SEND -- it may return >> "immediately", indicating that MPI says it's safe for you to use the send >> buffer. But that doesn't mean that the message has even actually left the >> current server and gone out onto the network yet (i.e., some other layer >> below you may have just done a memcpy because it was a short message, and >> the processing/sending of that message is still ongoing). >> >> > And then, we are adviced to do : >> > >> > - replace MPI_SEND by MPI_SSEND (synchroneous blocking sending) : it is >> said that execution is divided by a factor 2 >> >> This very, very much depends on your application. >> >> MPI_SSEND won't return until the receiver has started to receive the >> message. >> >> For some communication patterns, putting in this additional level of >> synchronization is helpful -- it keeps all MPI processes in tighter >> synchronization and you might experience less jitter, etc. And therefore >> overall execution time is faster. >> >> But for others, it adds unnecessary delay. 
>> >> I'd say it's an over-generalization that simply replacing MPI_SEND with >> MPI_SSEND always reduces execution time by 2. >> >> > - use MPI_ISSEND and MPI_IRECV with MPI_WAIT function to synchronize >> (synchroneous non-blocking sending) : it is said that execution is divided >> by a factor 3 >> >> Again, it depends on the app. Generally, non-blocking communication is >> better -- *if your app can effectively overlap communication and >> computation*. >> >> If your app doesn't take advantage of this overlap, then you won't see >> such performance benefits. For example: >> >>MPI_Isend(buffer, ..., req); >>MPI_Wait(&req, ...); >> >> Technically, the above uses ISEND and WAIT... but it's actually probably >> going to be *slower* than using MPI_SEND because you've made multiple >> function calls with no additional work between the two -- so the app didn't >> effectively overlap the communication with any local computation. Hence: >> no performance benefit. >> >> > So what's the best optimization ? Do we have to use temporary message >> copy or not and if yes, what's the case for ? >> >> As you can probably see from my text above, the answer is: it depends. >> :-) >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> ___ >> users mail
[OMPI users] another corner case hangup in openmpi-1.7.5rc3
Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.

Condition:
1. allocate some nodes using RM such as TORQUE.
2. request the head node only in executing the job with -host or -hostfile option.

Example:
1. allocate node05,node06 using TORQUE.
2. request node05 only with -host option

[mishima@manage ~]$ qsub -I -l nodes=node05+node06
qsub: waiting for job 8661.manage.cluster to start
qsub: job 8661.manage.cluster ready

[mishima@node05 ~]$ cat $PBS_NODEFILE
node05
node06
[mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
<< hang here >>

And, my fix for plm_base_launch_support.c is as follows:

--- plm_base_launch_support.c     2014-03-12 05:51:45.0 +0900
+++ plm_base_launch_support.try.c 2014-03-18 08:38:03.0 +0900
@@ -1662,7 +1662,11 @@
         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                              "%s plm:base:setup_vm only HNP left",
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+        /* cleanup */
         OBJ_DESTRUCT(&nodes);
+        /* mark that the daemons have reported so we can proceed */
+        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
+        daemons->updated = false;
         return ORTE_SUCCESS;
     }

Tetsuya
Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3
Hmm...no, I don't think that's the correct patch. We want that function to remain "clean" as its job is simply to construct the list of nodes for the VM. It's the responsibility of the launcher to decide what to do with it.

Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix

Ralph

On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote: > > Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3. > > Condition: > 1. allocate some nodes using RM such as TORQUE. > 2. request the head node only in executing the job with > -host or -hostfile option. > > Example: > 1. allocate node05,node06 using TORQUE. > 2. request node05 only with -host option > > [mishima@manage ~]$ qsub -I -l nodes=node05+node06 > qsub: waiting for job 8661.manage.cluster to start > qsub: job 8661.manage.cluster ready > > [mishima@node05 ~]$ cat $PBS_NODEFILE > node05 > node06 > [mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog > << hang here >> > > And, my fix for plm_base_launch_support.c is as follows: > --- plm_base_launch_support.c 2014-03-12 05:51:45.0 +0900 > +++ plm_base_launch_support.try.c 2014-03-18 08:38:03.0 +0900 > @@ -1662,7 +1662,11 @@ > OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output, > "%s plm:base:setup_vm only HNP left", > ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); > +/* cleanup */ > OBJ_DESTRUCT(&nodes); > +/* mark that the daemons have reported so we can proceed */ > +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED; > +daemons->updated = false; > return ORTE_SUCCESS; > } > > Tetsuya > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3
I do not understand your fix yet, but it would be better, I guess. I'll check it later, but now please let me explain what I thought:

If some nodes are allocated, it doesn't go through this part because opal_list_get_size(&nodes) > 0 at this location.

1590    if (0 == opal_list_get_size(&nodes)) {
1591        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
1592                             "%s plm:base:setup_vm only HNP in allocation",
1593                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1594        /* cleanup */
1595        OBJ_DESTRUCT(&nodes);
1596        /* mark that the daemons have reported so we can proceed */
1597        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
1598        daemons->updated = false;
1599        return ORTE_SUCCESS;
1600    }

After filtering, opal_list_get_size(&nodes) becomes zero at this location. That's why I think I should add two lines 1597,1598 to the if-clause below.

1660    if (0 == opal_list_get_size(&nodes)) {
1661        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
1662                             "%s plm:base:setup_vm only HNP left",
1663                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1664        OBJ_DESTRUCT(&nodes);
1665        return ORTE_SUCCESS;

Tetsuya

> Hmm...no, I don't think that's the correct patch. We want that function to remain "clean" as its job is simply to construct the list of nodes for the VM. It's the responsibility of the launcher to > decide what to do with it. > > Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix > > Ralph > > On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3. > > > > Condition: > > 1. allocate some nodes using RM such as TORQUE. > > 2. request the head node only in executing the job with > > -host or -hostfile option. > > > > Example: > > 1. allocate node05,node06 using TORQUE. > > 2. request node05 only with -host option > > > > [mishima@manage ~]$ qsub -I -l nodes=node05+node06 > > qsub: waiting for job 8661.manage.cluster to start > > qsub: job 8661.manage.cluster ready > > > > [mishima@node05 ~]$ cat $PBS_NODEFILE > > node05 > > node06 > > [mishima@node05 ~]$ mpirun -np 1 -host node05 > ~/mis/openmpi/demos/myprog > > << hang here >> > > > > And, my fix for plm_base_launch_support.c is as follows: > > --- plm_base_launch_support.c 2014-03-12 05:51:45.0 +0900 > > +++ plm_base_launch_support.try.c 2014-03-18 08:38:03.0 > +0900 > > @@ -1662,7 +1662,11 @@ > > OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output, > > "%s plm:base:setup_vm only HNP left", > > ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); > > +/* cleanup */ > > OBJ_DESTRUCT(&nodes); > > +/* mark that the daemons have reported so we can proceed */ > > +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED; > > +daemons->updated = false; > > return ORTE_SUCCESS; > > } > > > > Tetsuya > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI 1.7.4 with --enable-mpi-thread-multiple gives MPI_Recv error
Hello,

Gustavo Correa wrote: I guess you need to provide buffers of char type to MPI_Send and MPI_Recv, not NULL.

That was not the problem; I was anyway using message size 0, so it should be OK to give NULL as the buffer pointer.

I did find the problem now; it turns out that this was not at all due to any bug in Open MPI, it was my program that had a bug; I used the wrong constant when specifying the datatype. I used MPI_CHARACTER, which I thought would correspond to a char or unsigned char in C/C++. But now when I checked the MPI standard, it says that MPI_CHARACTER is for the Fortran CHARACTER type. Since I am using C, not Fortran, I should use MPI_CHAR or MPI_SIGNED_CHAR or MPI_UNSIGNED_CHAR. Now I have corrected my program by changing MPI_CHARACTER to MPI_UNSIGNED_CHAR, and then it works.

Sorry for reporting this as a bug in Open MPI, it was really a bug in my own code.

/ Elias

Quoting Gustavo Correa: I guess you need to provide buffers of char type to MPI_Send and MPI_Recv, not NULL. On Mar 16, 2014, at 8:04 PM, Elias Rudberg wrote: Hi Ralph, Thanks for the quick answer! Try running the "ring" program in our example directory and see if that works I just did this, and it works. (I ran ring_c.c) Looking in your ring_c.c code, I see that it is quite similar to my test program but one thing that differs is the datatype: the ring program uses MPI_INT but my test uses MPI_CHARACTER. I tried changing from MPI_INT to MPI_CHARACTER in ring_c.c (and the type of the variable "message" from int to char), and then ring_c.c fails in the same way as my test code. And my code works if changing from MPI_CHARACTER to MPI_INT. So, it looks like there is a bug that is triggered when using MPI_CHARACTER, but it works with MPI_INT. / Elias Quoting Ralph Castain: Try running the "ring" program in our example directory and see if that works On Mar 16, 2014, at 4:26 PM, Elias Rudberg wrote: Hello! I would like to report a bug in Open MPI 1.7.4 when compiled with --enable-mpi-thread-multiple. The bug can be reproduced with the following test program (mpi-send-recv.c):

===
#include <mpi.h>
#include <stdio.h>

int main() {
  MPI_Init(NULL, NULL);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Rank %d at start\n", rank);
  if (rank)
    MPI_Send(NULL, 0, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD);
  else
    MPI_Recv(NULL, 0, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Rank %d at end\n", rank);
  MPI_Finalize();
  return 0;
}
===

With Open MPI 1.7.4 compiled with --enable-mpi-thread-multiple, the test program above fails like this:

$ mpirun -np 2 ./a.out
Rank 0 at start
Rank 1 at start
[elias-p6-2022scm:2743] *** An error occurred in MPI_Recv
[elias-p6-2022scm:2743] *** reported by process [140733606985729,140256452018176]
[elias-p6-2022scm:2743] *** on communicator MPI_COMM_WORLD
[elias-p6-2022scm:2743] *** MPI_ERR_TYPE: invalid datatype
[elias-p6-2022scm:2743] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[elias-p6-2022scm:2743] *** and potentially your MPI job)

Steps I use to reproduce this in Ubuntu: (1) Download openmpi-1.7.4.tar.gz (2) Configure like this: ./configure --enable-mpi-thread-multiple (3) make (4) Compile test program like this: mpicc mpi-send-recv.c (5) Run like this: mpirun -np 2 ./a.out This gives the error above. Of course, in my actual application I will want to call MPI_Init_thread with MPI_THREAD_MULTIPLE instead of just MPI_Init, but that does not seem to matter for this error; the same error comes regardless of the way I call MPI_Init/MPI_Init_thread.
So I just put MPI_Init in the test code above to make it as short as possible. Do you agree that this is a bug, or am I doing something wrong? Any ideas for workarounds to make things work with --enable-mpi-thread-multiple? (I do need threads, so skipping --enable-mpi-thread-multiple is probably not an option for me.) Best regards, Elias ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
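For reference, a corrected version of Elias's test program with the datatype changed as he describes (MPI_CHARACTER replaced by MPI_UNSIGNED_CHAR, everything else unchanged, still run with mpirun -np 2) would look roughly like this:

#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Rank %d at start\n", rank);
    /* MPI_CHARACTER is the Fortran CHARACTER type; from C use
     * MPI_CHAR / MPI_SIGNED_CHAR / MPI_UNSIGNED_CHAR instead. */
    if (rank)
        MPI_Send(NULL, 0, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(NULL, 0, MPI_UNSIGNED_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank %d at end\n", rank);
    MPI_Finalize();
    return 0;
}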
Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3
Understood, and your logic is correct. It's just that I'd rather each launcher decide to declare the daemons as reported rather than doing it in the common code, just in case someone writes a launcher where they choose to respond differently to the case where no new daemons need to be launched. On Mar 17, 2014, at 6:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > I do not understand your fix yet, but it would be better, I guess. > > I'll check it later, but now please let me expalin what I thought: > > If some nodes are allocated, it doen't go through this part because > opal_list_get_size(&nodes) > 0 at this location. > > 1590if (0 == opal_list_get_size(&nodes)) { > 1591OPAL_OUTPUT_VERBOSE((5, > orte_plm_base_framework.framework_output, > 1592 "%s plm:base:setup_vm only HNP in > allocation", > 1593 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); > 1594/* cleanup */ > 1595OBJ_DESTRUCT(&nodes); > 1596/* mark that the daemons have reported so we can proceed */ > 1597daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED; > 1598 daemons->updated = false; > 1599return ORTE_SUCCESS; > 1600} > > After filtering, opal_list_get_size(&nodes) becomes zero at this location. > That's why I think I should add two lines 1597,1598 to the if-clause below. > > 1660if (0 == opal_list_get_size(&nodes)) { > 1661OPAL_OUTPUT_VERBOSE((5, > orte_plm_base_framework.framework_output, > 1662 "%s plm:base:setup_vm only HNP left", > 1663 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); > 1664OBJ_DESTRUCT(&nodes); > 1665return ORTE_SUCCESS; > > Tetsuya > >> Hmm...no, I don't think that's the correct patch. We want that function > to remain "clean" as it's job is simply to construct the list of nodes for > the VM. It's the responsibility of the launcher to >> decide what to do with it. >> >> Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix >> >> Ralph >> >> On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote: >> >>> >>> Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3. >>> >>> Condition: >>> 1. allocate some nodes using RM such as TORQUE. >>> 2. request the head node only in executing the job with >>> -host or -hostfile option. >>> >>> Example: >>> 1. allocate node05,node06 using TORQUE. >>> 2. request node05 only with -host option >>> >>> [mishima@manage ~]$ qsub -I -l nodes=node05+node06 >>> qsub: waiting for job 8661.manage.cluster to start >>> qsub: job 8661.manage.cluster ready >>> >>> [mishima@node05 ~]$ cat $PBS_NODEFILE >>> node05 >>> node06 >>> [mishima@node05 ~]$ mpirun -np 1 -host node05 > ~/mis/openmpi/demos/myprog >>> << hang here >> >>> >>> And, my fix for plm_base_launch_support.c is as follows: >>> --- plm_base_launch_support.c 2014-03-12 05:51:45.0 +0900 >>> +++ plm_base_launch_support.try.c 2014-03-18 08:38:03.0 > +0900 >>> @@ -1662,7 +1662,11 @@ >>>OPAL_OUTPUT_VERBOSE((5, > orte_plm_base_framework.framework_output, >>> "%s plm:base:setup_vm only HNP left", >>> ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); >>> +/* cleanup */ >>>OBJ_DESTRUCT(&nodes); >>> +/* mark that the daemons have reported so we can proceed */ >>> +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED; >>> +daemons->updated = false; >>>return ORTE_SUCCESS; >>>} >>> >>> Tetsuya >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users