[OMPI users] OpenMpi-java Examples

2014-03-17 Thread madhurima madhunapanthula
Hi,

I am new to Open MPI. I have installed the Java bindings of Open MPI and am
running some samples on the cluster.
I am interested in some samples using the THREAD_SERIALIZED and THREAD_FUNNELED
support levels in Open MPI. Please provide me some samples.

-- 
Lokah samasta sukhinobhavanthu

Thanks,
Madhurima


Re: [OMPI users] Question about '--mca btl tcp,self'

2014-03-17 Thread Jeff Squyres (jsquyres)
To add on to what Ralph said:

1. There are two different message passing paths in OMPI:
   - "OOB" (out of band): used for control messages
   - "BTL" (byte transfer layer): used for MPI traffic
   (there are actually others, but these seem to be the relevant 2 for your 
setup)

2. If you don't specify which OOB interfaces to use, OMPI will (basically) just 
pick one.  It doesn't really matter which one it uses; the OOB channel doesn't 
use too much bandwidth, and is mostly used just during startup and shutdown.

The one exception to this is stdout/stderr routing.  If your MPI app writes to 
stdout/stderr, this also uses the OOB path.  So if you output a LOT to stdout, 
then the OOB interface choice might matter.

3. If you don't specify which MPI interfaces to use, OMPI will basically find 
the "best" set of interfaces and use those.  IP interfaces are always rated 
less than OS-bypass interfaces (e.g., verbs/IB).

Or, as you noticed, you can give a comma-delimited list of BTLs to use.  OMPI 
will then use -- at most -- exactly those BTLs, but definitely no others.  Each 
BTL typically has an additional parameter or parameters that can be used to 
specify which interfaces to use for the network interface type that that BTL 
uses.  For example, btl_tcp_if_include tells the TCP BTL which interface(s) to 
use.
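
For instance, something along these lines (eth0 is just an example interface 
name; substitute your own) restricts the TCP BTL to a single interface:

  mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 ...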

Also, note that you seem to have missed a BTL: sm (shared memory).  sm is the 
preferred BTL to use for same-server communication.  It is much faster than 
both the TCP loopback device (which OMPI excludes by default, BTW, which is 
probably why you got reachability errors when you specified "--mca btl 
tcp,self") and the verbs (i.e., "openib") BTL for same-server communication.

4. If you don't specify anything, OMPI usually picks the best thing for you.  
In your case, it'll probably be equivalent to:

 mpirun --mca btl openib,sm,self ...

And the control messages will flow across one of your IP interfaces.  

5. If you want to be specific about which one it uses, you can specify 
oob_tcp_if_include.  For example:

  mpirun --mca oob_tcp_if_include eth0 ...

Make sense?



On Mar 15, 2014, at 1:18 AM, Jianyu Liu  wrote:

>> On Mar 14, 2014, at 10:16:34 AM,Jeff Squyres  wrote: 
>> 
>>> On Mar 14, 2014, at 10:11 AM, Ralph Castain  wrote: 
>>> 
>>>> 1. If specified '--mca btl tcp,self', which interface will the application run 
>>>> on: the GigE adapter, OR the OpenFabrics interface in IP over IB mode (just like 
>>>> a high performance GigE adapter)?
>>> 
>>> Both - ip over ib looks just like an Ethernet adaptor 
>> 
>> 
>> To be clear: the TCP BTL will use all TCP interfaces (regardless of 
>> underlying physical transport). Your GigE adapter and your IB adapter both 
>> present IP interfaces to the OS, and both support TCP. So the TCP BTL will 
>> use them, because it just sees the TCP/IP interfaces. 
> 
> Thanks for your kind input.
> 
> Please see if I have understood correctly
> 
> Assume there are two networks:
>   Gigabit Ethernet
> 
> eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0
> 
>   InfiniBand network
> 
> ib0 :  172.20.[1-22].[1-4] / 255.255.0.0
> 
> 
> 1. If specified '--mca btl tcp,self'
> 
> The control information (such as setup and teardown) is routed to and 
> passed by Gigabit Ethernet in TCP/IP mode
> The MPI messages are routed to and passed by the InfiniBand network in IP 
> over IB mode
> On the same machine, the TCP loopback device will be used for passing 
> control and MPI messages 
> 
> 2. If specified '--mca btl tcp,self --mca btl_tcp_if_include ib0'
> 
> Both the control information (such as setup and teardown) and the MPI 
> messages are routed to and passed by the InfiniBand network in IP over IB mode
> On the same machine, the TCP loopback device will be used for passing 
> control and MPI messages
> 
> 
> 3. If specified '--mca btl openib,self'
> 
> The control information (such as setup and teardown) is routed to and 
> passed by the InfiniBand network in IP over IB mode
> The MPI messages are routed to and passed by the InfiniBand network in RDMA 
> mode
> On the same machine, the TCP loopback device will be used for passing 
> control and MPI messages
> 
> 
> 4. If no 'mca btl' parameters are specified
> 
> The control information (such as setup and teardown) is routed to and 
> passed by Gigabit Ethernet in TCP/IP mode
> The MPI messages are routed to and passed by the InfiniBand network in RDMA mode
> On the same machine, the shared memory (sm) BTL will be used for passing 
> control and MPI messages
> 
> 
> Appreciating your kind input
> 
> Jianyu  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Compiling Open MPI 1.7.4 using PGI 14.2 and Mellanox HCOLL enabled

2014-03-17 Thread Jeff Squyres (jsquyres)
Ralph -- it seems to be picking up "-pthread" from libslurm.la (i.e., outside 
of the OMPI tree), which pgcc doesn't seem to like.

Another solution might be to (temporarily?) remove the "-pthread" from 
libslurm.la (which is a text file that you can edit).  Then OMPI shouldn't pick 
up that flag, and building should be ok.
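
If it helps, the line to look for in libslurm.la is typically of this form (the 
exact contents will differ on your system; this is only an illustration):

  inherited_linker_flags=' -pthread'

Deleting the -pthread token from that value is the edit being suggested.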



On Mar 16, 2014, at 11:50 AM, Ralph Castain  wrote:

> If you are running on a Slurm-managed cluster, it won't be happy without 
> configuring --with-slurm - you won't see the allocation, for one.
> 
> Is it just the --with-slurm option that causes the problem? In other words, 
> if you remove the rest of those options (starting --with-hcoll and going down 
> that config line) and leave --with-slurm, does it build?
> 
> On Mar 16, 2014, at 8:22 AM, Filippo Spiga  wrote:
> 
>> Hi Jeff, Hi Ake,
>> 
>> removing --with-slurm and keeping --with-hcoll seems to work. The error 
>> disappears at compile time; I have not yet tried to run a job. I can copy 
>> config.log and make.log if needed.
>> 
>> Cheers,
>> F
>> 
>> On Mar 11, 2014, at 4:48 PM, Jeff Squyres (jsquyres)  
>> wrote:
>>> On Mar 11, 2014, at 11:22 AM, Åke Sandgren  
>>> wrote:
>>> 
>> ../configure  CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 
>> --prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.2_cuda-6.0RC
>>   --enable-mpirun-prefix-by-default --with-hcoll=$HCOLL_DIR 
>> --with-fca=$FCA_DIR --with-mxm=$MXM_DIR --with-knem=$KNEM_DIR 
>> --with-slurm=/usr/local/Cluster-Apps/slurm  
>> --with-cuda=$CUDA_INSTALL_PATH
>> 
>> 
>> At some point the compile process fails with this error:
>> 
>> make[2]: Leaving directory 
>> `/home/fs395/archive/openmpi-1.7.4/build/ompi/mca/coll/hierarch'
>> Making all in mca/coll/hcoll
>> make[2]: Entering directory 
>> `/home/fs395/archive/openmpi-1.7.4/build/ompi/mca/coll/hcoll'
>> CC   coll_hcoll_module.lo
>> CC   coll_hcoll_component.lo
>> CC   coll_hcoll_rte.lo
>> CC   coll_hcoll_ops.lo
>> CCLD mca_coll_hcoll.la
>> pgcc-Error-Unknown switch: -pthread
 
>>>> You have to remove the -pthread from inherited_linker_flags=
>>>> in libpmi.la and libslurm.la from your Slurm build.
>>> 
>>> With the configure line given above, I don't think he should be linking 
>>> against libslurm.
>>> 
>>> But I wonder if the underlying issue is actually correct: perhaps the 
>>> inherited_linker_flags from libhcoll.la has -pthreads in it.
>> 
>> 
>> --
>> Mr. Filippo SPIGA, M.Sc.
>> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>> 
>> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>> 
>> *
>> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL 
>> and may be privileged or otherwise protected from disclosure. The contents 
>> are not to be disclosed to anyone other than the addressee. Unauthorized 
>> recipients are requested to preserve this confidentiality and to advise the 
>> sender immediately of any error in transmission."
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] efficient strategy with temporary message copy

2014-03-17 Thread Jeff Squyres (jsquyres)
On Mar 16, 2014, at 10:24 PM, christophe petit  
wrote:

> I am studying the optimization strategy when the number of communication 
> functions in a code is high.
> 
> My courses on MPI say two things about optimization which are contradictory:
> 
> 1*) You have to use a temporary message copy to allow non-blocking sending and 
> uncouple the sending and receiving

There's a lot of schools of thought here, and the real answer is going to 
depend on your application.

If the message is "short" (and the exact definition of "short" depends on your 
platform -- it varies depending on your CPU, your memory, your CPU/memory 
interconnect, ...etc.), then copying to a pre-allocated bounce buffer is 
typically a good idea.  That lets you keep using your "real" buffer and not 
have to wait until communication is done.

For "long" messages, the equation is a bit different.  If "long" isn't 
"enormous", you might be able to have N buffers available, and simply work on 1 
of them at a time in your main application and use the others for ongoing 
non-blocking communication.  This is sometimes called "shadow" copies, or 
"ghost" copies.

Such shadow copies are most useful when you receive something each iteration.  
For example, something like this:

  buffer[0] = malloc(...);
  buffer[1] = malloc(...);
  current = 0;
  while (still_doing_iterations) {
      MPI_Irecv(buffer[current], ..., &req);
      /* work on buffer[1 - current], i.e., the other buffer */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
      current = 1 - current;
  }

You get the idea.

> 2*) Avoid using temporary message copy because the copy will add extra cost 
> to execution time. 

It will, if the memcpy cost is significant (especially compared to the network 
time to send it).  If the memcpy is small/insignificant, then don't worry about 
it.

You'll need to determine where this crossover point is, however.

Also keep in mind that MPI and/or the underlying network stack will likely be 
doing these kinds of things under the covers for you.  Indeed, if you send 
short messages -- even via MPI_SEND -- it may return "immediately", indicating 
that MPI says it's safe for you to use the send buffer.  But that doesn't mean 
that the message has even actually left the current server and gone out onto 
the network yet (i.e., some other layer below you may have just done a memcpy 
because it was a short message, and the processing/sending of that message is 
still ongoing).

> And then, we are advised to do: 
> 
> - replace MPI_SEND by MPI_SSEND (synchronous blocking send): it is said 
> that execution time is divided by a factor of 2

This very, very much depends on your application.

MPI_SSEND won't return until the receiver has started to receive the message.

For some communication patterns, putting in this additional level of 
synchronization is helpful -- it keeps all MPI processes in tighter 
synchronization and you might experience less jitter, etc.  And therefore 
overall execution time is faster.

But for others, it adds unnecessary delay.

I'd say it's an over-generalization that simply replacing MPI_SEND with 
MPI_SSEND always reduces execution time by 2.

> - use MPI_ISSEND and MPI_IRECV with the MPI_WAIT function to synchronize 
> (synchronous non-blocking send): it is said that execution time is divided by 
> a factor of 3

Again, it depends on the app.  Generally, non-blocking communication is better 
-- *if your app can effectively overlap communication and computation*.

If your app doesn't take advantage of this overlap, then you won't see such 
performance benefits.  For example:

   MPI_Isend(buffer, ..., req);
   MPI_Wait(&req, ...);

Technically, the above uses ISEND and WAIT... but it's actually probably going 
to be *slower* than using MPI_SEND because you've made multiple function calls 
with no additional work between the two -- so the app didn't effectively 
overlap the communication with any local computation.  Hence: no performance 
benefit.
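
By contrast, here is a rough sketch of the pattern that does overlap (the names 
are made up for illustration; do_local_work() stands for computation that does 
not touch the send buffer):

   MPI_Request req;
   MPI_Isend(buffer, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
   do_local_work();                     /* useful computation while the send progresses */
   MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now do we need the send to be complete */

The communication can make progress while do_local_work() runs, so the time 
spent in MPI_Wait() shrinks.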

> So what's the best optimization? Do we have to use a temporary message copy or 
> not, and if yes, in which cases?

As you can probably see from my text above, the answer is: it depends.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Question about '--mca btl tcp,self'

2014-03-17 Thread Gus Correa

On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote:

To add on to what Ralph said:

1. There are two different message passing paths in OMPI:
- "OOB" (out of band): used for control messages
- "BTL" (byte transfer layer): used for MPI traffic
(there are actually others, but these seem to be the relevant 2 for your 
setup)

2. If you don't specify which OOB interfaces

to use OMPI will (basically) just pick one.
It doesn't really matter which one it uses;
the OOB channel doesn't use too much bandwidth,
and is mostly just during startup and shutdown.


The one exception to this is stdout/stderr routing.

If your MPI app writes to stdout/stderr, this also uses the OOB path.
So if you output a LOT to stdout, then the OOB interface choice might 
matter.


Hi All

Not trying to hijack Jianyu's very interesting and informative questions 
and thread, I have two questions and one note about it.

I promise to shut up after this.

Is the interface that OOB picks and uses
somehow related to how the hosts/nodes names listed
in a "hostfile"
(or in the mpiexec command -host option,
or in the Torque/SGE/Slurm node file,)
are resolved into IP addresses (via /etc/hosts, DNS or other mechanism)?

In other words, does OOB pick the interface associated to the IP address
that resolves the specific node name, or does OOB have its own will and
picks whatever interface it wants?

At some early point during startup I suppose mpiexec
needs to touch base first time with each node,
and I would guess the nodes' IP address
(and the corresponding interface) plays a role then.
Does OOB piggy-back that same interface to do its job?



3. If you don't specify which MPI interfaces to use, OMPI will basically find 
the
"best" set of interfaces and use those.  IP interfaces are always rated 
less than

OS-bypass interfaces (e.g., verbs/IB).


In a node outfitted with more than one InfiniBand interface,
can one choose which one OMPI is going to use (say, if one wants to
reserve the other IB interface for IO)?

In other words, are there verbs/rdma syntax equivalent to

--mca btl_tcp_if_include

and to

--mca oob_tcp_if_include  ?

[Perhaps something like --mca btl_openib_if_include ...?]

Forgive me if this question doesn't make sense;
maybe in its guts verbs/rdma already has a greedy policy of using 
everything available, but I don't know anything about it.




Or, as you noticed, you can give a comma-delimited list of BTLs to use.
OMPI will then use -- at most -- exactly those BTLs, but definitely no 
others.
Each BTL typically has an additional parameter or parameters that can be 
used
to specify which interfaces to use for the network interface type that 
that BTL uses.

For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use.


Also, note that you seem to have missed a BTL: sm (shared memory).

sm is the preferred BTL to use for same-server communication.

This may be because several FAQs skip the sm BTL, even when it would
be an appropriate/recommended choice to include in the BTL list.
For instance:

http://www.open-mpi.org/faq/?category=all#selecting-components
http://www.open-mpi.org/faq/?category=all#tcp-selection

The command line examples with an ellipsis "..." don't actually 
exclude the use of "sm", but IMHO are too vague and somewhat misleading.

I think this issue was reported/discussed before in the list,
but somehow the FAQ were not fixed.

Thank you,
Gus Correa

It is much faster than both the TCP loopback device
(which OMPI excludes by default, BTW, which is probably
why you got reachability errors when you specified
"--mca btl tcp,self") and the verbs (i.e., "openib")
BTL for same-server communication.


4. If you don't specify anything, OMPI usually picks the best thing for you.

In your case, it'll probably be equivalent to:


  mpirun --mca btl openib,sm,self ...

And the control messages will flow across one of your IP interfaces.

5. If you want to be specific about which one it uses,

you can specify oob_tcp_if_include.  For example:


   mpirun --mca oob_tcp_if_include eth0 ...

Make sense?



On Mar 15, 2014, at 1:18 AM, Jianyu Liu  wrote:


On Mar 14, 2014, at 10:16:34 AM,Jeff Squyres  wrote:


On Mar 14, 2014, at 10:11 AM, Ralph Castain  wrote:


1. If specified '--mca btl tcp,self', which interface will the application run on: 
the GigE adapter, OR the OpenFabrics interface in IP over IB mode (just like 
a high performance GigE adapter)?


Both - ip over ib looks just like an Ethernet adaptor



To be clear: the TCP BTL will use all TCP interfaces (regardless of underlying 
physical transport). Your GigE adapter and your IB adapter both present IP 
interfaces to the OS, and both support TCP. So the TCP BTL will use them, 
because it just sees the TCP/IP interfaces.


Thanks for your kindly input.

Please see if I have understood correctly

Assume there are two networks:
   Gigabit Ethernet

 eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0

   InfiniBand network

 ib0 :  172.20.[1-22].[1-4] / 255.255.0.0

Re: [OMPI users] Question about '--mca btl tcp,self'

2014-03-17 Thread Ralph Castain

On Mar 17, 2014, at 9:37 AM, Gus Correa  wrote:

> On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote:
>> To add on to what Ralph said:
>> 
>> 1. There are two different message passing paths in OMPI:
>>- "OOB" (out of band): used for control messages
>>- "BTL" (byte transfer layer): used for MPI traffic
>>(there are actually others, but these seem to be the relevant 2 for your 
>> setup)
>> 
>> 2. If you don't specify which OOB interfaces
> to use OMPI will (basically) just pick one.
> It doesn't really matter which one it uses;
> the OOB channel doesn't use too much bandwidth,
> and is mostly just during startup and shutdown.
>> 
>> The one exception to this is stdout/stderr routing.
> If your MPI app writes to stdout/stderr, this also uses the OOB path.
> So if you output a LOT to stdout, then the OOB interface choice might matter.
> 
> Hi All
> 
> Not trying to hijack Jianyu's very interesting and informative questions and 
> thread, I have two questions and one note about it.
> I promise to shut up after this.
> 
> Is the interface that OOB picks and uses
> somehow related to how the hosts/nodes names listed
> in a "hostfile"
> (or in the mpiexec command -host option,
> or in the Torque/SGE/Slurm node file,)
> are resolved into IP addresses (via /etc/hosts, DNS or other mechanism)?
> 
> In other words, does OOB pick the interface associated to the IP address
> that resolves the specific node name, or does OOB have its own will and
> picks whatever interface it wants?

The OOB on each node gets the list of available interfaces from the kernel on 
that node. When it needs to talk to someone on a remote node, it uses the 
standard mechanisms to resolve that node name to an IP address *if* it already 
isn't one - i.e., it checks the provided info to see if it is an IP address, 
and attempts to resolve the name if not.

Once it has an IP address for the remote host, it checks its interfaces to see 
if one is on the same subnet as the remote IP. If so, then it uses that 
interface to create the connection. If none of the interfaces share the same 
subnet as the remote IP, then the OOB picks the first kernel-ordered interface 
and attempts to connect via that one, in the hope that there is a router in the 
system capable of passing the connection to the remote subnet. The OOB will 
cycle across all its interfaces in that manner until one indicates that it was 
indeed able to connect - if not, then we error out.

> 
> At some early point during startup I suppose mpiexec
> needs to touch base first time with each node,
> and I would guess the nodes' IP address
> (and the corresponding interface) plays a role then.
> Does OOB piggy-back that same interface to do its job?

Yes - once we establish that connection, we use it for whatever OOB 
communication is required.

> 
>> 
>> 3. If you don't specify which MPI interfaces to use, OMPI will basically 
>> find the
> "best" set of interfaces and use those.  IP interfaces are always rated less 
> than
> OS-bypass interfaces (e.g., verbs/IB).
> 
> 
> In a node outfitted with more than one InfiniBand interface,
> can one choose which one OMPI is going to use (say, if one wants to
> reserve the other IB interface for IO)?
> 
> In other words, are there verbs/rdma syntax equivalent to
> 
> --mca btl_tcp_if_include
> 
> and to
> 
> --mca oob_tcp_if_include  ?
> 
> [Perhaps something like --mca btl_openib_if_include ...?]

Yes - exactly as you describe

> 
> Forgive me if this question doesn't make sense,
> for maybe on its guts verbs/rdma already has a greedy policy of using 
> everything available, but I don't know anything about it.
> 
>> 
>> Or, as you noticed, you can give a comma-delimited list of BTLs to use.
> OMPI will then use -- at most -- exactly those BTLs, but definitely no others.
> Each BTL typically has an additional parameter or parameters that can be used
> to specify which interfaces to use for the network interface type that that 
> BTL uses.
> For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use.
>> 
>> Also, note that you seem to have missed a BTL: sm (shared memory).
> sm is the preferred BTL to use for same-server communication.
> 
> This may be because several FAQs skip the sm BTL, even when it would
> be an appropriate/recommended choice to include in the BTL list.
> For instance:
> 
> http://www.open-mpi.org/faq/?category=all#selecting-components
> http://www.open-mpi.org/faq/?category=all#tcp-selection
> 
> The command line examples with an ellipsis "..." don't actually 
> exclude the use of "sm", but IMHO are too vague and somewhat misleading.
> 
> I think this issue was reported/discussed before in the list,
> but somehow the FAQ were not fixed.

I can try to do something about it - largely a question of time :-/

> 
> Thank you,
> Gus Correa
> 
> It is much faster than both the TCP loopback device
> (which OMPI excludes by default, BTW, which is probably
> why you got reachability errors when you specified "--mca btl tcp,self") and 
> the verbs (i.e., "openib") BTL for same-server communication.

Re: [OMPI users] Question about '--mca btl tcp,self'

2014-03-17 Thread Jeff Squyres (jsquyres)
On Mar 17, 2014, at 12:37 PM, Gus Correa  wrote:

> In other words, does OOB pick the interface associated to the IP address
> that resolves the specific node name, or does OOB have its own will and
> picks whatever interface it wants?

I'll let Ralph contribute the detail here, but it's basically the latter: the 
OOB has its own will and picks whatever interface it wants.

But keep in mind that this is true for ALL OMPI communications (including MPI 
communications): the hostfile is unrelated to what interfaces are used.

Early MPI implementations back in the 90's overloaded the use of the hostfile 
with which network interfaces were used.  Open MPI has never used that 
approach: we have always used the hostfile (and --host, etc.) as simply a 
mechanism to specify which servers/compute nodes/whatever on which to run.  
Selection of interfaces to use for control messages and MPI messages are 
determined separately.

> In a node outfitted with more than one Inifinband interface,
> can one choose which one OMPI is going to use (say, if one wants to
> reserve the other IB interface for IO)?

Yes.  Each BTL typically has its own MCA param for this kind of thing.  You 
might want to troll through ompi_info output to see if there's anything of 
interest to you. For example:

  ompi_info --param btl openib --level 9

(the "--level 9" option is new somewhere during the 1.7.x series; it will cause 
a syntax error in the 1.6 series)

will show you all the MCA params for the openib BTL.  The one you want for the 
openib BTL is:

mpirun --mca btl_openib_if_include <interfaces>

With the usnic BTL, we allow you to specify interfaces via two different kinds 
of values:

mpirun --mca btl_usnic_if_include <interfaces>

where interfaces can be:

usnic_X (e.g., usnic_0)
CIDR network address (e.g., 192.168.0.0/16)
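
So, for example (the address is purely illustrative):

  mpirun --mca btl_usnic_if_include 192.168.0.0/16 ...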

>> Also, note that you seem to have missed a BTL: sm (shared memory).
> sm is the preferred BTL to use for same-server communication.
> 
> This may be because several FAQs skip the sm BTL, even when it would
> be an appropriate/recommended choice to include in the BTL list.
> For instance:
> 
> http://www.open-mpi.org/faq/?category=all#selecting-components

This one seems to be ok.  I think the item you're referring to in that entry is 
an example of the ^ negation operator.

> http://www.open-mpi.org/faq/?category=all#tcp-selection

Fixed.  Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] efficient strategy with temporary message copy

2014-03-17 Thread christophe petit
Thanks Jeff, I now understand the different cases better and how to choose 
depending on the situation.


2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) :

> On Mar 16, 2014, at 10:24 PM, christophe petit <
> christophe.peti...@gmail.com> wrote:
>
> > I am studying the optimization strategy when the number of communication
> functions in a codeis high.
> >
> > My courses on MPI say two things for optimization which are
> contradictory :
> >
> > 1*) You have to use temporary message copy to allow non-blocking sending
> and uncouple the sending and receiving
>
> There's a lot of schools of thought here, and the real answer is going to
> depend on your application.
>
> If the message is "short" (and the exact definition of "short" depends on
> your platform -- it varies depending on your CPU, your memory, your
> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
> buffer is typically a good idea.  That lets you keep using your "real"
> buffer and not have to wait until communication is done.
>
> For "long" messages, the equation is a bit different.  If "long" isn't
> "enormous", you might be able to have N buffers available, and simply work
> on 1 of them at a time in your main application and use the others for
> ongoing non-blocking communication.  This is sometimes called "shadow"
> copies, or "ghost" copies.
>
> Such shadow copies are most useful when you receive something each
> iteration, for example.  For example, something like this:
>
>   buffer[0] = malloc(...);
>   buffer[1] = malloc(...);
>   current = 0;
>   while (still_doing_iterations) {
>       MPI_Irecv(buffer[current], ..., &req);
>       /* work on buffer[1 - current], i.e., the other buffer */
>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>       current = 1 - current;
>   }
>
> You get the idea.
>
> > 2*) Avoid using temporary message copy because the copy will add extra
> cost on execution time.
>
> It will, if the memcpy cost is significant (especially compared to the
> network time to send it).  If the memcpy is small/insignificant, then don't
> worry about it.
>
> You'll need to determine where this crossover point is, however.
>
> Also keep in mind that MPI and/or the underlying network stack will likely
> be doing these kinds of things under the covers for you.  Indeed, if you
> send short messages -- even via MPI_SEND -- it may return "immediately",
> indicating that MPI says it's safe for you to use the send buffer.  But
> that doesn't mean that the message has even actually left the current
> server and gone out onto the network yet (i.e., some other layer below you
> may have just done a memcpy because it was a short message, and the
> processing/sending of that message is still ongoing).
>
> > And then, we are adviced to do :
> >
> > - replace MPI_SEND by MPI_SSEND (synchroneous blocking sending) : it is
> said that execution is divided by a factor 2
>
> This very, very much depends on your application.
>
> MPI_SSEND won't return until the receiver has started to receive the
> message.
>
> For some communication patterns, putting in this additional level of
> synchronization is helpful -- it keeps all MPI processes in tighter
> synchronization and you might experience less jitter, etc.  And therefore
> overall execution time is faster.
>
> But for others, it adds unnecessary delay.
>
> I'd say it's an over-generalization that simply replacing MPI_SEND with
> MPI_SSEND always reduces execution time by 2.
>
> > - use MPI_ISSEND and MPI_IRECV with MPI_WAIT function to synchronize
> (synchroneous non-blocking sending) : it is said that execution is divided
> by a factor 3
>
> Again, it depends on the app.  Generally, non-blocking communication is
> better -- *if your app can effectively overlap communication and
> computation*.
>
> If your app doesn't take advantage of this overlap, then you won't see
> such performance benefits.  For example:
>
>MPI_Isend(buffer, ..., req);
>MPI_Wait(&req, ...);
>
> Technically, the above uses ISEND and WAIT... but it's actually probably
> going to be *slower* than using MPI_SEND because you've made multiple
> function calls with no additional work between the two -- so the app didn't
> effectively overlap the communication with any local computation.  Hence:
> no performance benefit.
>
> > So what's the best optimization ? Do we have to use temporary message
> copy or not and if yes, what's the case for ?
>
> As you can probably see from my text above, the answer is: it depends.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMpi-java Examples

2014-03-17 Thread Oscar Vega-Gisbert

Hi Madhurima,

Currently we only have tests which start MPI and check the provided  
level of thread support:


int provided = MPI.InitThread(args, MPI.THREAD_FUNNELED);

if (provided < MPI.THREAD_FUNNELED)
{
    throw new MPIException("MPI_Init_thread returned less " +
                           "than MPI_THREAD_FUNNELED.\n");
}
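
For reference, the same check written against the C API (which the Java 
binding mirrors) would look roughly like this -- a minimal sketch, not one 
of the shipped examples:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* ask for FUNNELED; MPI reports the level it can actually provide */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI_Init_thread returned less than MPI_THREAD_FUNNELED\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... only the thread that called MPI_Init_thread makes MPI calls ... */
        MPI_Finalize();
        return 0;
    }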

Regards,
Oscar

Quoting madhurima madhunapanthula :


Hi,

I am new to Open MPI. I have installed the Java bindings of Open MPI and am
running some samples on the cluster.
I am interested in some samples using the THREAD_SERIALIZED and THREAD_FUNNELED
support levels in Open MPI. Please provide me some samples.

--
Lokah samasta sukhinobhavanthu

Thanks,
Madhurima






This message was sent using IMP, the Internet Messaging Program.




[OMPI users] Usage of MPI_Win_create with MPI_Comm_Spawn

2014-03-17 Thread Ramesh Vinayagam
Hi,

Can comm_spawn be used with win_create?

For ex:
Master process:
---

  MPI_Comm_spawn(worker_program,MPI_ARGV_NULL, world_size-1,
 info, 0, MPI_COMM_SELF, &everyone,
 MPI_ERRCODES_IGNORE);

  MPI_Win_create(&testval, sizeof(double), 1,
 MPI_INFO_NULL, everyone,
 &nwin);


Worker process:


 MPI_Comm_get_parent(&parent);
  if (parent == MPI_COMM_NULL) error("No parent!");
  MPI_Comm_remote_size(parent, &size);
  if (size != 1) error("Something's wrong with the parent");

  MPI_Win_create(MPI_BOTTOM, 0,
 1, MPI_INFO_NULL,
 parent, &nwin);



This one fails currently. Am I doing something wrong? It would be great if
someone could help me.

Thanks
Ramesh


Re: [OMPI users] efficient strategy with temporary message copy

2014-03-17 Thread Saliya Ekanayake
Also, this presentation might be useful
http://extremecomputingtraining.anl.gov/files/2013/07/tuesday-slides2.pdf

Thank you,
Saliya
On Mar 17, 2014 2:18 PM, "christophe petit" 
wrote:

> Thanks Jeff, I understand better the different cases and how to choose as
> a function of the situation
>
>
> 2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) :
>
>> On Mar 16, 2014, at 10:24 PM, christophe petit <
>> christophe.peti...@gmail.com> wrote:
>>
>> > I am studying the optimization strategy when the number of
>> communication functions in a codeis high.
>> >
>> > My courses on MPI say two things for optimization which are
>> contradictory :
>> >
>> > 1*) You have to use temporary message copy to allow non-blocking
>> sending and uncouple the sending and receiving
>>
>> There's a lot of schools of thought here, and the real answer is going to
>> depend on your application.
>>
>> If the message is "short" (and the exact definition of "short" depends on
>> your platform -- it varies depending on your CPU, your memory, your
>> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
>> buffer is typically a good idea.  That lets you keep using your "real"
>> buffer and not have to wait until communication is done.
>>
>> For "long" messages, the equation is a bit different.  If "long" isn't
>> "enormous", you might be able to have N buffers available, and simply work
>> on 1 of them at a time in your main application and use the others for
>> ongoing non-blocking communication.  This is sometimes called "shadow"
>> copies, or "ghost" copies.
>>
>> Such shadow copies are most useful when you receive something each
>> iteration, for example.  For example, something like this:
>>
>>   buffer[0] = malloc(...);
>>   buffer[1] = malloc(...);
>>   current = 0;
>>   while (still_doing_iterations) {
>>       MPI_Irecv(buffer[current], ..., &req);
>>       /* work on buffer[1 - current], i.e., the other buffer */
>>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>>       current = 1 - current;
>>   }
>>
>> You get the idea.
>>
>> > 2*) Avoid using temporary message copy because the copy will add extra
>> cost on execution time.
>>
>> It will, if the memcpy cost is significant (especially compared to the
>> network time to send it).  If the memcpy is small/insignificant, then don't
>> worry about it.
>>
>> You'll need to determine where this crossover point is, however.
>>
>> Also keep in mind that MPI and/or the underlying network stack will
>> likely be doing these kinds of things under the covers for you.  Indeed, if
>> you send short messages -- even via MPI_SEND -- it may return
>> "immediately", indicating that MPI says it's safe for you to use the send
>> buffer.  But that doesn't mean that the message has even actually left the
>> current server and gone out onto the network yet (i.e., some other layer
>> below you may have just done a memcpy because it was a short message, and
>> the processing/sending of that message is still ongoing).
>>
>> > And then, we are adviced to do :
>> >
>> > - replace MPI_SEND by MPI_SSEND (synchroneous blocking sending) : it is
>> said that execution is divided by a factor 2
>>
>> This very, very much depends on your application.
>>
>> MPI_SSEND won't return until the receiver has started to receive the
>> message.
>>
>> For some communication patterns, putting in this additional level of
>> synchronization is helpful -- it keeps all MPI processes in tighter
>> synchronization and you might experience less jitter, etc.  And therefore
>> overall execution time is faster.
>>
>> But for others, it adds unnecessary delay.
>>
>> I'd say it's an over-generalization that simply replacing MPI_SEND with
>> MPI_SSEND always reduces execution time by 2.
>>
>> > - use MPI_ISSEND and MPI_IRECV with MPI_WAIT function to synchronize
>> (synchroneous non-blocking sending) : it is said that execution is divided
>> by a factor 3
>>
>> Again, it depends on the app.  Generally, non-blocking communication is
>> better -- *if your app can effectively overlap communication and
>> computation*.
>>
>> If your app doesn't take advantage of this overlap, then you won't see
>> such performance benefits.  For example:
>>
>>MPI_Isend(buffer, ..., req);
>>MPI_Wait(&req, ...);
>>
>> Technically, the above uses ISEND and WAIT... but it's actually probably
>> going to be *slower* than using MPI_SEND because you've made multiple
>> function calls with no additional work between the two -- so the app didn't
>> effectively overlap the communication with any local computation.  Hence:
>> no performance benefit.
>>
>> > So what's the best optimization ? Do we have to use temporary message
>> copy or not and if yes, what's the case for ?
>>
>> As you can probably see from my text above, the answer is: it depends.
>>  :-)
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> users mailing list

[OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread tmishima

Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.

Condition:
1. allocate some nodes using RM such as TORQUE.
2. request only the head node when executing the job with the
   -host or -hostfile option.

Example:
1. allocate node05,node06 using TORQUE.
2. request node05 only with -host option

[mishima@manage ~]$ qsub -I -l nodes=node05+node06
qsub: waiting for job 8661.manage.cluster to start
qsub: job 8661.manage.cluster ready

[mishima@node05 ~]$ cat $PBS_NODEFILE
node05
node06
[mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
<< hang here >>

And, my fix for plm_base_launch_support.c is as follows:
--- plm_base_launch_support.c   2014-03-12 05:51:45.0 +0900
+++ plm_base_launch_support.try.c   2014-03-18 08:38:03.0 +0900
@@ -1662,7 +1662,11 @@
 OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
  "%s plm:base:setup_vm only HNP left",
  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+/* cleanup */
 OBJ_DESTRUCT(&nodes);
+/* mark that the daemons have reported so we can proceed */
+daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
+daemons->updated = false;
 return ORTE_SUCCESS;
 }

Tetsuya



Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread Ralph Castain
Hmm...no, I don't think that's the correct patch. We want that function to 
remain "clean" as its job is simply to construct the list of nodes for the VM. 
It's the responsibility of the launcher to decide what to do with it.

Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix

Ralph

On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.
> 
> Condition:
> 1. allocate some nodes using RM such as TORQUE.
> 2. request the head node only in executing the job with
>   -host or -hostfile option.
> 
> Example:
> 1. allocate node05,node06 using TORQUE.
> 2. request node05 only with -host option
> 
> [mishima@manage ~]$ qsub -I -l nodes=node05+node06
> qsub: waiting for job 8661.manage.cluster to start
> qsub: job 8661.manage.cluster ready
> 
> [mishima@node05 ~]$ cat $PBS_NODEFILE
> node05
> node06
> [mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
> << hang here >>
> 
> And, my fix for plm_base_launch_support.c is as follows:
> --- plm_base_launch_support.c   2014-03-12 05:51:45.0 +0900
> +++ plm_base_launch_support.try.c   2014-03-18 08:38:03.0 +0900
> @@ -1662,7 +1662,11 @@
> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>  "%s plm:base:setup_vm only HNP left",
>  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
> +/* cleanup */
> OBJ_DESTRUCT(&nodes);
> +/* mark that the daemons have reported so we can proceed */
> +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
> +daemons->updated = false;
> return ORTE_SUCCESS;
> }
> 
> Tetsuya
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread tmishima


I do not understand your fix yet, but it would be better, I guess.

I'll check it later, but for now please let me explain what I thought:

If some nodes are allocated, it doesn't go through this part because
opal_list_get_size(&nodes) > 0 at this location.

1590if (0 == opal_list_get_size(&nodes)) {
1591OPAL_OUTPUT_VERBOSE((5,
orte_plm_base_framework.framework_output,
1592 "%s plm:base:setup_vm only HNP in
allocation",
1593 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1594/* cleanup */
1595OBJ_DESTRUCT(&nodes);
1596/* mark that the daemons have reported so we can proceed */
1597daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
1598 daemons->updated = false;
1599return ORTE_SUCCESS;
1600}

After filtering, opal_list_get_size(&nodes) becomes zero at this location.
That's why I think I should add two lines 1597,1598 to the if-clause below.

1660if (0 == opal_list_get_size(&nodes)) {
1661OPAL_OUTPUT_VERBOSE((5,
orte_plm_base_framework.framework_output,
1662 "%s plm:base:setup_vm only HNP left",
1663 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1664OBJ_DESTRUCT(&nodes);
1665return ORTE_SUCCESS;

Tetsuya

> Hmm...no, I don't think that's the correct patch. We want that function
to remain "clean" as it's job is simply to construct the list of nodes for
the VM. It's the responsibility of the launcher to
> decide what to do with it.
>
> Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix
>
> Ralph
>
> On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.
> >
> > Condition:
> > 1. allocate some nodes using RM such as TORQUE.
> > 2. request the head node only in executing the job with
> >   -host or -hostfile option.
> >
> > Example:
> > 1. allocate node05,node06 using TORQUE.
> > 2. request node05 only with -host option
> >
> > [mishima@manage ~]$ qsub -I -l nodes=node05+node06
> > qsub: waiting for job 8661.manage.cluster to start
> > qsub: job 8661.manage.cluster ready
> >
> > [mishima@node05 ~]$ cat $PBS_NODEFILE
> > node05
> > node06
> > [mishima@node05 ~]$ mpirun -np 1 -host node05
~/mis/openmpi/demos/myprog
> > << hang here >>
> >
> > And, my fix for plm_base_launch_support.c is as follows:
> > --- plm_base_launch_support.c   2014-03-12 05:51:45.0 +0900
> > +++ plm_base_launch_support.try.c   2014-03-18 08:38:03.0
+0900
> > @@ -1662,7 +1662,11 @@
> > OPAL_OUTPUT_VERBOSE((5,
orte_plm_base_framework.framework_output,
> >  "%s plm:base:setup_vm only HNP left",
> >  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
> > +/* cleanup */
> > OBJ_DESTRUCT(&nodes);
> > +/* mark that the daemons have reported so we can proceed */
> > +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
> > +daemons->updated = false;
> > return ORTE_SUCCESS;
> > }
> >
> > Tetsuya
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Open MPI 1.7.4 with --enable-mpi-thread-multiple gives MPI_Recv error

2014-03-17 Thread Elias Rudberg

Hello,

Gustavo Correa wrote:

I guess you need to provide buffers of char type to
MPI_Send and MPI_Recv, not NULL.


That was not the problem; I was using message size 0 anyway, so  
it should be OK to give NULL as the buffer pointer.


I did find the problem now; it turns out that this was not due to any  
bug in Open MPI at all, it was my program that had a bug: I had used the  
wrong constant for the datatype. I used MPI_CHARACTER, which I thought  
would correspond to a char or unsigned char in C/C++. But when I  
checked the MPI standard, it says that MPI_CHARACTER is for the Fortran  
CHARACTER type. Since I am using C, not Fortran, I should use MPI_CHAR,  
MPI_SIGNED_CHAR, or MPI_UNSIGNED_CHAR. I have now corrected my  
program by changing MPI_CHARACTER to MPI_UNSIGNED_CHAR, and now it  
works.
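
(For completeness, the corrected calls in the test program are simply:

   MPI_Send(NULL, 0, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD);
   MPI_Recv(NULL, 0, MPI_UNSIGNED_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

i.e., MPI_UNSIGNED_CHAR in place of MPI_CHARACTER.)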


Sorry for reporting this as a bug in Open MPI, it was really a bug in  
my own code.


/ Elias


Quoting Gustavo Correa :


I guess you need to provide buffers of char type to
MPI_Send and MPI_Recv, not NULL.

On Mar 16, 2014, at 8:04 PM, Elias Rudberg wrote:


Hi Ralph,

Thanks for the quick answer!

Try running the "ring" program in our example directory and see if  
that works


I just did this, and it works. (I ran ring_c.c)

Looking in your ring_c.c code, I see that it is quite similar to my  
test program but one thing that differs is the datatype: the ring  
program uses MPI_INT but my test uses MPI_CHARACTER.
I tried changing from MPI_INT to MPI_CHARACTER in ring_c.c (and the  
type of the variable "message" from int to char), and then ring_c.c  
fails in the same way as my test code. And my code works if  
changing from MPI_CHARACTER to MPI_INT.


So, it looks like the there is a bug that is triggered when using  
MPI_CHARACTER, but it works with MPI_INT.


/ Elias


Quoting Ralph Castain :

Try running the "ring" program in our example directory and see if  
that works


On Mar 16, 2014, at 4:26 PM, Elias Rudberg  wrote:


Hello!

I would like to report a bug in Open MPI 1.7.4 when compiled with  
--enable-mpi-thread-multiple.


The bug can be reproduced with the following test program  
(mpi-send-recv.c):

===
#include <mpi.h>
#include <stdio.h>
int main() {
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Rank %d at start\n", rank);
if (rank)
  MPI_Send(NULL, 0, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD);
else
  MPI_Recv(NULL, 0, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD,  
MPI_STATUS_IGNORE);

printf("Rank %d at end\n", rank);
MPI_Finalize();
return 0;
}
===

With Open MPI 1.7.4 compiled with --enable-mpi-thread-multiple,  
the test program above fails like this:

$ mpirun -np 2 ./a.out
Rank 0 at start
Rank 1 at start
[elias-p6-2022scm:2743] *** An error occurred in MPI_Recv
[elias-p6-2022scm:2743] *** reported by process  
[140733606985729,140256452018176]

[elias-p6-2022scm:2743] *** on communicator MPI_COMM_WORLD
[elias-p6-2022scm:2743] *** MPI_ERR_TYPE: invalid datatype
[elias-p6-2022scm:2743] *** MPI_ERRORS_ARE_FATAL (processes in  
this communicator will now abort,

[elias-p6-2022scm:2743] ***and potentially your MPI job)

Steps I use to reproduce this in Ubuntu:

(1) Download openmpi-1.7.4.tar.gz

(2) Configure like this:
./configure --enable-mpi-thread-multiple

(3) make

(4) Compile test program like this:
mpicc mpi-send-recv.c

(5) Run like this:
mpirun -np 2 ./a.out
This gives the error above.

Of course, in my actual application I will want to call  
MPI_Init_thread with MPI_THREAD_MULTIPLE instead of just  
MPI_Init, but that does not seem to matter for this error; the  
same error comes regardless of the way I call  
MPI_Init/MPI_Init_thread. So I just put MPI_Init in the test code  
above to make it as short as possible.


Do you agree that this is a bug, or am I doing something wrong?

Any ideas for workarounds to make things work with  
--enable-mpi-thread-multiple? (I do need threads, so skipping  
--enable-mpi-thread-multiple is probably not an option for me.)


Best regards,
Elias


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread Ralph Castain
Understood, and your logic is correct. It's just that I'd rather each launcher 
decide to declare the daemons as reported rather than doing it in the common 
code, just in case someone writes a launcher where they choose to respond 
differently to the case where no new daemons need to be launched.


On Mar 17, 2014, at 6:43 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> I do not understand your fix yet, but it would be better, I guess.
> 
> I'll check it later, but now please let me expalin what I thought:
> 
> If some nodes are allocated, it doen't go through this part because
> opal_list_get_size(&nodes) > 0 at this location.
> 
> 1590if (0 == opal_list_get_size(&nodes)) {
> 1591OPAL_OUTPUT_VERBOSE((5,
> orte_plm_base_framework.framework_output,
> 1592 "%s plm:base:setup_vm only HNP in
> allocation",
> 1593 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
> 1594/* cleanup */
> 1595OBJ_DESTRUCT(&nodes);
> 1596/* mark that the daemons have reported so we can proceed */
> 1597daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
> 1598   daemons->updated = false;
> 1599return ORTE_SUCCESS;
> 1600}
> 
> After filtering, opal_list_get_size(&nodes) becomes zero at this location.
> That's why I think I should add two lines 1597,1598 to the if-clause below.
> 
> 1660if (0 == opal_list_get_size(&nodes)) {
> 1661OPAL_OUTPUT_VERBOSE((5,
> orte_plm_base_framework.framework_output,
> 1662 "%s plm:base:setup_vm only HNP left",
> 1663 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
> 1664OBJ_DESTRUCT(&nodes);
> 1665return ORTE_SUCCESS;
> 
> Tetsuya
> 
>> Hmm...no, I don't think that's the correct patch. We want that function
> to remain "clean" as it's job is simply to construct the list of nodes for
> the VM. It's the responsibility of the launcher to
>> decide what to do with it.
>> 
>> Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix
>> 
>> Ralph
>> 
>> On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.
>>> 
>>> Condition:
>>> 1. allocate some nodes using RM such as TORQUE.
>>> 2. request the head node only in executing the job with
>>>  -host or -hostfile option.
>>> 
>>> Example:
>>> 1. allocate node05,node06 using TORQUE.
>>> 2. request node05 only with -host option
>>> 
>>> [mishima@manage ~]$ qsub -I -l nodes=node05+node06
>>> qsub: waiting for job 8661.manage.cluster to start
>>> qsub: job 8661.manage.cluster ready
>>> 
>>> [mishima@node05 ~]$ cat $PBS_NODEFILE
>>> node05
>>> node06
>>> [mishima@node05 ~]$ mpirun -np 1 -host node05
> ~/mis/openmpi/demos/myprog
>>> << hang here >>
>>> 
>>> And, my fix for plm_base_launch_support.c is as follows:
>>> --- plm_base_launch_support.c   2014-03-12 05:51:45.0 +0900
>>> +++ plm_base_launch_support.try.c   2014-03-18 08:38:03.0
> +0900
>>> @@ -1662,7 +1662,11 @@
>>>OPAL_OUTPUT_VERBOSE((5,
> orte_plm_base_framework.framework_output,
>>> "%s plm:base:setup_vm only HNP left",
>>> ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
>>> +/* cleanup */
>>>OBJ_DESTRUCT(&nodes);
>>> +/* mark that the daemons have reported so we can proceed */
>>> +daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
>>> +daemons->updated = false;
>>>return ORTE_SUCCESS;
>>>}
>>> 
>>> Tetsuya
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users