[OMPI users] Making MPI_Send behave as blocking for all message sizes

2010-11-18 Thread George Markomanolis

Dear all,

I am trying to disable the eager mode in Open MPI 1.3.3 and I don't see 
a real difference in the timings.
I would like to execute a ping (rank 0 sends a message to rank 1) and 
measure the duration of MPI_Send on rank 0 and of MPI_Recv on rank 1. 
I get the following results.


Without changing the eager mode:

bytes   MPI_Send (in msec)   MPI_Recv (in msec)
1       5.8                  52.2
2       5.6                  51.0
4       5.4                  51.1
8       5.6                  51.6
16      5.5                  49.7
32      5.4                  52.1
64      5.3                  53.3



With the eager mode disabled:

ompi_info --param btl tcp | grep eager
MCA btl: parameter "btl_tcp_eager_limit" (current value: "0", data 
source: environment)


bytes   MPI_Send (in msec)   MPI_Recv (in msec)
1       5.4                  52.3
2       5.4                  51.0
4       5.4                  52.1
8       5.4                  50.7
16      5.0                  50.2
32      5.1                  50.1
64      5.4                  52.8

However, I was expecting that with the eager mode disabled the duration 
of MPI_Send would be longer. Am I wrong? Is there any option to make 
MPI_Send behave like a blocking call for all message sizes?



Thanks a lot,
Best regards,
George Markomanolis



[OMPI users] tool for measuring the ping with accuracy

2010-11-21 Thread George Markomanolis

Dear Eugene,

Thanks a lot for the answer; you were right about the eager mode.

I have one more question. I am looking for an established tool to 
measure the ping time: just send a message of 1 byte or more and measure 
the duration of the MPI_Send call on rank 0 and the duration of the 
MPI_Recv on rank 1. I ask about a standard tool because I am also using 
SkaMPI, and the results depend heavily on whether synchronization is 
called before the measurement starts.


So, for example, with the processes synchronized, sending 1 byte, I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~52 ms

where 52 ms is almost half of the ping-pong time, and this is OK.

Without synchronizing I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~7 ms

However, I developed a simple application where rank 0 sends 1000 
messages of 1 byte to rank 1, and I get almost the second set of timings, 
around 7 ms. If in the same application I add the matching MPI_Recv and 
MPI_Send so that it becomes a ping-pong application, then the ping-pong 
duration is 100 ms (like SkaMPI). Can someone explain why this happens? 
The ping-pong takes 100 ms while the ping without synchronization takes 
7 ms.


Thanks a lot,
Best regards,
George Markomanolis



Message: 1
Date: Thu, 18 Nov 2010 10:31:40 -0800
From: Eugene Loh 
Subject: Re: [OMPI users] Making MPI_Send behave as blocking for
all message sizes
To: Open MPI Users 

Try lowering the eager threshold more gradually... e.g., 4K, 2K, 1K, 
512, etc. -- and watch what happens.  I think you will see what you 
expect, except that once you go too small, the value is ignored 
entirely.  So, the setting just won't work at the extreme value (0) you 
want.


Maybe the thing to do is convert your MPI_Send calls to MPI_Ssend 
calls.  Or, compile in a wrapper that intercepts MPI_Send calls and 
implements them by calling PMPI_Ssend.
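A minimal sketch of such a wrapper (assuming the pre-MPI-3 prototype 
without const, matching Open MPI 1.3-era headers) might look like this; 
compile it into the application, or into a library linked ahead of the 
MPI library, so that it overrides MPI_Send:

#include <mpi.h>

/* Sketch: intercept MPI_Send and implement it with PMPI_Ssend so that
 * every standard send completes synchronously, whatever the size. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    return PMPI_Ssend(buf, count, datatype, dest, tag, comm);
}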


George Markomanolis wrote:

  

Dear all,

I am trying to disable the eager mode in Open MPI 1.3.3 and I don't see 
a real difference in the timings.
I would like to execute a ping (rank 0 sends a message to rank 1) and 
measure the duration of MPI_Send on rank 0 and of MPI_Recv on rank 1. 
I get the following results.


Without changing the eager mode:

bytes   MPI_Send (in msec)   MPI_Recv (in msec)
1       5.8                  52.2
2       5.6                  51.0
4       5.4                  51.1
8       5.6                  51.6
16      5.5                  49.7
32      5.4                  52.1
64      5.3                  53.3



With the eager mode disabled:

ompi_info --param btl tcp | grep eager
MCA btl: parameter "btl_tcp_eager_limit" (current value: "0", data 
source: environment)


bytes   MPI_Send (in msec)   MPI_Recv (in msec)
1       5.4                  52.3
2       5.4                  51.0
4       5.4                  52.1
8       5.4                  50.7
16      5.0                  50.2
32      5.1                  50.1
64      5.4                  52.8

However, I was expecting that with the eager mode disabled the duration 
of MPI_Send would be longer. Am I wrong? Is there any option to make 
MPI_Send behave like a blocking call for all message sizes?



Thanks a lot,
Best regards,
George Markomanolis






Re: [OMPI users] users Digest, Vol 1750, Issue 1

2010-11-25 Thread George Markomanolis



Message: 2
Date: Tue, 23 Nov 2010 10:27:37 -0800
From: Eugene Loh 
Subject: Re: [OMPI users] tool for measuring the ping with accuracy
To: Open MPI Users 

George Markomanolis wrote:

  

Dear Eugene,

Thanks a lot for the answer; you were right about the eager mode.

I have one more question. I am looking for an established tool to 
measure the ping time: just send a message of 1 byte or more and measure 
the duration of the MPI_Send call on rank 0 and the duration of the 
MPI_Recv on rank 1. I ask about a standard tool because I am also using 
SkaMPI, and the results depend heavily on whether synchronization is 
called before the measurement starts.


So, for example, with the processes synchronized, sending 1 byte, I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~52 ms

where 52 ms is almost half of the ping-pong time, and this is OK.

Without synchronizing I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~7 ms

However, I developed a simple application where rank 0 sends 1000 
messages of 1 byte to rank 1, and I get almost the second set of 
timings, around 7 ms. If in the same application I add the matching 
MPI_Recv and MPI_Send so that it becomes a ping-pong application, then 
the ping-pong duration is 100 ms (like SkaMPI). Can someone explain why 
this happens? The ping-pong takes 100 ms while the ping without 
synchronization takes 7 ms.



I'm not convinced I'm following you at all.  Maybe the following helps, 
though maybe it's just obvious and misses the point you're trying to make.


In a ping-pong test, you have something like this:

/* buf, count, peer, and tag stand for whatever the test uses */
tsend = MPI_Wtime();
MPI_Send(buf, count, MPI_BYTE, peer, tag, MPI_COMM_WORLD);
tsend = MPI_Wtime() - tsend;
trecv = MPI_Wtime();
MPI_Recv(buf, count, MPI_BYTE, peer, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
trecv = MPI_Wtime() - trecv;

The send time times how long it takes to get the message out of the 
user's send buffer.  This time is very short.  In contrast, the 
"receive" time mostly measures how long it takes for the ping message to 
reach the peer and the pong message to return.  The actual time to do 
the receive processing is very short and accounts for a tiny fraction of 
trecv.


If a sender sends many short messages to a receiver and the two 
processes don't synchronize much, you can overlap many messages and hide 
the long transit time.


Here's a simple model:

sender injects message into interconnect, MPI_Send completes  (this time 
is short)

message travels the interconnect to the receiver (this time is long)
receiver unpacks the message and MPI_Recv completes (this time is short)

A ping-pong test counts the long inter-process transit time.  Sending 
many short messages before synchronizing hides the long transit time.
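To make the two patterns concrete, here is a rough sketch (not a tuned 
benchmark) that times both with a 1-byte payload; it assumes exactly two 
ranks, and NMSG and the tag are arbitrary choices:

#include <mpi.h>
#include <stdio.h>

#define NMSG 1000

int main(int argc, char **argv)
{
    char byte = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;   /* assumes exactly two ranks */

    /* Pattern 1: a stream of pings with no synchronization.  The sender's
     * MPI_Send returns as soon as the byte is handed off, so the measured
     * per-message time hides the transit time. */
    double t0 = MPI_Wtime();
    for (int i = 0; i < NMSG; i++) {
        if (rank == 0)
            MPI_Send(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double stream = (MPI_Wtime() - t0) / NMSG;

    MPI_Barrier(MPI_COMM_WORLD);   /* separate the two measurements */

    /* Pattern 2: ping-pong.  Every iteration contains a full round trip,
     * so the transit time cannot be hidden. */
    t0 = MPI_Wtime();
    for (int i = 0; i < NMSG; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double pingpong = (MPI_Wtime() - t0) / NMSG;

    if (rank == 0)
        printf("streaming: %g s/msg, ping-pong: %g s/round trip\n",
               stream, pingpong);

    MPI_Finalize();
    return 0;
}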


Sorry if this discussion misses the point you're trying to make.

  

Dear Eugene,

Thanks a lot, this was what I wanted to know. Now I understand it.

Best regards,
George Markomanolis


[OMPI users] Understanding the buffering of small messages with tcp network

2011-03-10 Thread George Markomanolis

Dear all,

I would like to ask about a topic on which there are already many 
questions, but I am not very familiar with it. I want to understand the 
behaviour of an application with many messages smaller than 64 KB (eager 
mode) over a TCP network; I am trying to understand it in order to 
simulate this application.
For example, there can be one MPI_Send of 1200 bytes after some 
computation, then two messages of the same size, after more computation, 
etc. However, according to the measurements and the profiling, the cost 
of the communication is less than the latency of the network. I can 
understand that the cost of the MPI_Send is the copy into a buffer, but 
delivering the message to the destination should cost at least the 
latency. So are the messages buffered on the sender and sent as one 
packet to the receiver? My TCP window is 4 MB and I use the same value 
for snd_buff and rcv_buff. If they are buffered on the sender, what is 
the criterion/algorithm? I mean, if I have one message, then 
computation, and then another message, is it possible for these two 
messages to be buffered on the sender side, or does this happen only on 
the receiver? If there is any document/paper that I can read about this, 
I would appreciate a link.
A simple example: if I have a loop in which rank 0 sends two messages to 
rank 1, the duration of the first message is larger than that of the 
second, and if I increase the loop to 10 or 20 messages, all the 
messages cost much less than the first one and also less than what 
SkaMPI measures. So I am sure it must be a buffering issue (or something 
else that I can't think of).
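For reference, a small sketch of the kind of per-send measurement 
described above (the 1200-byte size is taken from the example; the 
message count and tag are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { NMSG = 20, SIZE = 1200 };
    char buf[SIZE] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < NMSG; i++) {
            double t = MPI_Wtime();
            MPI_Send(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            t = MPI_Wtime() - t;
            /* With eager sends over TCP, the first send is typically the
             * most expensive (connection setup, first copy); later sends
             * return much faster. */
            printf("send %2d: %g s\n", i, t);
        }
    } else if (rank == 1) {
        for (int i = 0; i < NMSG; i++)
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}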


Best regards,
Georges


[OMPI users] Question about oversubscribing

2012-11-04 Thread George Markomanolis

Dear all,

I am trying to execute an experiment by oversubscribing the nodes. I 
have several clusters available (I can use up to 8-10 different clusters 
in one execution) with around 1300 cores in total. I am executing the EP 
benchmark from the NAS suite, which means there are not a lot of MPI 
messages, just some collective MPI calls.


The number of MPI processes per node depends on the available memory of 
each node. Thus, in the machinefile I have declared one node 13 times if 
I want 13 MPI processes on it. Is that correct? Giving a machinefile of 
32768 nodes when I want to execute 32768 processes, does Open MPI behave 
as if there is no oversubscription? If so, how can I give a machinefile 
with a different number of MPI processes on each node? The maximum 
number of MPI processes that I have on a node is 388.


My problem is that I can execute 16384 processes but not 32768. In the 
first case the execution takes around 3 minutes, but in the second case 
the benchmark does not even start after 7 hours. There is no error; I 
just cancel the job myself, but I assume something is wrong because 7 
hours is too long. I should say that I executed the 16384-process 
instance without any problem. I added some debug output to the benchmark 
and I can see that the execution is stuck in MPI_Init; it never gets 
past this point. For the 16384-process instance, MPI_Init takes around 2 
minutes to finish. I have checked the memory of all the nodes and there 
is at least 0.5 GB of free memory on each node.


I know about the parameter mpi_yield_when_idle, but I have read that it 
will not improve performance if there are not a lot of MPI messages. I 
tried it anyway and nothing changed. I also tried mpi_preconnect_mpi 
just in case, but again nothing. Could you please suggest a reason why 
this is happening?


Moreover, I used just one node with 48 GB of memory to execute 2048 MPI 
processes without any problem; of course, I just had to wait a long time.


I am using Open MPI v1.4.1 and all the clusters are 64-bit.

I execute the benchmark with the following command:
mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude 
ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768


Best regards,
George Markomanolis


Re: [OMPI users] Question about oversubscribing

2012-11-04 Thread George Markomanolis

Dear Ralph,

I am copying your email from the website because I have enabled the 
option to receive all emails once per day.



On 11/04/2012 05:27 PM, George Markomanolis wrote:

> Dear all,

>
> I am trying to execute an experiment by oversubscribing the nodes. I 
have several clusters available (I can use up to 8-10 different clusters 
in one execution) with around 1300 cores in total. I am executing the EP 
benchmark from the NAS suite, which means there are not a lot of MPI 
messages, just some collective MPI calls.

>
> The number of MPI processes per node depends on the available memory 
of each node. Thus, in the machinefile I have declared one node 13 times 
if I want 13 MPI processes on it. Is that correct?


You *can* do it that way, or you could just use "slots=13" for that 
node in the file, and list it only once.



OK, but I assume the result is the same, right?


> Giving a machinefile of 32768 nodes when I want to execute 32768 
processes, does Open MPI behave as if there is no oversubscription?


Yes, it should - I assume you mean "slots" and not "nodes" in the 
above statement, since you indicate that you listed each node multiple 
times to set the number of slots on that node.



Yes, I mean slots.


> If so, how can I give a machinefile with a different number of MPI 
processes on each node? The maximum number of MPI processes that I have 
on a node is 388.


Just assign the number of slots on each node to be the number of 
processes you want on that node.
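For example, a machinefile along these lines (the hostnames here are 
placeholders) would start 13, 48, and 388 processes on three different 
nodes:

node-a slots=13
node-b slots=48
node-c slots=388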



OK


>
> My problem is that I can execute 16384 processes but not 32768. In 
the first case the execution takes around 3 minutes, but in the second 
case the benchmark does not even start after 7 hours. There is no error; 
I just cancel the job myself, but I assume something is wrong because 7 
hours is too long. I should say that I executed the 16384-process 
instance without any problem. I added some debug output to the benchmark 
and I can see that the execution is stuck in MPI_Init; it never gets 
past this point. For the 16384-process instance, MPI_Init takes around 2 
minutes to finish. I have checked the memory of all the nodes and there 
is at least 0.5 GB of free memory on each node.

>
> I know about the parameter mpi_yield_when_idle, but I have read that 
it will not improve performance if there are not a lot of MPI messages. 
I tried it anyway and nothing changed. I also tried mpi_preconnect_mpi 
just in case, but again nothing. Could you please suggest a reason why 
this is happening?


You indicated that these jobs are actually spanning multiple clusters 
- true? If so, when you cross that 16384 boundary, do you also cross 
clusters? Is it possible one or more of the additional clusters is 
blocking communications?


I have tried both configurations: using exactly the same nodes with 
fewer MPI processes per node, to check whether one site is blocking the 
others, and using half of the machinefile for the 16384-process 
instance, to see whether there is any issue with so many MPI processes 
per node. Both ran fine with the 16384-process instance. I also tried 
combining different quarters of the machinefile, to check whether there 
is any issue with the combination of specific sites, and again I didn't 
have a problem.


>
> Moreover, I used just one node with 48 GB of memory to execute 2048 
MPI processes without any problem; of course, I just had to wait a long 
time.

>
> I am using OpenMPI v1.4.1 and all the clusters are 64 bit.
>
> I execute the benchmark with the following command:
> mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude 
ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768


You could just leave off the "-np N" part of the command line - we'll 
assign one process to every slot specified in the machinefile.



OK, nice

Best regards,
George Markomanolis


>
> Best regards,
> George Markomanolis






[OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes

2012-11-26 Thread George Markomanolis

Dear all,

First, I would like advice on how to identify the maximum number of MPI 
processes that can be executed on a node with oversubscription. When I 
try to execute an application with 4096 MPI processes on a 24-core node 
with 48 GB of memory, I get the error "Unknown error: 1", while the 
memory is not even half used. I can execute the same application with 
2048 MPI processes in less than one minute. I have checked the Linux 
limit on the maximum number of processes, and it is much larger than 
4096.


Another, more generic question is about discovering nodes with faulty 
memory. Is there any way to identify them? I found by accident that a 
node with exactly the same hardware could not execute an MPI application 
when it was using more than 12 GB of RAM, while a second node could use 
all of its 48 GB of memory. With 500+ nodes it is difficult to check all 
of them, and I am not aware of any efficient solution. Initially I 
thought about memtester, but it takes a lot of time. I know this is not 
exactly on topic for this mailing list, but I thought that maybe an Open 
MPI user knows something about it.



Best regards,
George Markomanolis


Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes

2012-11-27 Thread George Markomanolis

Dear Ralph,

Thanks for the answer, I am using OMPI v1.4.1.

Best regards,
George Markomanolis

On 11/26/2012 05:07 PM, Ralph Castain wrote:

What version of OMPI are you using?

On Nov 26, 2012, at 1:02 AM, George Markomanolis  
wrote:


Dear all,

First, I would like advice on how to identify the maximum number of MPI 
processes that can be executed on a node with oversubscription. When I 
try to execute an application with 4096 MPI processes on a 24-core node 
with 48 GB of memory, I get the error "Unknown error: 1", while the 
memory is not even half used. I can execute the same application with 
2048 MPI processes in less than one minute. I have checked the Linux 
limit on the maximum number of processes, and it is much larger than 
4096.

Another, more generic question is about discovering nodes with faulty 
memory. Is there any way to identify them? I found by accident that a 
node with exactly the same hardware could not execute an MPI application 
when it was using more than 12 GB of RAM, while a second node could use 
all of its 48 GB of memory. With 500+ nodes it is difficult to check all 
of them, and I am not aware of any efficient solution. Initially I 
thought about memtester, but it takes a lot of time. I know this is not 
exactly on topic for this mailing list, but I thought that maybe an Open 
MPI user knows something about it.


Best regards,
George Markomanolis





Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes

2012-11-27 Thread George Markomanolis

Dear Jeff,

Of course, I was thinking of executing memtester on each node at the 
same time and gathering the outputs. However, running memtester on a 
node with 48 GB of memory takes a long time (more than 1-2 hours, I 
don't remember exactly, maybe even more, because I cancelled its 
execution), and I have to consume resources just for testing. So I was 
curious whether you know of a tool/procedure that works much faster. Of 
course, filling the memory with an application also works, but I don't 
know how reliable that is.


Best regards,
George Markomanolis

On 11/26/2012 06:09 PM, Jeff Squyres wrote:

On Nov 26, 2012, at 4:02 AM, George Markomanolis wrote:


Another, more generic question is about discovering nodes with faulty 
memory. Is there any way to identify them? I found by accident that a 
node with exactly the same hardware could not execute an MPI application 
when it was using more than 12 GB of RAM, while a second node could use 
all of its 48 GB of memory. With 500+ nodes it is difficult to check all 
of them, and I am not aware of any efficient solution. Initially I 
thought about memtester, but it takes a lot of time. I know this is not 
exactly on topic for this mailing list, but I thought that maybe an Open 
MPI user knows something about it.

You really do want something like a memory tester.  MPI applications *might* 
beat on your memory to identify errors, but that's really just a side effect of 
HPC access patterns.  You really want a dedicated memory tester.

If such a memory tester takes a long time, you might want to use mpirun to 
launch it on multiple nodes simultaneously to save some time...?
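For example, with a machinefile that lists each node exactly once, 
something along these lines would run one memtester instance per node in 
parallel (the file name, the amount of memory to test, and the single 
loop count are placeholders to adapt):

mpirun -machinefile nodes_once memtester 40G 1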





Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes

2012-11-27 Thread George Markomanolis

Dear Ralph,

For the file descriptors, the declared limit is over 65536 files, but if 
OMPI needs several of them per process, then this could indeed be the 
issue. Is there any source to read about it, or should I just 
experiment? Regarding the limit on child processes, can I do something 
about that as well? I have root access, so I can change the values.



Best regards,
George Markomanolis

On 11/27/2012 05:58 PM, Ralph Castain wrote:

Just glancing at the code, I don't see anything tied to 2**12 that pops out at 
me. I suspect the issue is that you are hitting a system limit on the number of 
child processes a process can spawn - this is different from the total number 
of processes allowed on the node - or the number of file descriptors a process 
can have open (we need several per process for I/O forwarding).


On Nov 27, 2012, at 8:24 AM, George Markomanolis  
wrote:


Dear Ralph,

Thanks for the answer, I am using OMPI v1.4.1.

Best regards,
George Markomanolis

On 11/26/2012 05:07 PM, Ralph Castain wrote:

What version of OMPI are you using?

On Nov 26, 2012, at 1:02 AM, George Markomanolis  
wrote:


Dear all,

First, I would like advice on how to identify the maximum number of MPI 
processes that can be executed on a node with oversubscription. When I 
try to execute an application with 4096 MPI processes on a 24-core node 
with 48 GB of memory, I get the error "Unknown error: 1", while the 
memory is not even half used. I can execute the same application with 
2048 MPI processes in less than one minute. I have checked the Linux 
limit on the maximum number of processes, and it is much larger than 
4096.

Another, more generic question is about discovering nodes with faulty 
memory. Is there any way to identify them? I found by accident that a 
node with exactly the same hardware could not execute an MPI application 
when it was using more than 12 GB of RAM, while a second node could use 
all of its 48 GB of memory. With 500+ nodes it is difficult to check all 
of them, and I am not aware of any efficient solution. Initially I 
thought about memtester, but it takes a lot of time. I know this is not 
exactly on topic for this mailing list, but I thought that maybe an Open 
MPI user knows something about it.


Best regards,
George Markomanolis







Re: [OMPI users] openmpi+torque: How run job in a subset of the allocation?

2013-11-28 Thread George Markomanolis

Hi,

Here is what I do to execute 20 mpirun calls with LSF in a single job; I 
assume it is similar for your case.

I use $LSB_HOSTS to extract the hosts from the job. I know how many 
cores I want per run, so I create machine files. For our application, 
each run has its own nodes, but the last MPI processes are on a shared 
node. For example, if I have two mpirun calls I need 40 cores (20 cores 
each). I use three nodes (16 cores per node). First mpirun call: the 
first node plus cores 0-3 on the second node. Second mpirun call: the 
third node plus cores 4-7 on the second node. I do this in order not to 
waste resources, as I will need to execute ~20 mpirun calls, not just 
two, and the last 4 MPI processes do a different task from the first 16.


So I create machine files like this:
rank 0=s15r1b45 slot=0
rank 1=s15r1b45 slot=1
rank 2=s15r1b45 slot=2
rank 3=s15r1b45 slot=3



Now, from the root node, execute multiple mpirun calls like:

mpirun ... &

and after all of them use the command wait.

So you start many mpirun calls in the background, and the wait makes 
sure that the job is not killed before the executions have finished.
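In the job script the launch step then looks roughly like this (the 
options, machine files, and executables are placeholders):

mpirun <options and machine file for run 01> ./app_01 &
mpirun <options and machine file for run 02> ./app_02 &
# ... one line per mpirun call ...
wait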


Just be careful that the machine files do not share resources (cores, in 
my case).


Best regards,
George Markomanolis

On 11/27/2013 10:02 PM, Ralph Castain wrote:

I'm afraid the two solvers would be in the same comm_world if launched that way

Sent from my iPhone


On Nov 27, 2013, at 11:58 AM, Gus Correa  wrote:

Hi Ola, Ralph

I may be wrong, but I'd guess launching the two solvers
in MPMD/MIMD mode would work smoothly with the torque PBS_NODEFILE,
in a *single* Torque job.
If I understood Ola right, that is what he wants.

Say, something like this (for one 32-core node):

#PBS -l nodes=1:ppn=32
...
mpiexec -np 8 ./solver1 : -np 24 ./solver2

I am assuming the two executables never talk to each other, right?
They solve the same problem with different methods, in a completely
independent and "embarrassingly parallel" fashion, and could run
concurrently.

Is that right?
Or did I misunderstand Ola's description, and they work in a staggered sequence 
to each other?
[first s1, then s2, then s1 again, then s2 once more...]
I am a bit confused by Ola's use of the words "loosely coupled" in his 
description, which might indicate cooperation to solve the same problem,
rather than independent work on two instances of the same problem.

Ralph: Does the MPI model assume that MPMD/MIMD executables
have to necessarily communicate with each other,
or perhaps share a common MPI_COMM_WORLD?
[I guess not.]

Anyway, just a guess,
Gus Correa


On 11/27/2013 10:23 AM, Ralph Castain wrote:
Are you wanting to run the solvers on different nodes within the
allocation? Or on different cores across all nodes?

For different nodes, you can just use -host to specify which nodes you
want that specific mpirun to use, or a hostfile should also be fine. The
FAQ's comment was aimed at people who were giving us the PBS_NODEFILE as
the hostfile - which could confuse older versions of OMPI into using the
rsh launcher instead of Torque. Remember that we have the relative node
syntax so you don't actually have to name the nodes - helps if you want
to execute batch scripts and won't know the node names in advance.

For different cores across all nodes, you would need to use some binding
trickery that may not be in the 1.4 series, so you might need to update
to the 1.6 series. You have two options: (a) have Torque bind your
mpirun to specific cores (I believe it can do that), or (b) use
--slot-list to specify which cores that particular mpirun is to use. You
can then separate the two solvers but still run on all the nodes, if
that is of concern.

HTH
Ralph



On Wed, Nov 27, 2013 at 6:10 AM, <ola.widl...@se.abb.com> wrote:

Hi,

We have an in-house application where we run two solvers in a
loosely coupled manner: The first solver runs a timestep, then the
second solver does work on the same timestep, etc. As the two
solvers never execute at the same time, we would like to run the two
solvers in the same allocation (launching mpirun once for each of
them). RAM is not an issue, so there should not be any risk of
excessive swapping degrading performance.

We use openmpi-1.4.5 compiled with torque integration. The torque
integration means we do not give a hostfile to mpirun, it will
itself query torque for the allocation info.

Question:

Can we force one of the solvers to run in a *subset* of the full
allocation? How do we do that? I read in the FAQ that providing a
hostfile to mpirun in this case (when it's not needed due to torque
integration) would cause a lot of problems...

Thanks in advance,

Ola



[OMPI users] question about algorithms for collective communication

2009-08-23 Thread George Markomanolis

Dear all,

I am trying to figure out which algorithms are used for some collective 
communications (allreduce, bcast, alltoall). Is there any document 
explaining which algorithms are used? For example, I would like to know 
exactly how an allreduce call is decomposed into sends and receives.


Thanks a lot,
Best regards,
George Markomanolis






[OMPI users] using specific algorithm for collective communication, and knowing the root cpu?

2009-11-02 Thread George Markomanolis

Dear all,

I would like to ask about collective communication. With debug mode 
enabled, I can see a lot of information during execution about which 
algorithm is used, etc. My question is that I would like to use a 
specific algorithm (the simplest one, I suppose). I am profiling some 
applications and I want to simulate them with another program, so I must 
be able to know, for example, what MPI_Allreduce is doing. I saw that 
many algorithms depend on the message size and the number of processes, 
so I would like to ask:


1) What is the way to tell Open MPI to use a simple algorithm for 
allreduce (is there any way to request the simplest algorithm for all 
collective communication?)? Basically, I would like to know the root CPU 
for every collective communication. What are the disadvantages of 
demanding the simplest algorithm?


2) Is there any overhead from having installed Open MPI in debug mode, 
even if I just run a program without any --mca flags?


3) How would you describe allreduce in words? Can we say that the root 
CPU does a reduce and then a broadcast? I mean, is that right for your 
implementation? I saw that which CPU is the root depends on the 
algorithm, so is it possible to use an algorithm where I know every time 
that the CPU with rank 0 is the root?


Thanks a lot,
George


Re: [OMPI users] using specific algorithm for collective communication and knowing the root cpu?

2009-11-04 Thread George Markomanolis

Dear George,

Thanks for the answer.
I have some questions, because I am using some programs for profiling. 
When you say that the cost of the allreduce rises, do you mean only the 
time, or also the flops of this call? Is there some additional work 
added to the allreduce, or is it only about time? During profiling I 
want to count the flops, so a small difference in timing because of 
debug mode and of selecting the allreduce algorithm is not a big deal, 
but if it also changes the flops, then that is bad for me. When I 
executed a program in debug mode, I saw that Open MPI uses several 
algorithms, and looking at your code I saw that rank 0 is not always the 
root CPU (if I understood correctly). Finally, in your opinion, what is 
the best way to know which algorithm is used in a collective 
communication and which CPU is its root?


Best regards,
George




Today's Topics:

   1. Re: using specific algorithm for collective   communication,
  and knowing the root cpu? (George Bosilca)


--

Message: 1
Date: Tue, 3 Nov 2009 12:09:18 -0500
From: George Bosilca 
Subject: Re: [OMPI users] using specific algorithm for collective
communication, and knowing the root cpu?
To: Open MPI Users 

You can add the following MCA parameters either on the command line or  
in the $(HOME)/.openmpi/mca-params.conf file.


On Nov 2, 2009, at 08:52, George Markomanolis wrote:

  

Dear all,

I would like to ask about collective communication. With debug mode 
enabled, I can see a lot of information during execution about which 
algorithm is used, etc. My question is that I would like to use a 
specific algorithm (the simplest one, I suppose). I am profiling some 
applications and I want to simulate them with another program, so I must 
be able to know, for example, what MPI_Allreduce is doing. I saw that 
many algorithms depend on the message size and the number of processes, 
so I would like to ask:


1) what is the way to say at open mpi to use a simple algorithm for  
allreduce (is there any way to say to use the simplest algorithm for  
all the collective communication?). Basically I would like to know  
the root cpu for every collective communication. What are the  
disadvantages for demanding the simplest algorithm?



coll_tuned_use_dynamic_rules=1 to allow you to manually set the 
algorithms to be used.
coll_tuned_allreduce_algorithm=*something between 0 and 5* to describe 
the algorithm to be used. For the simplest algorithm I guess you will 
want to use 1 (star based fan-in fan-out).
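For example (the executable name and process count here are 
placeholders), either on the command line:

mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 1 -np 16 ./my_app

or as two lines in $(HOME)/.openmpi/mca-params.conf:

coll_tuned_use_dynamic_rules = 1
coll_tuned_allreduce_algorithm = 1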


The main disadvantage is that the cost of the allreduce will rise, which 
will negatively impact the overall performance of the application.


  
2) Is there any overhead from having installed Open MPI in debug mode, 
even if I just run a program without any --mca flags?



There is a lot of overhead because you compiled in debug mode. We do a 
lot of extra tracking of internally allocated memory, checks on most/all 
internal objects, and so on. Based on previous results I would say your 
latency increases by about 2-3 microseconds, but the impact on the 
bandwidth is minimal.


  
3) How would you describe allreduce in words? Can we say that the root 
CPU does a reduce and then a broadcast? I mean, is that right for your 
implementation? I saw that which CPU is the root depends on the 
algorithm, so is it possible to use an algorithm where I know every time 
that the CPU with rank 0 is the root?



Exactly: allreduce = reduce + bcast (and btw this is what the basic 
algorithm will do). However, there is no root in an allreduce, as all 
processors execute symmetric work. Of course, if one sees the allreduce 
as a reduce followed by a broadcast, then one has to select a root (I 
guess we pick rank 0 in our implementation).
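Expressed as code, that reduce-then-broadcast view is roughly the 
following sketch (buffer names, count, datatype, and reduction operation 
are placeholders; rank 0 plays the root):

#include <mpi.h>

/* Sketch: an allreduce viewed as a reduce to rank 0 followed by a bcast. */
void allreduce_as_reduce_bcast(double *sendbuf, double *recvbuf,
                               int count, MPI_Comm comm)
{
    const int root = 0;   /* rank 0 acts as the root of the reduce/bcast pair */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, root, comm);
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, root, comm);
}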


   george.

  

Thanks a lot,
George