[OMPI users] Making MPI_Send behave as blocking for all message sizes
Dear all,

I am trying to disable the eager mode in Open MPI 1.3.3, but I do not see any real difference in the timings. I would like to execute a ping (rank 0 sends a message to rank 1) and measure the duration of the MPI_Send on rank 0 and the duration of the MPI_Recv on rank 1. I have the following results.

Without changing the eager mode:

bytes   MPI_Send (msec)   MPI_Recv (msec)
1       5.8               52.2
2       5.6               51.0
4       5.4               51.1
8       5.6               51.6
16      5.5               49.7
32      5.4               52.1
64      5.3               53.3

With the eager mode disabled:

ompi_info --param btl tcp | grep eager
MCA btl: parameter "btl_tcp_eager_limit" (current value: "0", data source: environment)

bytes   MPI_Send (msec)   MPI_Recv (msec)
1       5.4               52.3
2       5.4               51.0
4       5.4               52.1
8       5.4               50.7
16      5.0               50.2
32      5.1               50.1
64      5.4               52.8

However, I was expecting that with the eager mode disabled the duration of MPI_Send would be longer. Am I wrong? Is there any option to make MPI_Send behave as a blocking call for all message sizes?

Thanks a lot,
Best regards,
George Markomanolis
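For reference, the ompi_info output above shows the eager limit coming from the environment. A hedged sketch of the two standard ways such an MCA parameter can be set (the program name ./ping is a placeholder, not taken from the original post):

export OMPI_MCA_btl_tcp_eager_limit=0            # via the environment, matching "data source: environment"
mpirun --mca btl_tcp_eager_limit 0 -np 2 ./ping  # or per run, on the mpirun command line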
[OMPI users] Tool for measuring the ping time accurately
Dear Eugene,

Thanks a lot for the answer; you were right about the eager mode. I have one more question. I am looking for an official tool to measure the ping time: just sending a message of 1 byte or more and measuring the duration of the MPI_Send call on rank 0 and the duration of the MPI_Recv on rank 1. I would like to know of an established tool because I am also using SkaMPI, and the results depend heavily on whether a synchronization is done before the measurement starts. For example, with the processes synchronized, sending 1 byte, I have:

rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~52 ms

where 52 ms is almost half of the ping-pong time, and this is fine. Without synchronizing I have:

rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~7 ms

However, I developed a simple application where rank 0 sends 1000 messages of 1 byte to rank 1, and I get roughly the second set of timings, around 7 ms. If in the same application I add the matching MPI_Recv and MPI_Send so that it becomes a ping-pong application, then the ping-pong duration is 100 ms (like SkaMPI). Can someone explain why this is happening? The ping-pong takes 100 ms, while the ping without synchronization takes 7 ms.

Thanks a lot,
Best regards,
George Markomanolis

Date: Thu, 18 Nov 2010 10:31:40 -0800
From: Eugene Loh
Subject: Re: [OMPI users] Making MPI_Send behave as blocking for all message sizes
To: Open MPI Users

Try lowering the eager threshold more gradually -- e.g., 4K, 2K, 1K, 512, etc. -- and watch what happens. I think you will see what you expect, except that once you get too small the value is ignored entirely. So, the setting just won't work at the extreme value (0) you want.

Maybe the thing to do is convert your MPI_Send calls to MPI_Ssend calls. Or, compile in a wrapper that intercepts MPI_Send calls and implements them by calling PMPI_Ssend.
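As a rough sketch of the wrapper approach Eugene suggests (illustrative code, not taken from the thread; the MPI-2 era signature without const is used to match Open MPI 1.3.x headers), one can intercept MPI_Send through the PMPI profiling interface and forward it to a synchronous send:

/* ssend_wrapper.c: link this object into the application so that every
 * MPI_Send behaves like MPI_Ssend, i.e. it does not complete until the
 * matching receive has started. */
#include <mpi.h>

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* Forward to the profiling entry point of the synchronous send. */
    return PMPI_Ssend(buf, count, datatype, dest, tag, comm);
}

Compile it with mpicc (for example, mpicc -c ssend_wrapper.c) and add the resulting object file to the application's link line ahead of the MPI library.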
Re: [OMPI users] users Digest, Vol 1750, Issue 1
Date: Tue, 23 Nov 2010 10:27:37 -0800
From: Eugene Loh
Subject: Re: [OMPI users] Tool for measuring the ping time accurately
To: Open MPI Users

I'm not convinced I'm following you at all. Maybe the following helps, though maybe it's just obvious and misses the point you're trying to make. In a ping-pong test, you have something like this:

tsend = MPI_Wtime()
MPI_Send
tsend = MPI_Wtime() - tsend

trecv = MPI_Wtime()
MPI_Recv
trecv = MPI_Wtime() - trecv

The send time measures how long it takes to get the message out of the user's send buffer. This time is very short. In contrast, the "receive" time mostly measures how long it takes for the ping message to reach the peer and the pong message to return. The actual time to do the receive processing is very short and accounts for a tiny fraction of trecv.

If a sender sends many short messages to a receiver and the two processes don't synchronize much, you can overlap many messages and hide the long transit time. Here's a simple model:

- the sender injects the message into the interconnect and MPI_Send completes (this time is short)
- the message travels the interconnect to the receiver (this time is long)
- the receiver unpacks the message and MPI_Recv completes (this time is short)

A ping-pong test counts the long inter-process transit time. Sending many short messages before synchronizing hides the long transit time. Sorry if this discussion misses the point you're trying to make.

Dear Eugene,

Thanks a lot, this was what I wanted to know. Now I understand it.

Best regards,
George Markomanolis
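To make Eugene's model above concrete, here is a minimal ping-pong timing sketch in C (written for this summary, not code from the thread; run it with at least two processes, e.g. mpirun -np 2 ./pingpong):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    double tsend, trecv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* ping: time only the send on rank 0 */
        tsend = MPI_Wtime();
        MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        tsend = MPI_Wtime() - tsend;

        /* pong: wait for the reply; the round-trip transit time shows up here */
        trecv = MPI_Wtime();
        MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        trecv = MPI_Wtime() - trecv;

        printf("MPI_Send: %g s, MPI_Recv (pong): %g s\n", tsend, trecv);
    } else if (rank == 1) {
        MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The send on rank 0 returns as soon as the 1-byte message has been handed off, while the receive of the pong absorbs the round-trip transit time, which is Eugene's point about where the cost appears.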
[OMPI users] Understanding the buffering of small messages over the TCP network
Dear all,

I would like to ask about a topic on which there are already many questions, but I am not very familiar with it. I want to understand the behaviour of an application where there are many messages smaller than 64 KB (eager mode) and I use the TCP network. I am trying to understand it in order to simulate this application. For example, it is possible to have one MPI_Send of 1200 bytes after some computation, then two messages of the same size, after more computation, etc. However, according to the measurements and the profiling, the cost of the communication is less than the latency of the network. I understand that the cost of MPI_Send is the copy into the buffer; however, delivering the message to the destination should cost at least the latency. So are the messages buffered on the sender and sent to the receiver as one packet? My TCP window is 4 MB and I use the same value for the send and receive buffers. If the messages are buffered on the sender, what is the criterion/algorithm? I mean, if I have one message, then computation, and then another message, is it possible for these two messages to be buffered on the sender side, or does this happen only on the receiver? If there is any document/paper I can read about this, I would appreciate a link.

A simple example: if I have a loop in which rank 0 sends two messages to rank 1, the duration of the first message is larger than the second one, and if I increase the loop to 10 or 20 messages, then all the messages cost a lot less than the first one, and also less than what SkaMPI measures. So I am sure it must be a buffering issue (or something else that I cannot think of).

Best regards,
Georges
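A minimal sketch of the experiment described above (written for this note, not taken from the thread; the message size and count are arbitrary, and one common explanation for the expensive first send is that Open MPI establishes TCP connections lazily on first use):

#include <mpi.h>
#include <stdio.h>

#define NMSG 20   /* number of back-to-back small messages */

/* run with at least two processes, e.g. mpirun -np 2 ./eager_loop */
int main(int argc, char **argv)
{
    int rank, i;
    char buf[1200] = {0};   /* well under the ~64 KB TCP eager limit */
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < NMSG; i++) {
            t = MPI_Wtime();
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            t = MPI_Wtime() - t;
            printf("send %d: %g s\n", i, t);  /* the first send is usually the slowest */
        }
    } else if (rank == 1) {
        for (i = 0; i < NMSG; i++)
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}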
[OMPI users] Question about oversubscribing
Dear all,

I am trying to execute an experiment by oversubscribing the nodes. I have available some clusters (I can use up to 8-10 different clusters during one execution) and in total around 1300 cores. I am executing the EP benchmark from the NAS suite, which means that there are not a lot of MPI messages, just some collective MPI calls.

The number of MPI processes per node depends on the available memory of each node. Thus in the machinefile I have declared one node 13 times if I want 13 MPI processes on it. Is that correct? If I give a machinefile with 32768 entries when I want to execute 32768 processes, does Open MPI behave as if there is no oversubscription? If so, how can I give a machinefile with a different number of MPI processes on each node? The maximum number of MPI processes that I have on a node is 388.

My problem is that I can execute 16384 processes but not 32768. In the first case I need around 3 minutes for the execution, but in the second case, even after 7 hours the benchmark does not even start. There is no error; I just cancel the job myself, but I assume something is wrong because 7 hours is too much. I have to say that I executed the 16384-process instance without any problem. I added some debug info to the benchmark and I can see that the execution is stuck in MPI_Init; it never passes this point. For the 16384-process instance I need around 2 minutes to finish the MPI_Init call. I am checking the memory of all the nodes and there is at least 0.5 GB of free memory on each node.

I know about the parameter mpi_yield_when_idle, but I have read that it will not improve performance if there are not many MPI messages. I tried it anyway and nothing changed. I also tried mpi_preconnect_mpi just in case, but again nothing. Could you please suggest a reason why this is happening? Moreover, I used just one node with 48 GB of memory to execute 2048 MPI processes without any problem; of course, I just had to wait a long time.

I am using Open MPI v1.4.1 and all the clusters are 64-bit. I execute the benchmark with the following command:

mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768

Best regards,
George Markomanolis
Re: [OMPI users] Question about oversubscribing
Dear Ralph,

I am copying your email from the web site because I have enabled the option to receive all the list emails once per day.

On 11/04/2012 05:27 PM, George Markomanolis wrote:
>> I am trying to execute an experiment by oversubscribing the nodes. [...] The number of MPI processes per node depends on the available memory of each node. Thus in the machinefile I have declared one node 13 times if I want 13 MPI processes on it. Is that correct?
>
> You *can* do it that way, or you could just use "slots=13" for that node in the file, and list it only once.

OK, but I assume the result is the same, right?

>> If I give a machinefile with 32768 entries when I want to execute 32768 processes, does Open MPI behave as if there is no oversubscription?
>
> Yes, it should - I assume you mean "slots" and not "nodes" in the above statement, since you indicate that you listed each node multiple times to set the number of slots on that node.

Yes, I mean slots.

>> If so, how can I give a machinefile with a different number of MPI processes on each node? The maximum number of MPI processes that I have on a node is 388.
>
> Just assign the number of slots on each node to be the number of processes you want on that node.

OK.

>> My problem is that I can execute 16384 processes but not 32768. [...] I can see that the execution is stuck in MPI_Init; it never passes this point. [...] I tried mpi_yield_when_idle and mpi_preconnect_mpi, but again nothing. Could you please suggest a reason why this is happening?
>
> You indicated that these jobs are actually spanning multiple clusters - true? If so, when you cross that 16384 boundary, do you also cross clusters? Is it possible one or more of the additional clusters is blocking communications?

I have tried both configurations, even using exactly the same nodes with fewer MPI processes per node, in order to check whether one site is blocking the others, and I have tried half of the machinefile in order to see whether there is any issue with so many MPI processes per node; both ran fine with the 16384-process instance. I also tried combining different quarters of the machinefile in order to check whether there is any issue with the combination of specific sites, and again I did not have a problem.
>> I execute the benchmark with the following command:
>> mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768
>
> You could just leave off the "-np N" part of the command line - we'll assign one process to every slot specified in the machinefile.

OK, nice.

Best regards,
George Markomanolis
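To make the two machinefile styles discussed above concrete, a hedged example (the hostnames are placeholders, not from the thread): either repeat a host once per desired process,

node-a
node-a
node-a
(13 lines in total for 13 processes on node-a)

or, equivalently and more compactly, list each host once with a slot count:

node-a slots=13
node-b slots=388

With the slots form and no -np on the mpirun command line, Open MPI launches one process per listed slot, as Ralph notes above.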
[OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes
Dear all,

First, I would like some advice on how to identify the maximum number of MPI processes that can be executed on a node with oversubscription. When I try to execute an application with 4096 MPI processes on a 24-core node with 48 GB of memory, I get the error "Unknown error: 1", while the memory is not even half used. I can execute the same application with 2048 MPI processes in less than one minute. I have checked the Linux settings for the maximum number of processes, and the limit is much larger than 4096.

Another, more generic question is about discovering nodes with faulty memory. Is there any way to identify nodes with faulty memory? I found by accident that one node could not execute an MPI application when it was using more than 12 GB of RAM, while a second node with exactly the same hardware could use all of its 48 GB of memory. With 500+ nodes it is difficult to check all of them, and I am not aware of any efficient solution. Initially I thought about memtester, but it takes a lot of time. I know this is not exactly on topic for this mailing list, but I thought that maybe an Open MPI user knows something about it.

Best regards,
George Markomanolis
Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes
Dear Ralph,

Thanks for the answer. I am using OMPI v1.4.1.

Best regards,
George Markomanolis

On 11/26/2012 05:07 PM, Ralph Castain wrote:
> What version of OMPI are you using?
Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes
Dear Jeff,

Of course, I was already thinking of executing memtester on all the nodes at the same time and gathering the outputs. However, executing memtester on a node with 48 GB of memory takes a lot of time (more than 1-2 hours, I don't remember exactly; maybe even more, because I cancelled its execution), and I have to consume resources just for testing. So I was curious whether you know of a tool or procedure that works much faster. Of course, filling the memory with an application also works, but I don't know how reliable that is.

Best regards,
George Markomanolis

On 11/26/2012 06:09 PM, Jeff Squyres wrote:
> You really do want something like a memory tester. MPI applications *might* beat on your memory enough to identify errors, but that's really just a side effect of HPC access patterns. You really want a dedicated memory tester. If such a memory tester takes a long time, you might want to use mpirun to launch it on multiple nodes simultaneously to save some time...?
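A hedged sketch of Jeff's suggestion (the machinefile name, memory size, and node count are placeholders; the machinefile should list each node exactly once, and memtester's usual invocation is "memtester <amount-of-memory> [iterations]"):

mpirun -machinefile nodes_once_each -np <number_of_nodes> memtester 40G 1

Each rank then tests its own node's memory in parallel, and nodes whose memtester run reports errors (or exits with a non-zero status) are the suspect ones.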
Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes
Dear Ralph,

For the file descriptors, the declared limit is over 65536 files, but if OMPI needs several of them per process, then this could be the issue. Is there anything I can read about this, or should I just experiment? About the child processes, again, can I do something? I have root access, so I can change the values.

Best regards,
George Markomanolis

On 11/27/2012 05:58 PM, Ralph Castain wrote:
> Just glancing at the code, I don't see anything tied to 2**12 that pops out at me. I suspect the issue is that you are hitting a system limit on the number of child processes a process can spawn - this is different from the total number of processes allowed on the node - or the number of file descriptors a process can have open (we need several per process for I/O forwarding).
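As general background (standard Linux settings, not something specified in this thread), the limits Ralph mentions can be inspected with, for example:

ulimit -n    (maximum open file descriptors for the current shell)
ulimit -u    (maximum user processes)
cat /proc/sys/kernel/pid_max

Persistent changes are usually made in /etc/security/limits.conf (the nofile and nproc entries) and via sysctl for kernel.pid_max; which limit actually matters for a given failure is system-dependent, so these are only starting points.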
Re: [OMPI users] openmpi+torque: How to run a job in a subset of the allocation?
Hi,

Here is what I do to execute 20 mpirun calls using LSF and a single job; I assume it is similar for your case. I use $LSB_HOSTS to extract the hosts of the job. I know how many cores I want per run, so I create machine files. For our application, each execution has its own nodes, but the last MPI processes are placed on a shared node. For example, if I have two mpirun calls I need 40 cores (20 cores each). I use three nodes (16 cores per node). First mpirun call: the first node, plus cores 0-3 on the second node. Second mpirun call: the third node, plus cores 4-7 on the second node. I do this in order not to waste resources, as I will need to execute ~20 mpirun calls, not just two, and also the last 4 MPI processes do a different task from the first 16. So I create machine files like this:

rank 0=s15r1b45 slot=0
rank 1=s15r1b45 slot=1
rank 2=s15r1b45 slot=2
rank 3=s15r1b45 slot=3

Now, from the root node, execute multiple mpirun calls like:

mpirun &

and after them use the command wait. So you start many mpirun calls in the background, and with the wait you are sure that the job will not be killed before the executions have finished. Just be careful that the machine files do not include common resources (cores, in my case). A short sketch of this pattern is shown after the quoted thread below.

Best regards,
George Markomanolis

On 11/27/2013 10:02 PM, Ralph Castain wrote:

I'm afraid the two solvers would be in the same comm_world if launched that way.

On Nov 27, 2013, at 11:58 AM, Gus Correa wrote:

Hi Ola, Ralph,

I may be wrong, but I'd guess launching the two solvers in MPMD/MIMD mode would work smoothly with the Torque PBS_NODEFILE, in a *single* Torque job. If I understood Ola right, that is what he wants. Say, something like this (for one 32-core node):

#PBS -l nodes=1:ppn=32
...
mpiexec -np 8 ./solver1 : -np 24 ./solver2

I am assuming the two executables never talk to each other, right? They solve the same problem with different methods, in a completely independent and "embarrassingly parallel" fashion, and could run concurrently. Is that right? Or did I misunderstand Ola's description, and they work in a staggered sequence to each other? [first s1, then s2, then s1 again, then s2 once more...] I am a bit confused by Ola's use of the words "loosely coupled" in his description, which might indicate cooperation to solve the same problem, rather than independent work on two instances of the same problem.

Ralph: Does the MPI model assume that MPMD/MIMD executables have to necessarily communicate with each other, or perhaps share a common MPI_COMM_WORLD? [I guess not.]

Anyway, just a guess,
Gus Correa

On 11/27/2013 10:23 AM, Ralph Castain wrote:

Are you wanting to run the solvers on different nodes within the allocation? Or on different cores across all nodes?

For different nodes, you can just use -host to specify which nodes you want that specific mpirun to use, or a hostfile should also be fine. The FAQ's comment was aimed at people who were giving us the PBS_NODEFILE as the hostfile - which could confuse older versions of OMPI into using the rsh launcher instead of Torque. Remember that we have the relative node syntax so you don't actually have to name the nodes - this helps if you want to execute batch scripts and won't know the node names in advance.

For different cores across all nodes, you would need to use some binding trickery that may not be in the 1.4 series, so you might need to update to the 1.6 series.
You have two options: (a) have Torque bind your mpirun to specific cores (I believe it can do that), or (b) use --slot-list to specify which cores that particular mpirun is to use. You can then separate the two solvers but still run on all the nodes, if that is of concern.

HTH
Ralph

On Wed, Nov 27, 2013 at 6:10 AM, ola.widl...@se.abb.com wrote:

Hi,

We have an in-house application where we run two solvers in a loosely coupled manner: the first solver runs a timestep, then the second solver does work on the same timestep, etc. As the two solvers never execute at the same time, we would like to run them in the same allocation (launching mpirun once for each of them). RAM is not an issue, so there should not be any risk of excessive swapping degrading performance.

We use openmpi-1.4.5 compiled with Torque integration. The Torque integration means we do not give a hostfile to mpirun; it will itself query Torque for the allocation info.

Question: Can we force one of the solvers to run in a *subset* of the full allocation? How do we do that? I read in the FAQ that providing a hostfile to mpirun in this case (when it is not needed due to the Torque integration) would cause a lot of problems...

Thanks in advance,
Ola
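The pattern George describes at the top of this thread, sketched roughly (the rankfile names, process counts, and executable names are placeholders; -rf/--rankfile is how a "rank N=host slot=M" file like the one shown is passed to mpirun, but check mpirun --help for the exact spelling in your version):

mpirun -rf rankfile.0 -np 20 ./solver_run0 &
mpirun -rf rankfile.1 -np 20 ./solver_run1 &
wait

The final wait keeps the batch job alive until every background mpirun has finished.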
[OMPI users] Question about algorithms for collective communication
Dear all,

I am trying to figure out the algorithms that are used for some collective communications (allreduce, bcast, alltoall). Is there any document explaining which algorithms are used? For example, I would like to know exactly how the allreduce operation is broken down into sends and receives.

Thanks a lot,
Best regards,
George Markomanolis
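One way to see what is available, following the same ompi_info --param pattern shown earlier for the TCP BTL (treat this as a hedged pointer, since parameter names and visibility vary between Open MPI versions):

ompi_info --param coll tuned

lists the parameters of the "tuned" collective component, including the coll_tuned_<collective>_algorithm selectors and, in their help strings, the algorithms each collective can use. In some versions these selectors are only shown when dynamic rules are enabled, for example:

ompi_info --mca coll_tuned_use_dynamic_rules 1 --param coll tuned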
[OMPI users] Using a specific algorithm for collective communication, and knowing the root CPU?
Dear all,

I would like to ask about collective communication. With debug mode enabled, I can see a lot of information during the execution about which algorithm is used, etc. But my question is that I would like to use a specific algorithm (the simplest one, I suppose). I am profiling some applications and I want to simulate them with another program, so I must be able to know, for example, what MPI_Allreduce is doing. I saw that many algorithms depend on the message size and the number of processes, so I would like to ask:

1) How can I tell Open MPI to use a simple algorithm for allreduce (is there any way to request the simplest algorithm for all collective communications)? Basically, I would like to know the root CPU for every collective communication. What are the disadvantages of requiring the simplest algorithm?

2) Is there any overhead from having installed Open MPI in debug mode, even if I just run a program without any --mca flags?

3) How would you describe allreduce in words? Can we say that the root CPU does a reduce and then a broadcast? Is that right for your implementation? I saw that which CPU is the root depends on the algorithm, so is it possible to use an algorithm where I know that the CPU with rank 0 is always the root?

Thanks a lot,
George
Re: [OMPI users] Using a specific algorithm for collective communication and knowing the root CPU?
Dear George,

Thanks for the answer; I have some further questions. Because I am using some programs for profiling: when you say that the cost of the allreduce rises, do you mean only the time, or also the flops of this command? Is there some additional work added to the allreduce, or is it only about time? During profiling I want to count flops, so if there is a small difference in timing because of debug mode and the choice of allreduce algorithm, it is not a big deal, but if it also changes the flops then that is bad for me. When I executed a program in debug mode I saw that Open MPI uses several algorithms, and I looked at your code and saw that rank 0 is not always the root CPU (if I understood correctly). Finally, do you have any opinion about the best way to know which algorithm is used for a collective communication and which CPU is the root of the communication?

Best regards,
George

Date: Tue, 3 Nov 2009 12:09:18 -0500
From: George Bosilca
Subject: Re: [OMPI users] Using a specific algorithm for collective communication, and knowing the root CPU?
To: Open MPI Users

You can add the following MCA parameters either on the command line or in the $(HOME)/.openmpi/mca-params.conf file.

On Nov 2, 2009, at 08:52, George Markomanolis wrote:
> 1) How can I tell Open MPI to use a simple algorithm for allreduce (is there any way to request the simplest algorithm for all collective communications)? Basically, I would like to know the root CPU for every collective communication. What are the disadvantages of requiring the simplest algorithm?

coll_tuned_use_dynamic_rules=1 to allow you to manually set the algorithms to be used.

coll_tuned_allreduce_algorithm=*something between 0 and 5* to describe the algorithm to be used. For the simplest algorithm I guess you will want to use 1 (star-based fan-in fan-out).

The main disadvantage is that the cost of the allreduce will rise, which will negatively impact the overall performance of the application.

> 2) Is there any overhead from having installed Open MPI in debug mode, even if I just run a program without any --mca flags?

There are many overheads when you compile in debug mode. We do a lot of extra tracking of internally allocated memory, checks on most/all internal objects, and so on. Based on previous results I would say your latency increases by about 2-3 microseconds, but the impact on the bandwidth is minimal.

> 3) How would you describe allreduce in words? Can we say that the root CPU does a reduce and then a broadcast? Is that right for your implementation? I saw that which CPU is the root depends on the algorithm, so is it possible to use an algorithm where I know that the CPU with rank 0 is always the root?

Exactly, allreduce = reduce + bcast (and btw this is what the basic algorithm will do).
However, there is no root in an allreduce, as all processes execute symmetric work. Of course, if one sees the allreduce as a reduce followed by a broadcast, then one has to select a root (I guess we pick rank 0 in our implementation).

george.
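For concreteness, the MCA parameters George Bosilca lists can also be passed on the mpirun command line (the executable name and process count are placeholders):

mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allreduce_algorithm 1 -np 4 ./app

And as a worked illustration of "allreduce = reduce + bcast" (a sketch written for this summary, not Open MPI's internal implementation), the following reduce-then-broadcast sequence is functionally equivalent to a single MPI_Allreduce, with rank 0 playing the role of the root:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;   /* some per-process value */

    /* "allreduce = reduce + bcast": reduce to rank 0, then broadcast the result */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* equivalent single call, in which no process is distinguished as a root:
       MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); */

    printf("rank %d: sum = %d\n", rank, total);
    MPI_Finalize();
    return 0;
}

In the combined MPI_Allreduce call, by contrast, the work is symmetric across processes, which is Bosilca's point about there being no root.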