Contributions are always welcome. :-)
http://www.open-mpi.org/community/contribute/
To be less glib: Open MPI represents the union of the interests of its
members. So far, we've *talked* internally about adding a spin-then-block
mechanism, but there's a non-trivial amount of work to make that happen.
Shared memory is the sticking point -- we have some good ideas how to make
it work, but no one's had the time / resources to do it. To be absolutely
clear: no one's debating the value of adding blocking progress (as long as
it's implemented in a way that absolutely does not affect the
performance-critical code path). It's just that so far, it has not been
important enough to any current member to add it (weighed against all the
other features that we're working on).
If you (or anyone) find blocking progress important, we'd love for you to
join Open MPI and contribute to the work necessary to make it happen.
On Apr 23, 2008, at 5:38 PM, Ingo Josopait wrote:
I can think of several advantages that using blocking or signals to
reduce the CPU load would have:
- Reduced energy consumption
- Running additional background programs could be done far more efficiently
- It would be much simpler to examine the load balance.
It may depend on the type of program and the computational environment,
but there are certainly many cases in which putting the system in idle
mode would be advantageous. This is especially true for programs with
low network traffic and/or high load imbalances.
The "spin for a while and then block" method that you mentioned
earlier
seems to be a good compromise. Just do polling for some time that is
long compared to the corresponding system call, and then go to sleep
if
nothing happens. In this way, the latency would be only marginally
increased, while less cpu time is wasted in the polling loops, and I
would be much happier.
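To make the idea concrete, here is a minimal sketch (purely illustrative --
the progress function, the event file descriptor, and the spin count are
all made up, not Open MPI internals):

/* Spin-then-block sketch: poll for a bounded number of iterations,
 * then fall back to a blocking wait in the kernel. */
#include <poll.h>

#define SPIN_ITERATIONS 100000        /* hypothetical tuning knob */

/* try_progress() and event_fd stand in for an MPI library's internal
 * progress function and a descriptor that becomes readable when
 * network traffic arrives. */
static void wait_for_message(int event_fd, int (*try_progress)(void))
{
  int i;

  /* Phase 1: busy-poll, so messages that arrive "soon" still see the
   * usual low latency. */
  for (i = 0; i < SPIN_ITERATIONS; ++i)
    if (try_progress())
      return;

  /* Phase 2: nothing arrived, so block in the kernel until the fd
   * signals activity instead of burning CPU. */
  for (;;) {
    struct pollfd pfd = { event_fd, POLLIN, 0 };
    poll(&pfd, 1, -1);                /* sleep until something happens */
    if (try_progress())
      return;
  }
}

Only messages that arrive after the spin phase has given up would pay the
extra latency of the blocking wait.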
Jeff Squyres wrote:
On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
Do you really mean that Open MPI uses a busy loop in order to handle
incoming calls? That seems wrong, since spinning is a very bad and
inefficient technique for this purpose.
It depends on what you're optimizing for. :-) We're optimizing for
minimum message passing latency on hosts that are not oversubscribed;
polling is very good at that. Polling is much better than blocking,
particularly if the blocking involves a system call (which will be
"slow"). Note that in a compute-heavy environment, they nodes are
going to be running at 100% CPU anyway.
Also keep in mind that you're only going to have "wasted" spinning in
MPI if you have a loosely/poorly synchronized application. Granted,
some applications are this way by nature, but we have not chosen to
optimize spare CPU cycles for them. As I said in a prior mail, adding
a blocking strategy is on the to-do list, but it's fairly low priority
right now. Someone may care enough to improve the message passing
engine to include blocking, but it hasn't happened yet. Want to work
on it? :-)
And for reference: almost all MPIs do busy polling to minimize latency.
Some of them will shift to blocking if nothing happens for a "long"
time. This second piece is what OMPI is lacking.
Why don't you use blocking and/or signals instead of that?
FWIW: I mentioned this in my other mail -- latency is quite definitely
negatively impacted when you use such mechanisms. Blocking and signals
are "slow" (in comparison to polling).
I think the priority of this task is very high, because polling just
wastes system resources.
In production HPC environments, the entire resource is dedicated to
the MPI app anyway, so there's nothing else that really needs it. So
we allow them to busy-spin.
There is a mode to call yield() in the middle of every OMPI progress
loop, but it's only helpful for loosely/poorly synchronized MPI apps
and ones that use TCP or shared memory. Low-latency networks such as
IB or Myrinet won't be as friendly to this setting because they're
busy polling (i.e., they call yield() much less frequently, if at all).
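To illustrate, that mode amounts to something like the following sketch
(the function names are placeholders, not actual Open MPI internals):

#include <sched.h>

/* Yield-when-idle sketch: if a full progress pass finds nothing to do,
 * give the CPU away so other processes on an oversubscribed node can
 * run.  This costs latency but helps TCP / shared memory cases. */
static void progress_until_done(int (*progress_all_transports)(void),
                                int (*request_complete)(void))
{
  while (!request_complete()) {
    if (progress_all_transports() == 0)
      sched_yield();
  }
}

(If I remember right, this is the behavior controlled by the
mpi_yield_when_idle MCA parameter.)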
On the other hand, what Alberto reports does not seem right to me.
Alberto,
- Are you oversubscribing one node, i.e., running your code on a
single-processor machine while pretending to have four CPUs?
- Did you compile Open MPI yourself, or did you install it from an RPM?
The receiving process shouldn't be that expensive.
Regards,
Danesh
Jeff Squyres wrote:
Because on-node communication typically uses shared memory, we
currently have to poll. Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory and
polling the network.
Additionally, we actively poll because it's the best way to lower
latency. MPI implementations are almost always judged first on their
latency, not [usually] on their CPU utilization. Going to sleep in a
blocking system call will definitely negatively impact latency.
We have plans for implementing the "spin for a while and then block"
technique (as has been used in other MPIs and middleware layers), but
it hasn't been a high priority.
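As a purely conceptual sketch of why that's awkward (placeholder
functions, not Open MPI code): a single progress pass has to visit both
transports, and the shared-memory queues typically have no file
descriptor you could hand to select()/poll(), so neither side of the
loop can simply sleep:

/* One progress pass over both transports.  Blocking in either poll
 * would starve the other, and the shared-memory queues generally have
 * no fd to block on anyway. */
static int progress_one_pass(int (*poll_shared_memory)(void),
                             int (*poll_network)(void))
{
  int completed = 0;
  completed += poll_shared_memory();  /* on-node messages  */
  completed += poll_network();        /* off-node messages */
  return completed;                   /* completions found this pass */
}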
On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
Thanks Torje. I wonder what the benefit is of looping on the incoming
message-queue socket rather than blocking in system I/O calls like
read() or select().
On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
Hi Alberto,
The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they check their
incoming message queues in a loop.
So the MPI_Recv() operation is blocking, but that doesn't mean the
processes are blocked by the OS scheduler.
I hope that made some sense :)
Best regards,
Torje
On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
I have a simple MPI program that sends data to processor rank 0. The
communication works well, but when I run the program on more than 2
processors (-np 4), the extra receivers waiting for data run at over
90% CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);

int main(int argc, char* argv[])
{
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Processor %d (%d) initialized\n", rank, getpid());

  /* Rank 1 sends one message to rank 0; every other rank waits in
     MPI_Recv(), including ranks that will never be sent anything. */
  if( rank == 1 )
    process_sender(rank);
  else
    process_receiver(rank);

  MPI_Finalize();
}

void process_sender(int rank)
{
  int i, size;
  float data[100];

  printf("Processor %d initializing data...\n", rank);
  for( i = 0; i < 100; ++i )
    data[i] = i;

  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Processor %d sending data...\n", rank);
  MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
  printf("Processor %d sent data\n", rank);
}

void process_receiver(int rank)
{
  int count;
  float value[200];
  MPI_Status status;

  printf("Processor %d waiting for data...\n", rank);
  MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
           MPI_COMM_WORLD, &status);
  printf("Processor %d Got data from processor %d\n", rank,
         status.MPI_SOURCE);
  MPI_Get_count(&status, MPI_FLOAT, &count);
  printf("Processor %d, Got %d elements\n", rank, count);
}
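For what it's worth, one application-level workaround is to approximate
blocking yourself: probe for the message with MPI_Iprobe() and sleep
briefly between probes. This is only a sketch (the 1 ms interval is an
arbitrary choice), it adds latency, and it is not what Open MPI does
internally:

#include <stdio.h>
#include <time.h>
#include <mpi.h>

/* Alternative receiver that trades latency for CPU time: poll with
 * MPI_Iprobe() and sleep between probes instead of spinning inside
 * MPI_Recv(). */
void process_receiver_lowcpu(int rank)
{
  int count, flag = 0;
  float value[200];
  MPI_Status status;
  struct timespec pause = { 0, 1000000 };   /* 1 ms between probes */

  printf("Processor %d waiting for data...\n", rank);
  while( !flag ) {
    MPI_Iprobe(MPI_ANY_SOURCE, 55, MPI_COMM_WORLD, &flag, &status);
    if( !flag )
      nanosleep(&pause, NULL);              /* give up the CPU */
  }
  MPI_Recv(value, 200, MPI_FLOAT, status.MPI_SOURCE, 55,
           MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_FLOAT, &count);
  printf("Processor %d got %d elements from processor %d\n",
         rank, count, status.MPI_SOURCE);
}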
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems