[OMPI users] (no subject)

2008-04-23 Thread Alberto Giannetti
I have a simple MPI program that sends data to processor rank 0. The
communication works well, but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run at > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?




#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Processor %d (%d) initialized\n", rank, getpid());

  if( rank == 1 )
    process_sender(rank);
  else
    process_receiver(rank);

  MPI_Finalize();
}


void process_sender(int rank)
{
  int i, j, size;
  float data[100];
  MPI_Status status;

  printf("Processor %d initializing data...\n", rank);
  for( i = 0; i < 100; ++i )
data[i] = i;

  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Processor %d sending data...\n", rank);
  MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
  printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
  int count;
  float value[200];
  MPI_Status status;

  printf("Processor %d waiting for data...\n", rank);
  MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
           MPI_COMM_WORLD, &status);
  printf("Processor %d Got data from processor %d\n", rank,
         status.MPI_SOURCE);

  MPI_Get_count(&status, MPI_FLOAT, &count);
  printf("Processor %d, Got %d elements\n", rank, count);
}



Re: [OMPI users] (no subject)

2008-04-23 Thread Torje Henriksen

Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have  
anything better to do (waiting for that message), they will check  
their incoming message-queues in a loop.


So the MPI_Recv()-operation is blocking, but it doesn't mean that the  
processes are blocked by the OS scheduler.
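
To see the same behavior at user level, here is a small self-contained
program that busy-polls with MPI_Iprobe -- only a sketch of the idea, not
what Open MPI's internals literally look like:

/* spin_recv.c - user-level spin-wait on an incoming message.  This mimics
   what a polling MPI_Recv() does internally: keep probing the incoming
   queues instead of sleeping in the kernel, so the CPU stays busy. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
  int rank, flag = 0;
  float buf[100] = {0};
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if( rank == 1 ) {
    MPI_Send(buf, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
  } else if( rank == 0 ) {
    /* Busy loop: each MPI_Iprobe() polls and returns immediately,
       so this rank runs at ~100% CPU until the message shows up. */
    while( !flag )
      MPI_Iprobe(MPI_ANY_SOURCE, 55, MPI_COMM_WORLD, &flag, &status);
    MPI_Recv(buf, 100, MPI_FLOAT, status.MPI_SOURCE, 55,
             MPI_COMM_WORLD, &status);
  }

  MPI_Finalize();
  return 0;
}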



I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:


I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Processor %d (%d) initialized\n", rank, getpid());

  if( rank == 1 )
process_sender(rank);
  else
process_receiver(rank);

  MPI_Finalize();
}


void process_sender(int rank)
{
  int i, j, size;
  float data[100];
  MPI_Status status;

  printf("Processor %d initializing data...\n", rank);
  for( i = 0; i < 100; ++i )
data[i] = i;

  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Processor %d sending data...\n", rank);
  MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
  printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
  int count;
  float value[200];
  MPI_Status status;

  printf("Processor %d waiting for data...\n", rank);
  MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
  printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
  MPI_Get_count(&status, MPI_FLOAT, &count);
  printf("Processor %d, Got %d elements\n", rank, count);
}





[OMPI users] idle calls?

2008-04-23 Thread Ingo Josopait
I noticed that the CPU usage of an MPI program is always at 100 percent,
even if the tasks are doing nothing but waiting for new data to arrive. Is
there an option to change this behavior, so that the tasks sleep until
new data arrive?

Why is this the default behavior, anyway? Is it really so costly to put
a task to sleep when it is idle?


Re: [OMPI users] (no subject)

2008-04-23 Thread Alberto Giannetti
Thanks Torje. I wonder what the benefit is of looping on the incoming
message-queue socket rather than using blocking system I/O calls like
read() or select().
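
As an illustration of that alternative, here is a minimal, self-contained
select() sketch (wait_readable is a made-up helper name, and this is not
how Open MPI's progress engine is actually structured); the process sleeps
in the kernel and uses essentially no CPU while it waits:

/* select_wait.c - block in the kernel until a descriptor is readable;
   the waiting process is taken off the run queue and uses ~0% CPU. */
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

/* Block until fd has data to read (or an error occurs). */
static int wait_readable(int fd)
{
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(fd, &readfds);
  return select(fd + 1, &readfds, NULL, NULL, NULL); /* NULL timeout: wait forever */
}

int main(void)
{
  printf("waiting for data on stdin...\n");
  if( wait_readable(STDIN_FILENO) > 0 )
    printf("stdin is readable\n");
  return 0;
}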


On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:

Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:


I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Processor %d (%d) initialized\n", rank, getpid());

  if( rank == 1 )
process_sender(rank);
  else
process_receiver(rank);

  MPI_Finalize();
}


void process_sender(int rank)
{
  int i, j, size;
  float data[100];
  MPI_Status status;

  printf("Processor %d initializing data...\n", rank);
  for( i = 0; i < 100; ++i )
data[i] = i;

  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Processor %d sending data...\n", rank);
  MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
  printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
  int count;
  float value[200];
  MPI_Status status;

  printf("Processor %d waiting for data...\n", rank);
  MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
  printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
  MPI_Get_count(&status, MPI_FLOAT, &count);
  printf("Processor %d, Got %d elements\n", rank, count);
}







Re: [OMPI users] (no subject)

2008-04-23 Thread Jeff Squyres
Because on-node communication typically uses shared memory, so we  
currently have to poll.  Additionally, when using mixed on/off-node  
communication, we have to alternate between polling shared memory and  
polling the network.


Additionally, we actively poll because it's the best way to lower  
latency.  MPI implementations are almost always first judged on their  
latency, not [usually] their CPU utilization.  Going to sleep in a  
blocking system call will definitely negatively impact latency.


We have plans for implementing the "spin for a while and then block"  
technique (as has been used in other MPI's and middleware layers), but  
it hasn't been a high priority.
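
Roughly, that technique looks like the following -- a self-contained sketch
demonstrated on a plain pipe rather than on MPI internals (the real thing
has to juggle shared memory and network transports at the same time, and
the SPIN_LIMIT value here is arbitrary):

/* spin_then_block.c - sketch of "spin for a while, then block". */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>

#define SPIN_LIMIT 1000000L   /* polls before falling back to blocking */

/* Returns 1 once at least one byte is available on fd, spinning first. */
static int spin_then_block(int fd)
{
  long spins;
  char c;
  fd_set readfds;

  fcntl(fd, F_SETFL, O_NONBLOCK);

  /* Phase 1: busy-poll, keeping latency minimal when data arrives quickly. */
  for( spins = 0; spins < SPIN_LIMIT; spins++ )
    if( read(fd, &c, 1) == 1 )
      return 1;

  /* Phase 2: nothing yet -- stop burning CPU and sleep in the kernel. */
  FD_ZERO(&readfds);
  FD_SET(fd, &readfds);
  select(fd + 1, &readfds, NULL, NULL, NULL);
  return read(fd, &c, 1) == 1;
}

int main(void)
{
  int p[2];
  pipe(p);
  write(p[1], "x", 1);              /* arrives during the spin phase */
  printf("got data: %d\n", spin_then_block(p[0]));
  return 0;
}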



On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:


Thanks Torje. I wonder what is the benefit of looping on the incoming
message-queue socket rather than using system I/O signals, like read
() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:

Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:


I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
 int rank;

 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

 printf("Processor %d (%d) initialized\n", rank, getpid());

 if( rank == 1 )
   process_sender(rank);
 else
   process_receiver(rank);

 MPI_Finalize();
}


void process_sender(int rank)
{
 int i, j, size;
 float data[100];
 MPI_Status status;

 printf("Processor %d initializing data...\n", rank);
 for( i = 0; i < 100; ++i )
   data[i] = i;

 MPI_Comm_size(MPI_COMM_WORLD, &size);

 printf("Processor %d sending data...\n", rank);
 MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
 printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
 int count;
 float value[200];
 MPI_Status status;

 printf("Processor %d waiting for data...\n", rank);
 MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
 printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
 MPI_Get_count(&status, MPI_FLOAT, &count);
 printf("Processor %d, Got %d elements\n", rank, count);
}








--
Jeff Squyres
Cisco Systems



Re: [OMPI users] idle calls?

2008-04-23 Thread Jeff Squyres

Please see another ongoing thread on this list about this exact topic:

http://www.open-mpi.org/community/lists/users/2008/04/5457.php

It unfortunately has a subject of "(no subject)", so it's not obvious  
that this is what the thread is about.



On Apr 23, 2008, at 12:14 PM, Ingo Josopait wrote:

I noticed that the CPU usage of an MPI program is always at 100 percent,
even if the tasks are doing nothing but waiting for new data to arrive. Is
there an option to change this behavior, so that the tasks sleep until
new data arrive?

Why is this the default behavior, anyway? Is it really so costly to put
a task to sleep when it is idle?




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] (no subject)

2008-04-23 Thread Alberto Giannetti
I am running the test program on Darwin 8.11.1, 1.83 GHz Intel dual
core. My Open MPI install is 1.2.4.
I can't see any allocated shared memory segment on my system (ipcs -m),
although the receiver opens a couple of TCP sockets in listening mode.
It looks like my implementation does not use shared memory. Is this a
configuration issue?


a.out   5628 albertogiannetti   3u  unix   R,W,NB  0x380b198  0t0  ->0x41ced48
a.out   5628 albertogiannetti   4u  unix      R,W  0x41ced48  0t0  ->0x380b198
a.out   5628 albertogiannetti   5u  IPv4   R,W,NB  0x3d4d920  0t0  TCP *:50969 (LISTEN)
a.out   5628 albertogiannetti   6u  IPv4   R,W,NB  0x3e62394  0t0  TCP 192.168.0.10:50970->192.168.0.10:50962 (ESTABLISHED)
a.out   5628 albertogiannetti   7u  IPv4   R,W,NB  0x422d228  0t0  TCP *:50973 (LISTEN)
a.out   5628 albertogiannetti   8u  IPv4   R,W,NB  0x2dfd394  0t0  TCP 192.168.0.10:50969->192.168.0.10:50975 (ESTABLISHED)



On Apr 23, 2008, at 12:34 PM, Jeff Squyres wrote:

Because on-node communication typically uses shared memory, so we
currently have to poll.  Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory and
polling the network.

Additionally, we actively poll because it's the best way to lower
latency.  MPI implementations are almost always first judged on their
latency, not [usually] their CPU utilization.  Going to sleep in a
blocking system call will definitely negatively impact latency.

We have plans for implementing the "spin for a while and then block"
technique (as has been used in other MPI's and middleware layers), but
it hasn't been a high priority.


On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:


Thanks Torje. I wonder what is the benefit of looping on the incoming
message-queue socket rather than using system I/O signals, like read
() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:

Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:


I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
 int rank;

 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

 printf("Processor %d (%d) initialized\n", rank, getpid());

 if( rank == 1 )
   process_sender(rank);
 else
   process_receiver(rank);

 MPI_Finalize();
}


void process_sender(int rank)
{
 int i, j, size;
 float data[100];
 MPI_Status status;

 printf("Processor %d initializing data...\n", rank);
 for( i = 0; i < 100; ++i )
   data[i] = i;

 MPI_Comm_size(MPI_COMM_WORLD, &size);

 printf("Processor %d sending data...\n", rank);
 MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
 printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
 int count;
 float value[200];
 MPI_Status status;

 printf("Processor %d waiting for data...\n", rank);
 MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
 printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
 MPI_Get_count(&status, MPI_FLOAT, &count);
 printf("Processor %d, Got %d elements\n", rank, count);
}








--
Jeff Squyres
Cisco Systems





Re: [OMPI users] (no subject)

2008-04-23 Thread Jeff Squyres

OMPI doesn't use SYSV shared memory; it uses mmapped files.

ompi_info will tell you all about the components installed.  If you  
see a BTL component named "sm", then shared memory support is  
installed.  I do not believe that we conditionally install sm on Linux  
or OS X systems -- it should always be installed.


ompi_info | grep btl
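
(As a usage sketch -- ./a.out standing in for the test program -- you can
also restrict a run to specific BTL components to see whether anything
changes:

  mpirun --mca btl self,sm,tcp -np 4 ./a.out

The self component is required so that a process can send to itself.)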



On Apr 23, 2008, at 2:55 PM, Alberto Giannetti wrote:


I am running the test program on Darwin 8.11.1, 1.83 Ghz Intel dual
core. My Open MPI install is 1.2.4.
I can't see any allocated shared memory segment on my system (ipcs -m),
although the receiver opens a couple of TCP sockets in listening
mode. It looks like my implementation does not use shared memory. Is
this a configuration issue?


a.out   5628 albertogiannetti   3u  unix   R,W,NB  0x380b198  0t0  ->0x41ced48
a.out   5628 albertogiannetti   4u  unix      R,W  0x41ced48  0t0  ->0x380b198
a.out   5628 albertogiannetti   5u  IPv4   R,W,NB  0x3d4d920  0t0  TCP *:50969 (LISTEN)
a.out   5628 albertogiannetti   6u  IPv4   R,W,NB  0x3e62394  0t0  TCP 192.168.0.10:50970->192.168.0.10:50962 (ESTABLISHED)
a.out   5628 albertogiannetti   7u  IPv4   R,W,NB  0x422d228  0t0  TCP *:50973 (LISTEN)
a.out   5628 albertogiannetti   8u  IPv4   R,W,NB  0x2dfd394  0t0  TCP 192.168.0.10:50969->192.168.0.10:50975 (ESTABLISHED)



On Apr 23, 2008, at 12:34 PM, Jeff Squyres wrote:

Because on-node communication typically uses shared memory, so we
currently have to poll.  Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory and
polling the network.

Additionally, we actively poll because it's the best way to lower
latency.  MPI implementations are almost always first judged on their
latency, not [usually] their CPU utilization.  Going to sleep in a
blocking system call will definitely negatively impact latency.

We have plans for implementing the "spin for a while and then block"
technique (as has been used in other MPI's and middleware layers), but
it hasn't been a high priority.


On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:

Thanks Torje. I wonder what is the benefit of looping on the incoming
message-queue socket rather than using system I/O signals, like read
() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:

Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't
have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that
the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:


I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on >
90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
int rank;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

printf("Processor %d (%d) initialized\n", rank, getpid());

if( rank == 1 )
  process_sender(rank);
else
  process_receiver(rank);

MPI_Finalize();
}


void process_sender(int rank)
{
int i, j, size;
float data[100];
MPI_Status status;

printf("Processor %d initializing data...\n", rank);
for( i = 0; i < 100; ++i )
  data[i] = i;

MPI_Comm_size(MPI_COMM_WORLD, &size);

printf("Processor %d sending data...\n", rank);
MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
int count;
float value[200];
MPI_Status status;

printf("Processor %d waiting for data...\n", rank);
MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
MPI_Get_count(&status, MPI_FLOAT, &count);
printf("Processor %d, Got %d elements\n", rank, count);
}








--
Jeff Squyres
Cisco Systems




[OMPI users] Processor affinitiy

2008-04-23 Thread Alberto Giannetti
I would like to bind one of my MPI processes to a single core on my
iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1.
It looks like processor affinity is not supported for this kind of
configuration:



$ ompi_info|grep affinity
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.4)



Open MPI: 1.2.4
   Open MPI SVN revision: r16187
Open RTE: 1.2.4
   Open RTE SVN revision: r16187
OPAL: 1.2.4
   OPAL SVN revision: r16187
  Prefix: /usr/local/
 Configured architecture: i386-apple-darwin8.10.0
   Configured by: brbarret
   Configured on: Tue Oct 16 12:03:41 EDT 2007
  Configure host: ford.osl.iu.edu
Built by: brbarret
Built on: Tue Oct 16 12:15:57 EDT 2007
  Built host: ford.osl.iu.edu


Can I get processor affinity by recompiling the distribution, maybe with
a different option?


Re: [OMPI users] (no subject)

2008-04-23 Thread Danesh Daroui
Do you really mean that Open-MPI uses a busy loop to handle incoming
calls? That seems wrong, since spinning is a very inefficient technique
for this purpose. Why don't you use blocking and/or signals instead?
I think the priority of this task is very high, because polling just
wastes the system's resources. On the other hand, what Alberto reports
does not seem reasonable to me.

Alberto,
- Are you oversubscribing one node which means that you are running your
code on a single processor machine, pretending
to have four CPUs?

- Did you compile Open-MPI yourself, or install it from an RPM?

A receiving process shouldn't be that expensive.

Regards,

Danesh



Jeff Squyres skrev:
Because on-node communication typically uses shared memory, so we  
currently have to poll.  Additionally, when using mixed on/off-node  
communication, we have to alternate between polling shared memory and  
polling the network.


Additionally, we actively poll because it's the best way to lower  
latency.  MPI implementations are almost always first judged on their  
latency, not [usually] their CPU utilization.  Going to sleep in a  
blocking system call will definitely negatively impact latency.


We have plans for implementing the "spin for a while and then block"  
technique (as has been used in other MPI's and middleware layers), but  
it hasn't been a high priority.



On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:

  

Thanks Torje. I wonder what is the benefit of looping on the incoming
message-queue socket rather than using system I/O signals, like read
() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:


Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:

  

I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
 int rank;

 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

 printf("Processor %d (%d) initialized\n", rank, getpid());

 if( rank == 1 )
   process_sender(rank);
 else
   process_receiver(rank);

 MPI_Finalize();
}


void process_sender(int rank)
{
 int i, j, size;
 float data[100];
 MPI_Status status;

 printf("Processor %d initializing data...\n", rank);
 for( i = 0; i < 100; ++i )
   data[i] = i;

 MPI_Comm_size(MPI_COMM_WORLD, &size);

 printf("Processor %d sending data...\n", rank);
 MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
 printf("Processor %d sent data\n", rank);
}


void process_receiver(int rank)
{
 int count;
 float value[200];
 MPI_Status status;

 printf("Processor %d waiting for data...\n", rank);
 MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
MPI_COMM_WORLD, &status);
 printf("Processor %d Got data from processor %d\n", rank,
status.MPI_SOURCE);
 MPI_Get_count(&status, MPI_FLOAT, &count);
 printf("Processor %d, Got %d elements\n", rank, count);
}



  





  


[OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-23 Thread Sharon Brunett

Hello,
I'm using openmpi-1.3a1r18241 on a 2-node configuration and having trouble
with ompi-restart.  I can successfully ompi-checkpoint and ompi-restart a
1-way MPI code.
When I try a 2-way job running across 2 nodes, I get

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of 
(/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159]   Exec in self
Restart failed: Permission denied
Restart failed: Permission denied 




If I try running as root, using the same snapshot file, the code restarts ok,
but both tasks end up on the same node, rather than one per node (like the
original mpirun).

I'm using BLCR version 0.6.5.
I generate checkpoints via 'ompi-checkpoint pid'
where pid is the pid of the mpirun task below

mpirun -np 2 -am ft-enable-cr ./xhpl


Thanks very much for any hints you can give on how to resolve either of these 
problems.


Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-23 Thread Josh Hursey


On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:


Hello,
I'm using openmpi-1.3a1r18241 on a 2 node configuration and having  
troubles with the ompi-restart.  I can successfully ompi-checkpoint  
and ompi-restart a 1 way mpi code.

When I try a 2 way job running across 2 nodes, I get

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159]   Exec in self
Restart failed: Permission denied
Restart failed: Permission denied



This error is coming from BLCR. A few things to check.

First take a look at /var/log/messages on the machine(s) you are  
trying to restart on. Per:

 http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

Next check to make sure prelinking is turned off on the two machines  
you are using. Per:

 http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

Those will rule out some common BLCR problems. (more below)




If I try running as root, using the same snapshot file, the code  
restarts ok, but both tasks and up on the same node, rather than one  
per node (like the original mpirun).


You should never have to run as root to restart a process (or to run  
Open MPI in any form). So I'm wondering if your user has permissions  
to access the checkpoint files that BLCR is generating. You can look  
at the permissions for the individual checkpoint files by looking into  
the checkpoint handler directory. They are a bit hidden, so something  
like the following should expose them:

---
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
total 1756
drwx--  2 sharon users4096 Apr 23 16:29 .
drwx--  4 sharon users4096 Apr 23 16:29 ..
-rw---  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
-rw-r--r--  1 sharon users  35 Apr 23 16:29 snapshot_meta.data
shell$
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
total 1756
drwx--  2 sharon users4096 Apr 23 16:29 .
drwx--  4 sharon users4096 Apr 23 16:29 ..
-rw---  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
-rw-r--r--  1 sharon users  35 Apr 23 16:29 snapshot_meta.data
---

The BLCR-generated context files are "ompi_blcr_context.PID", and you
need to check that you have sufficient permissions to access those
files (something like the above).





I'm using BLCR version 0.6.5.
I generate checkpoints via 'ompi-checkpoint pid'
where pid is the pid of the mpirun task below

mpirun -np 2 -am ft-enable-cr ./xhpl



Are you running in a managed environment (e.g., using Torque or
Slurm)? Odds are that once you switched to root you lost the environment
variables for your allocation (which is how Open MPI detects when to use
an allocation). This would explain why the processes were restarted on
one node instead of two.


ompi-restart uses mpirun underneath to do the process launch in
exactly the same way as a normal mpirun, so the mapping of processes
should be the same. That said, there is a bug that I'm tracking
in which they are not. This bug has nothing to do with restarting
processes, and more to do with a bookkeeping error when using app files.





Thanks very much for any hints you can give on how to resolve either  
of these problems.


Let me know if this helps solve the problem. If not we might be able  
to try some other things.


Cheers,
Josh







Re: [OMPI users] (no subject)

2008-04-23 Thread Jeff Squyres

On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:


Do you really mean that Open-MPI uses busy loop in order to handle
incomming calls? It seems to be incorrect since
spinning is a very bad and inefficient technique for this purpose.


It depends on what you're optimizing for.  :-)  We're optimizing for  
minimum message passing latency on hosts that are not oversubscribed;  
polling is very good at that.  Polling is much better than blocking,  
particularly if the blocking involves a system call (which will be  
"slow").  Note that in a compute-heavy environment, they nodes are  
going to be running at 100% CPU anyway.


Also keep in mind that you're only going to have "waste" spinning in  
MPI if you have a loosely/poorly synchronized application.  Granted,  
some applications are this way by nature, but we have not chosen to  
optimize spare CPU cycles for them.  As I said in a prior mail, adding  
a blocking strategy is on the to-do list, but it's fairly low in  
priority right now.  Someone may care / improve the message passing  
engine to include blocking, but it hasn't happened yet.  Want to work  
on it?  :-)


And for reference: almost all MPI's do busy polling to minimize  
latency.  Some of them will shift to blocking if nothing happens for a  
"long" time.  This second piece is what OMPI is lacking.



Why
don't you use blocking and/or signals instead of
that?


FWIW: I mentioned this in my other mail -- latency is quite definitely  
negatively impacted when you use such mechanisms.  Blocking and  
signals are "slow" (in comparison to polling).



I think the priority of this task is very high because polling
just wastes resources of the system.


In production HPC environments, the entire resource is dedicated to  
the MPI app anyway, so there's nothing else that really needs it.  So  
we allow them to busy-spin.


There is a mode to call yield() in the middle of every OMPI progress  
loop, but it's only helpful for loosely/poorly synchronized MPI apps  
and ones that use TCP or shared memory.  Low latency networks such as  
IB or Myrinet won't be as friendly to this setting because they're  
busy polling (i.e., they call yield() much less frequently, if at all).
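
(If memory serves, that mode is enabled with the mpi_yield_when_idle MCA
parameter, e.g., with ./a.out standing in for the application:

  mpirun --mca mpi_yield_when_idle 1 -np 4 ./a.out

It only makes the spin loop yield the CPU more often; it does not turn
the polling into true blocking.)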



On the other hand,
what Alberto claims is not reasonable to me.

Alberto,
- Are you oversubscribing one node which means that you are running your
code on a single processor machine, pretending
to have four CPUs?

- Did you compile Open-MPI or installed from RPM?

Receiving process shouldn't be that expensive.

Regards,

Danesh



Jeff Squyres skrev:

Because on-node communication typically uses shared memory, so we
currently have to poll.  Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory and
polling the network.

Additionally, we actively poll because it's the best way to lower
latency.  MPI implementations are almost always first judged on their
latency, not [usually] their CPU utilization.  Going to sleep in a
blocking system call will definitely negatively impact latency.

We have plans for implementing the "spin for a while and then block"
technique (as has been used in other MPI's and middleware layers), but
it hasn't been a high priority.


On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:


Thanks Torje. I wonder what is the benefit of looping on the incoming
message-queue socket rather than using system I/O signals, like read
() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:


Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have
anything better to do (waiting for that message), they will check
their incoming message-queues in a loop.

So the MPI_Recv()-operation is blocking, but it doesn't mean that the
processes are blocked by the OS scheduler.


I hope that made some sense :)


Best regards,

Torje


On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:



I have simple MPI program that sends data to processor rank 0. The
communication works well but when I run the program on more than 2
processors (-np 4) the extra receivers waiting for data run on > 90%
CPU load. I understand MPI_Recv() is a blocking operation, but why
does it consume so much CPU compared to a regular system read()?



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);


int main(int argc, char* argv[])
{
int rank;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

printf("Processor %d (%d) initialized\n", rank, getpid());

if( rank == 1 )
  process_sender(rank);
else
  process_receiver(rank);

MPI_Finalize();
}


void process_sender(int rank)
{
int i, j, size;
float data[100];
MPI_Status status;

printf("Processor %d initializing data...\n", rank);
for( i = 0; i < 100; ++i )
  data[i] = i;

MPI_Comm_size(MPI_COMM_WORLD, &size);

printf("Processor %d sending data...\n", rank);
MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
printf("Processor %d se

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-23 Thread Sharon Brunett

Josh Hursey wrote:

On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:


Hello,
I'm using openmpi-1.3a1r18241 on a 2 node configuration and having  
troubles with the ompi-restart.  I can successfully ompi-checkpoint  
and ompi-restart a 1 way mpi code.

When I try a 2 way job running across 2 nodes, I get

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159]   Exec in self
Restart failed: Permission denied
Restart failed: Permission denied



This error is coming from BLCR. A few things to check.

First take a look at /var/log/messages on the machine(s) you are  
trying to restart on. Per:

  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

Next check to make sure prelinking is turned off on the two machines  
you are using. Per:

  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

Those will rule out some common BLCR problems. (more below)



If I try running as root, using the same snapshot file, the code  
restarts ok, but both tasks and up on the same node, rather than one  
per node (like the original mpirun).


You should never have to run as root to restart a process (or to run  
Open MPI in any form). So I'm wondering if your user has permissions  
to access the checkpoint files that BLCR is generating. You can look  
at the permissions for the individual checkpoint files by looking into  
the checkpoint handler directory. They are a bit hidden, so something  
like the following should expose them:

---
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
total 1756
drwx--  2 sharon users4096 Apr 23 16:29 .
drwx--  4 sharon users4096 Apr 23 16:29 ..
-rw---  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
-rw-r--r--  1 sharon users  35 Apr 23 16:29 snapshot_meta.data
shell$
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
total 1756
drwx--  2 sharon users4096 Apr 23 16:29 .
drwx--  4 sharon users4096 Apr 23 16:29 ..
-rw---  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
-rw-r--r--  1 sharon users  35 Apr 23 16:29 snapshot_meta.data
---

The BLCR generated context files are "ompi_blcr_context.PID", and you  
need to check to make sure that you have sufficient permissions to  
access to those files (something like above).




I'm using BLCR version 0.6.5.
I generate checkpoints via 'ompi-checkpoint pid'
where pid is the pid of the mpirun task below

mpirun -np 2 -am ft-enable-cr ./xhpl



Are you running in a managed environment (e.g., using Torque or  
Slurm)? Odds are once you switched to root you lost your environmental  
symbols for your allocation (which is how Open MPI detects when to use  
an allocation). This would explain why the processes were restarted on  
one node instead of two.


ompi-restart uses mpirun underneath to do the process launch in  
exactly the same way the normal mpirun. So the mapping of processes  
should be the same. That being said there is a bug that I'm tracking  
in which they are not. This bug has nothing to do with restarting  
processes, and more with a bookkeeping error when using app files.



Thanks very much for any hints you can give on how to resolve either  
of these problems.


Let me know if this helps solve the problem. If not we might be able  
to try some other things.


Cheers,
Josh






Thanks much..
vmadump: open('/var/run/nscd/passwd', 0x0) failed: -13
vmadump: mmap failed: /var/run/nscd/passwd

is indeed the problem, as shown by dmesg.


Re: [OMPI users] Processor affinitiy

2008-04-23 Thread Jeff Squyres

On Apr 23, 2008, at 3:01 PM, Alberto Giannetti wrote:


I would like to run one of my MPI processors to a single core on my
iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1.
It looks like processor affinity is not supported for this kind of
configuration:


I'm afraid that OS X doesn't have an API for processor affinity (so  
there's no API for OMPI to call).  :-(


You might want to file a feature enhancement with Apple to have them  
add one.  :-)
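
(For contrast, the kind of call that affinity support relies on under
Linux looks roughly like the sketch below; Darwin/Tiger exposed nothing
comparable at the time. It requires _GNU_SOURCE and a Linux kernel, so it
is illustrative only:

/* affinity_linux.c - pin the calling process to core 0 (Linux only). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);                       /* allow only core 0 */
  if( sched_setaffinity(0, sizeof(set), &set) != 0 )  /* pid 0 = self */
    perror("sched_setaffinity");
  else
    printf("pinned to core 0\n");
  return 0;
}

On Linux, if memory serves, Open MPI 1.2 can also be asked to bind
processes itself via the mpi_paffinity_alone MCA parameter.)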


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] (no subject)

2008-04-23 Thread Ingo Josopait
I can think of several advantages that using blocking or signals to
reduce the cpu load would have:

- Reduced energy consumption
- Running additional background programs could be done far more efficiently
- It would be much simpler to examine the load balance.

It may depend on the type of program and the computational environment,
but there are certainly many cases in which putting the system in idle
mode would be advantageous. This is especially true for programs with
low network traffic and/or high load imbalances.

The "spin for a while and then block" method that you mentioned earlier
seems to be a good compromise. Just do polling for some time that is
long compared to the corresponding system call, and then go to sleep if
nothing happens. In this way, the latency would be only marginally
increased, while less cpu time is wasted in the polling loops, and I
would be much happier.





Jeff Squyres schrieb:
> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
> 
>> Do you really mean that Open-MPI uses busy loop in order to handle
>> incomming calls? It seems to be incorrect since
>> spinning is a very bad and inefficient technique for this purpose.
> 
> It depends on what you're optimizing for.  :-)  We're optimizing for  
> minimum message passing latency on hosts that are not oversubscribed;  
> polling is very good at that.  Polling is much better than blocking,  
> particularly if the blocking involves a system call (which will be  
> "slow").  Note that in a compute-heavy environment, they nodes are  
> going to be running at 100% CPU anyway.
> 
> Also keep in mind that you're only going to have "waste" spinning in  
> MPI if you have a loosely/poorly synchronized application.  Granted,  
> some applications are this way by nature, but we have not chosen to  
> optimize spare CPU cycles for them.  As I said in a prior mail, adding  
> a blocking strategy is on the to-do list, but it's fairly low in  
> priority right now.  Someone may care / improve the message passing  
> engine to include blocking, but it hasn't happened yet.  Want to work  
> on it?  :-)
> 
> And for reference: almost all MPI's do busy polling to minimize  
> latency.  Some of them will shift to blocking if nothing happens for a  
> "long" time.  This second piece is what OMPI is lacking.
> 
>> Why
>> don't you use blocking and/or signals instead of
>> that?
> 
> FWIW: I mentioned this in my other mail -- latency is quite definitely  
> negatively impacted when you use such mechanisms.  Blocking and  
> signals are "slow" (in comparison to polling).
> 
>> I think the priority of this task is very high because polling
>> just wastes resources of the system.
> 
> In production HPC environments, the entire resource is dedicated to  
> the MPI app anyway, so there's nothing else that really needs it.  So  
> we allow them to busy-spin.
> 
> There is a mode to call yield() in the middle of every OMPI progress  
> loop, but it's only helpful for loosely/poorly synchronized MPI apps  
> and ones that use TCP or shared memory.  Low latency networks such as  
> IB or Myrinet won't be as friendly to this setting because they're  
> busy polling (i.e., they call yield() much less frequently, if at all).
> 
>> On the other hand,
>> what Alberto claims is not reasonable to me.
>>
>> Alberto,
>> - Are you oversubscribing one node which means that you are running  
>> your
>> code on a single processor machine, pretending
>> to have four CPUs?
>>
>> - Did you compile Open-MPI or installed from RPM?
>>
>> Receiving process shouldn't be that expensive.
>>
>> Regards,
>>
>> Danesh
>>
>>
>>
>> Jeff Squyres skrev:
>>> Because on-node communication typically uses shared memory, so we
>>> currently have to poll.  Additionally, when using mixed on/off-node
>>> communication, we have to alternate between polling shared memory and
>>> polling the network.
>>>
>>> Additionally, we actively poll because it's the best way to lower
>>> latency.  MPI implementations are almost always first judged on their
>>> latency, not [usually] their CPU utilization.  Going to sleep in a
>>> blocking system call will definitely negatively impact latency.
>>>
>>> We have plans for implementing the "spin for a while and then block"
>>> technique (as has been used in other MPI's and middleware layers),  
>>> but
>>> it hasn't been a high priority.
>>>
>>>
>>> On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
>>>
>>>
 Thanks Torje. I wonder what is the benefit of looping on the  
 incoming
 message-queue socket rather than using system I/O signals, like read
 () or select().

 On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:

> Hi Alberto,
>
> The blocked processes are in fact spin-waiting. While they don't  
> have
> anything better to do (waiting for that message), they will check
> their incoming message-queues in a loop.
>
> So the MPI_Recv()-operation is blocking, but it doesn't mean that  
> the

Re: [OMPI users] (no subject)

2008-04-23 Thread Jeff Squyres

Contributions are always welcome.  :-)

http://www.open-mpi.org/community/contribute/

To be less glib: Open MPI represents the union of the interests of its  
members.  So far, we've *talked* internally about adding a spin-then- 
block mechanism, but there's a non-trivial amount of work to make that  
happen.  Shared memory is the sticking point -- we have some good  
ideas how to make it work, but no one's had the time / resources to do  
it.  To be absolutely clear: no one's debating the value of adding  
blocking progress (as long as it's implemented in a way that  
absolutely does not affect the performance critical code path).  It's  
just that so far, it has not been important to any current member to  
add it (weighed against all the other features that we're working on).


If you (or anyone) find blocking progress important, we'd love for you  
to join Open MPI and contribute the work necessary to make it
happen.




On Apr 23, 2008, at 5:38 PM, Ingo Josopait wrote:


I can think of several advantages that using blocking or signals to
reduce the cpu load would have:

- Reduced energy consumption
- Running additional background programs could be done far more efficiently
- It would be much simpler to examine the load balance.

It may depend on the type of program and the computational environment,
but there are certainly many cases in which putting the system in idle
mode would be advantageous. This is especially true for programs with
low network traffic and/or high load imbalances.

The "spin for a while and then block" method that you mentioned  
earlier

seems to be a good compromise. Just do polling for some time that is
long compared to the corresponding system call, and then go to sleep if
nothing happens. In this way, the latency would be only marginally
increased, while less cpu time is wasted in the polling loops, and I
would be much happier.





Jeff Squyres schrieb:

On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:


Do you really mean that Open-MPI uses busy loop in order to handle
incomming calls? It seems to be incorrect since
spinning is a very bad and inefficient technique for this purpose.


It depends on what you're optimizing for.  :-)  We're optimizing for
minimum message passing latency on hosts that are not oversubscribed;
polling is very good at that.  Polling is much better than blocking,
particularly if the blocking involves a system call (which will be
"slow").  Note that in a compute-heavy environment, they nodes are
going to be running at 100% CPU anyway.

Also keep in mind that you're only going to have "waste" spinning in
MPI if you have a loosely/poorly synchronized application.  Granted,
some applications are this way by nature, but we have not chosen to
optimize spare CPU cycles for them.  As I said in a prior mail,  
adding

a blocking strategy is on the to-do list, but it's fairly low in
priority right now.  Someone may care / improve the message passing
engine to include blocking, but it hasn't happened yet.  Want to work
on it?  :-)

And for reference: almost all MPI's do busy polling to minimize
latency.  Some of them will shift to blocking if nothing happens  
for a

"long" time.  This second piece is what OMPI is lacking.


Why
don't you use blocking and/or signals instead of
that?


FWIW: I mentioned this in my other mail -- latency is quite  
definitely

negatively impacted when you use such mechanisms.  Blocking and
signals are "slow" (in comparison to polling).


I think the priority of this task is very high because polling
just wastes resources of the system.


In production HPC environments, the entire resource is dedicated to
the MPI app anyway, so there's nothing else that really needs it.  So
we allow them to busy-spin.

There is a mode to call yield() in the middle of every OMPI progress
loop, but it's only helpful for loosely/poorly synchronized MPI apps
and ones that use TCP or shared memory.  Low latency networks such as
IB or Myrinet won't be as friendly to this setting because they're
busy polling (i.e., they call yield() much less frequently, if at  
all).



On the other hand,
what Alberto claims is not reasonable to me.

Alberto,
- Are you oversubscribing one node which means that you are running
your
code on a single processor machine, pretending
to have four CPUs?

- Did you compile Open-MPI or installed from RPM?

Receiving process shouldn't be that expensive.

Regards,

Danesh



Jeff Squyres skrev:

Because on-node communication typically uses shared memory, so we
currently have to poll.  Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory  
and

polling the network.

Additionally, we actively poll because it's the best way to lower
latency.  MPI implementations are almost always first judged on  
their

latency, not [usually] their CPU utilization.  Going to sleep in a
blocking system call will definitely negatively impact latency.

We have plans for imple

[OMPI users] Busy waiting [was Re: (no subject)]

2008-04-23 Thread Barry Rountree
On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
> I can think of several advantages that using blocking or signals to
> reduce the cpu load would have:
> 
> - Reduced energy consumption

Not necessarily.  Any time the program ends up running longer, the
cluster is up and running (and wasting electricity) for that amount of
time.  In the case where lots of tiny messages are being sent you could
easily end up using more energy.  

> - Running additional background programs could be done far more efficiently

It's usually more efficient -- especially in terms of cache -- to batch
up programs to run one after the other instead of running them
simultaneously.  

> - It would be much simpler to examine the load balance.

This is true, but it's still pretty trivial to measure load imbalance.
MPI allows you to write a wrapper library that intercepts any MPI_*
call.  You can instrument the code however you like, then call PMPI_*,
then catch the return value, finish your instrumentation, and return
control to your program.  Here's some pseudocode:

int MPI_Barrier(MPI_Comm comm){
        struct timeval start, stop;   /* needs <sys/time.h> */
        int rc;
        /* logfile, rank and delta() are assumed to be defined elsewhere */
        gettimeofday(&start, NULL);
        rc=PMPI_Barrier( comm );
        gettimeofday(&stop, NULL);
        fprintf( logfile, "Barrier on node %d took %lf seconds\n",
                rank, delta(&stop, &start) );
        return rc;
}

I've got some code that does this for all of the MPI calls in OpenMPI
(ah, the joys of writing C code using python scripts).  Let me know if
you'd find it useful.
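
(In case it helps: a wrapper like that is normally compiled into its own
object or library and linked in ahead of the MPI library, so the
application's MPI_Barrier resolves to the wrapper and the wrapper's
PMPI_Barrier resolves to the real implementation -- e.g., with a made-up
file name barrier_wrap.c for the code above:

  mpicc -c barrier_wrap.c
  mpicc my_app.c barrier_wrap.o -o my_app

This is the standard MPI profiling (PMPI) interface, so it should work
with any MPI implementation.)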

> It may depend on the type of program and the computational environment,
> but there are certainly many cases in which putting the system in idle
> mode would be advantageous. This is especially true for programs with
> low network traffic and/or high load imbalances.

  I could use a few more benchmarks like that.  Seriously, if
you're mostly concerned about saving energy, a quick hack is to set a
timer as soon as you enter an MPI call (say for 100ms) and if the timer
goes off while you're still in the call, use DVS to drop your CPU
frequency to the lowest value it has.  Then, when you exit the MPI call,
pop it back up to the highest frequency.  This can save a significant
amount of energy, but even here there can be a performance penalty.  For
example, UMT2K schleps around very large messages, and you really need
to be running as fast as possible during the MPI_Waitall calls or the
program will slow down by 1% or so (thus using more energy).
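
A rough sketch of that hack, for the curious -- the sysfs path and the use
of the 'userspace' cpufreq governor are assumptions about a typical Linux
setup, and the frequency values are made up:

/* dvs_hack.c - "drop the CPU frequency if we've been stuck in an MPI call
   for 100 ms" (sketch only; not Open MPI code). */
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

static void set_cpu_khz(const char *khz)
{
  /* open/write/close are async-signal-safe, so this can run in a handler */
  int fd = open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", O_WRONLY);
  if( fd >= 0 ) { write(fd, khz, strlen(khz)); close(fd); }
}

static void on_timer(int sig) { (void)sig; set_cpu_khz("1000000"); } /* drop to 1.0 GHz */

static void arm_timer(void)        /* call just before a long MPI call */
{
  struct itimerval t = { {0, 0}, {0, 100000} };   /* one-shot, 100 ms */
  signal(SIGALRM, on_timer);
  setitimer(ITIMER_REAL, &t, NULL);
}

static void disarm_timer(void)     /* call right after the MPI call returns */
{
  struct itimerval off = { {0, 0}, {0, 0} };
  setitimer(ITIMER_REAL, &off, NULL);
  set_cpu_khz("1833000");                         /* back to full speed */
}

int main(void)
{
  arm_timer();
  sleep(1);        /* stands in for a long blocking MPI call; the alarm cuts it short */
  disarm_timer();
  return 0;
}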

Doing this just for Barriers and Allreduces seems to speed up the
program a tiny bit, but I haven't done enough runs to make sure this
isn't an artifact.

(This is my dissertation topic, so before asking any question be advised
that I WILL talk your ear off.)

> The "spin for a while and then block" method that you mentioned earlier
> seems to be a good compromise. Just do polling for some time that is
> long compared to the corresponding system call, and then go to sleep if
> nothing happens. In this way, the latency would be only marginally
> increased, while less cpu time is wasted in the polling loops, and I
> would be much happier.
> 

I'm interested in seeing what this does for energy savings.  Are you
volunteering to test a patch?  (I've got four other papers I need to
get finished up, so it'll be a few weeks before I start coding.)

Barry Rountree
Ph.D. Candidate, Computer Science
University of Georgia

> 
> 
> 
> 
> Jeff Squyres schrieb:
> > On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
> > 
> >> Do you really mean that Open-MPI uses busy loop in order to handle
> >> incomming calls? It seems to be incorrect since
> >> spinning is a very bad and inefficient technique for this purpose.
> > 
> > It depends on what you're optimizing for.  :-)  We're optimizing for  
> > minimum message passing latency on hosts that are not oversubscribed;  
> > polling is very good at that.  Polling is much better than blocking,  
> > particularly if the blocking involves a system call (which will be  
> > "slow").  Note that in a compute-heavy environment, they nodes are  
> > going to be running at 100% CPU anyway.
> > 
> > Also keep in mind that you're only going to have "waste" spinning in  
> > MPI if you have a loosely/poorly synchronized application.  Granted,  
> > some applications are this way by nature, but we have not chosen to  
> > optimize spare CPU cycles for them.  As I said in a prior mail, adding  
> > a blocking strategy is on the to-do list, but it's fairly low in  
> > priority right now.  Someone may care / improve the message passing  
> > engine to include blocking, but it hasn't happened yet.  Want to work  
> > on it?  :-)
> > 
> > And for reference: almost all MPI's do busy polling to minimize  
> > latency.  Some of them will shift to blocking if nothing happens for a  
> > "long" time.  This second piece is what OMPI is lacking.
> > 
> >> Why
> >> don't you use blocking and/or signals instead of
> >> that?
> > 
> > FWIW: I mentioned this in my other mail -- latency i

Re: [OMPI users] Processor affinitiy

2008-04-23 Thread Alberto Giannetti
Note: I'm running Tiger (Darwin 8.11.1). Things might have changed  
with Leopard.


On Apr 23, 2008, at 5:30 PM, Jeff Squyres wrote:

On Apr 23, 2008, at 3:01 PM, Alberto Giannetti wrote:


I would like to run one of my MPI processors to a single core on my
iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1.
It looks like processor affinity is not supported for this kind of
configuration:


I'm afraid that OS X doesn't have an API for processor affinity (so
there's no API for OMPI to call).  :-(

You might want to file a feature enhancement with Apple to have them
add one.  :-)

--
Jeff Squyres
Cisco Systems





Re: [OMPI users] (no subject)

2008-04-23 Thread Alberto Giannetti

No oversubscription. I did not recompile OMPI or install it from an RPM.

On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:

Do you really mean that Open-MPI uses busy loop in order to handle
incomming calls? It seems to be incorrect since
spinning is a very bad and inefficient technique for this purpose. Why
don't you use blocking and/or signals instead of
that? I think the priority of this task is very high because polling
just wastes resources of the system. On the other hand,
what Alberto claims is not reasonable to me.

Alberto,
- Are you oversubscribing one node which means that you are running  
your

code on a single processor machine, pretending
to have four CPUs?

- Did you compile Open-MPI or installed from RPM?

Receiving process shouldn't be that expensive.

Regards,

Danesh



Jeff Squyres skrev:

Because on-node communication typically uses shared memory, so we
currently have to poll.  Additionally, when using mixed on/off-node
communication, we have to alternate between polling shared memory and
polling the network.

Additionally, we actively poll because it's the best way to lower
latency.  MPI implementations are almost always first judged on their
latency, not [usually] their CPU utilization.  Going to sleep in a
blocking system call will definitely negatively impact latency.

We have plans for implementing the "spin for a while and then block"
technique (as has been used in other MPI's and middleware layers),  
but

it hasn't been a high priority.


On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:


Thanks Torje. I wonder what is the benefit of looping on the  
incoming

message-queue socket rather than using system I/O signals, like read
() or select().




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Problem compiling open MPI on cygwin on windows

2008-04-23 Thread Michael

Hi,

New to open MPI, but have used MPI before.

I am trying to compile Open MPI under Cygwin on Windows XP.  From what I
have read this should work?


Initially I hit a problem with the standard 1.2.6 download, in that a
time-related header file was incorrect, and the mailing list pointed me
to the trunk build to solve that problem.


Now when I try to compile I am  getting the following error at the 
bottom of this mail.


My question is: am I wasting my time trying to use Cygwin, or are there
people out there using it successfully?  If so, is there a solution to
the problem below?


Thanks in Advance,
Michael.

   mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT paffinity_windows_module.lo -MD -MP -MF .deps/paffinity_windows_module.Tpo -c paffinity_windows_module.c  -DDLL_EXPORT -DPIC -o .libs/paffinity_windows_module.o
paffinity_windows_module.c:41: error: parse error before "sys_info"
paffinity_windows_module.c:41: warning: data definition has no type or 
storage class

paffinity_windows_module.c: In function `windows_module_get_num_procs':
paffinity_windows_module.c:90: error: request for member 
`dwNumberOfProcessors' in something not a structure or union

paffinity_windows_module.c: In function `windows_module_set':
paffinity_windows_module.c:96: error: `HANDLE' undeclared (first use in 
this function)
paffinity_windows_module.c:96: error: (Each undeclared identifier is 
reported only once

paffinity_windows_module.c:96: error: for each function it appears in.)
paffinity_windows_module.c:96: error: parse error before "threadid"
paffinity_windows_module.c:97: error: `DWORD_PTR' undeclared (first use 
in this function)
paffinity_windows_module.c:99: error: `threadid' undeclared (first use 
in this function)
paffinity_windows_module.c:99: error: `process_mask' undeclared (first 
use in this function)
paffinity_windows_module.c:99: error: `system_mask' undeclared (first 
use in this function)

paffinity_windows_module.c: In function `windows_module_get':
paffinity_windows_module.c:116: error: `HANDLE' undeclared (first use in 
this function)

paffinity_windows_module.c:116: error: parse error before "threadid"
paffinity_windows_module.c:117: error: `DWORD_PTR' undeclared (first use 
in this function)
paffinity_windows_module.c:119: error: `threadid' undeclared (first use 
in this function)
paffinity_windows_module.c:119: error: `process_mask' undeclared (first 
use in this function)
paffinity_windows_module.c:119: error: `system_mask' undeclared (first 
use in this function)

make[2]: *** [paffinity_windows_module.lo] Error 1
make[2]: Leaving directory 
`/home/Michael/mpi/openmpi-1.3a1r18208/opal/mca/paffinity/windows'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/Michael/mpi/openmpi-1.3a1r18208/opal'
make: *** [all-recursive] Error 1





Re: [OMPI users] Processor affinity

2008-04-23 Thread Jeff Squyres

Things have not changed with Leopard.  :-(

On Apr 23, 2008, at 6:26 PM, Alberto Giannetti wrote:


Note: I'm running Tiger (Darwin 8.11.1). Things might have changed
with Leopard.

On Apr 23, 2008, at 5:30 PM, Jeff Squyres wrote:

On Apr 23, 2008, at 3:01 PM, Alberto Giannetti wrote:


I would like to bind one of my MPI processes to a single core on my
iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1.
It looks like processor affinity is not supported for this kind of
configuration:


I'm afraid that OS X doesn't have an API for processor affinity (so
there's no API for OMPI to call).  :-(

You might want to file a feature enhancement with Apple to have them
add one.  :-)

--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Problem compiling open MPI on cygwin on windows

2008-04-23 Thread George Bosilca
This component is not supposed to be included in a Cygwin build.  The
name containing "windows" indicates it is for native Windows compilation,
not for Cygwin or SUA.  Last time I checked I managed to compile
everything statically.  Unfortunately, I never tried to do a dynamic
build ... I'll give it a try tomorrow.
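
In the meantime, a possible workaround (untested here, and assuming the
usual configure syntax for excluding components applies to this one) is
to tell configure not to build it at all, e.g.:

  ./configure --enable-mca-no-build=paffinity-windows ...

That should at least let the rest of the tree compile under Cygwin.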


Thanks,
  george.



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



