[OMPI users] (no subject)
I have a simple MPI program that sends data to the processor with rank 0. The communication works well, but when I run the program on more than 2 processors (-np 4), the extra receivers waiting for data run at > 90% CPU load. I understand that MPI_Recv() is a blocking operation, but why does it consume so much CPU compared to a regular system read()?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

void process_sender(int);
void process_receiver(int);

int main(int argc, char* argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Processor %d (%d) initialized\n", rank, getpid());

    if( rank == 1 )
        process_sender(rank);
    else
        process_receiver(rank);

    MPI_Finalize();
}

void process_sender(int rank)
{
    int i, size;
    float data[100];

    printf("Processor %d initializing data...\n", rank);
    for( i = 0; i < 100; ++i )
        data[i] = i;

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Processor %d sending data...\n", rank);
    MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
    printf("Processor %d sent data\n", rank);
}

void process_receiver(int rank)
{
    int count;
    float value[200];
    MPI_Status status;

    printf("Processor %d waiting for data...\n", rank);
    MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55, MPI_COMM_WORLD, &status);
    printf("Processor %d Got data from processor %d\n", rank, status.MPI_SOURCE);
    MPI_Get_count(&status, MPI_FLOAT, &count);
    printf("Processor %d, Got %d elements\n", rank, count);
}
Re: [OMPI users] (no subject)
Hi Alberto,

The blocked processes are in fact spin-waiting. While they don't have anything better to do (waiting for that message), they will check their incoming message queues in a loop. So the MPI_Recv() operation is blocking, but that doesn't mean the processes are blocked by the OS scheduler.

I hope that made some sense :)

Best regards, Torje

On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
> I have a simple MPI program that sends data to the processor with rank 0. The communication works well, but when I run the program on more than 2 processors (-np 4), the extra receivers waiting for data run at > 90% CPU load. I understand that MPI_Recv() is a blocking operation, but why does it consume so much CPU compared to a regular system read()?
> [quoted program snipped]
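The difference between the two waiting styles can be reduced to a minimal sketch. This shows only the general pattern, not Open MPI's progress engine; the flag and the file descriptor are hypothetical stand-ins for a shared-memory message queue and a socket.

#include <unistd.h>

volatile int message_arrived = 0;   /* hypothetical flag, set when a message lands in the queue */

/* Spin-waiting: the process stays runnable and keeps one core near 100%. */
void wait_by_polling(void)
{
    while (!message_arrived)
        ;   /* re-check the incoming queue immediately */
}

/* Blocking: the process sleeps in the kernel until data is readable, so it
 * uses almost no CPU, but being woken up again costs extra latency. */
ssize_t wait_by_blocking(int fd, char *buf, size_t len)
{
    return read(fd, buf, len);   /* the scheduler parks the process here */
}

Both functions return only once data is available; the difference is what the process does with the CPU in the meantime.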
[OMPI users] idle calls?
I noticed that the CPU usage of an MPI program is always at 100 percent, even if the tasks are doing nothing but waiting for new data to arrive. Is there an option to change this behavior, so that the tasks sleep until new data arrives? Why is this the default behavior, anyway? Is it really so costly to put a task to sleep when it is idle?
Re: [OMPI users] (no subject)
Thanks Torje. I wonder what the benefit is of looping on the incoming message-queue socket rather than using blocking system I/O calls such as read() or select().

On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
> Hi Alberto, The blocked processes are in fact spin-waiting. While they don't have anything better to do (waiting for that message), they will check their incoming message queues in a loop. So the MPI_Recv() operation is blocking, but that doesn't mean the processes are blocked by the OS scheduler.
> [rest of the quoted thread snipped]
Re: [OMPI users] (no subject)
Because on-node communication typically uses shared memory, we currently have to poll. Additionally, when using mixed on/off-node communication, we have to alternate between polling shared memory and polling the network.

Additionally, we actively poll because it's the best way to lower latency. MPI implementations are almost always first judged on their latency, not [usually] their CPU utilization. Going to sleep in a blocking system call will definitely negatively impact latency.

We have plans for implementing the "spin for a while and then block" technique (as has been used in other MPIs and middleware layers), but it hasn't been a high priority.

On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
> Thanks Torje. I wonder what the benefit is of looping on the incoming message-queue socket rather than using blocking system I/O calls such as read() or select().
> [rest of the quoted thread snipped]

-- Jeff Squyres Cisco Systems
Re: [OMPI users] idle calls?
Please see another ongoing thread on this list about this exact topic: http://www.open-mpi.org/community/lists/users/2008/04/5457.php It unfortunately has a subject of "(no subject)", so it's not obvious that this is what the thread is about.

On Apr 23, 2008, at 12:14 PM, Ingo Josopait wrote:
> I noticed that the CPU usage of an MPI program is always at 100 percent, even if the tasks are doing nothing but waiting for new data to arrive. Is there an option to change this behavior, so that the tasks sleep until new data arrives? Why is this the default behavior, anyway? Is it really so costly to put a task to sleep when it is idle?

-- Jeff Squyres Cisco Systems
Re: [OMPI users] (no subject)
I am running the test program on Darwin 8.11.1, 1.83 GHz Intel dual core. My Open MPI install is 1.2.4. I can't see any allocated shared memory segment on my system (ipcs -m), although the receiver opens a couple of TCP sockets in listening mode. It looks like my implementation does not use shared memory. Is this a configuration issue?

a.out 5628 albertogiannetti 3u unix R,W,NB 0x380b198 0t0 ->0x41ced48
a.out 5628 albertogiannetti 4u unix R,W    0x41ced48 0t0 ->0x380b198
a.out 5628 albertogiannetti 5u IPv4 R,W,NB 0x3d4d920 0t0 TCP *:50969 (LISTEN)
a.out 5628 albertogiannetti 6u IPv4 R,W,NB 0x3e62394 0t0 TCP 192.168.0.10:50970->192.168.0.10:50962 (ESTABLISHED)
a.out 5628 albertogiannetti 7u IPv4 R,W,NB 0x422d228 0t0 TCP *:50973 (LISTEN)
a.out 5628 albertogiannetti 8u IPv4 R,W,NB 0x2dfd394 0t0 TCP 192.168.0.10:50969->192.168.0.10:50975 (ESTABLISHED)

On Apr 23, 2008, at 12:34 PM, Jeff Squyres wrote:
> Because on-node communication typically uses shared memory, we currently have to poll. Additionally, when using mixed on/off-node communication, we have to alternate between polling shared memory and polling the network.
> [rest of the quoted thread snipped]
Re: [OMPI users] (no subject)
OMPI doesn't use SYSV shared memory; it uses mmapped files. ompi_info will tell you all about the components installed. If you see a BTL component named "sm", then shared memory support is installed. I do not believe that we conditionally install sm on Linux or OS X systems -- it should always be installed.

ompi_info | grep btl

On Apr 23, 2008, at 2:55 PM, Alberto Giannetti wrote:
> I am running the test program on Darwin 8.11.1, 1.83 GHz Intel dual core. My Open MPI install is 1.2.4. I can't see any allocated shared memory segment on my system (ipcs -m), although the receiver opens a couple of TCP sockets in listening mode. It looks like my implementation does not use shared memory. Is this a configuration issue?
> [rest of the quoted thread snipped]

-- Jeff Squyres Cisco Systems
[OMPI users] Processor affinitiy
I would like to bind one of my MPI processes to a single core on my iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1. It looks like processor affinity is not supported for this kind of configuration:

$ ompi_info | grep affinity
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.4)

Open MPI: 1.2.4
Open MPI SVN revision: r16187
Open RTE: 1.2.4
Open RTE SVN revision: r16187
OPAL: 1.2.4
OPAL SVN revision: r16187
Prefix: /usr/local/
Configured architecture: i386-apple-darwin8.10.0
Configured by: brbarret
Configured on: Tue Oct 16 12:03:41 EDT 2007
Configure host: ford.osl.iu.edu
Built by: brbarret
Built on: Tue Oct 16 12:15:57 EDT 2007
Built host: ford.osl.iu.edu

Can I get processor affinity by recompiling the distribution, maybe with different configure options?
Re: [OMPI users] (no subject)
Do you really mean that Open MPI uses a busy loop in order to handle incoming calls? That seems incorrect, since spinning is a very bad and inefficient technique for this purpose. Why don't you use blocking and/or signals instead? I think the priority of this task is very high, because polling just wastes the resources of the system.

On the other hand, what Alberto claims is not reasonable to me. Alberto:
- Are you oversubscribing one node, i.e. running your code on a single-processor machine while pretending to have four CPUs?
- Did you compile Open MPI yourself, or did you install it from an RPM?

Receiving a message shouldn't be that expensive.

Regards, Danesh

Jeff Squyres wrote:
> Because on-node communication typically uses shared memory, we currently have to poll. Additionally, when using mixed on/off-node communication, we have to alternate between polling shared memory and polling the network.
> [rest of the quoted thread snipped]
[OMPI users] openmpi-1.3a1r18241 ompi-restart issue
Hello, I'm using openmpi-1.3a1r18241 on a 2-node configuration and having trouble with ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1-way MPI code. When I try a 2-way job running across 2 nodes, I get:

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159] Exec in self
Restart failed: Permission denied
Restart failed: Permission denied

If I try running as root, using the same snapshot file, the code restarts OK, but both tasks end up on the same node, rather than one per node (like the original mpirun). I'm using BLCR version 0.6.5. I generate checkpoints via 'ompi-checkpoint pid', where pid is the pid of the mpirun task below:

mpirun -np 2 -am ft-enable-cr ./xhpl

Thanks very much for any hints you can give on how to resolve either of these problems.
Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:
> Hello, I'm using openmpi-1.3a1r18241 on a 2-node configuration and having trouble with ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1-way MPI code. When I try a 2-way job running across 2 nodes, I get:
> bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
> [shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
> [shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
> [shc005:01159] Exec in self
> Restart failed: Permission denied
> Restart failed: Permission denied

This error is coming from BLCR. A few things to check. First take a look at /var/log/messages on the machine(s) you are trying to restart on. Per: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

Next check to make sure prelinking is turned off on the two machines you are using. Per: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

Those will rule out some common BLCR problems. (more below)

> If I try running as root, using the same snapshot file, the code restarts OK, but both tasks end up on the same node, rather than one per node (like the original mpirun).

You should never have to run as root to restart a process (or to run Open MPI in any form). So I'm wondering if your user has permissions to access the checkpoint files that BLCR is generating. You can look at the permissions for the individual checkpoint files by looking into the checkpoint handler directory. They are a bit hidden, so something like the following should expose them:

shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
total 1756
drwx------ 2 sharon users    4096 Apr 23 16:29 .
drwx------ 4 sharon users    4096 Apr 23 16:29 ..
-rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
-rw-r--r-- 1 sharon users      35 Apr 23 16:29 snapshot_meta.data

shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
total 1756
drwx------ 2 sharon users    4096 Apr 23 16:29 .
drwx------ 4 sharon users    4096 Apr 23 16:29 ..
-rw------- 1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
-rw-r--r-- 1 sharon users      35 Apr 23 16:29 snapshot_meta.data

The BLCR-generated context files are "ompi_blcr_context.PID", and you need to check to make sure that you have sufficient permissions to access those files (something like the above).

> I'm using BLCR version 0.6.5. I generate checkpoints via 'ompi-checkpoint pid', where pid is the pid of the mpirun task below:
> mpirun -np 2 -am ft-enable-cr ./xhpl

Are you running in a managed environment (e.g., using Torque or Slurm)? Odds are that once you switched to root you lost the environment variables for your allocation (which is how Open MPI detects when to use an allocation). This would explain why the processes were restarted on one node instead of two.

ompi-restart uses mpirun underneath to do the process launch in exactly the same way as a normal mpirun, so the mapping of processes should be the same. That being said, there is a bug that I'm tracking in which they are not. This bug has nothing to do with restarting processes, and more with a bookkeeping error when using app files.

> Thanks very much for any hints you can give on how to resolve either of these problems.

Let me know if this helps solve the problem. If not, we might be able to try some other things.

Cheers, Josh
Re: [OMPI users] (no subject)
On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
> Do you really mean that Open MPI uses a busy loop in order to handle incoming calls? That seems incorrect, since spinning is a very bad and inefficient technique for this purpose.

It depends on what you're optimizing for. :-) We're optimizing for minimum message passing latency on hosts that are not oversubscribed; polling is very good at that. Polling is much better than blocking, particularly if the blocking involves a system call (which will be "slow"). Note that in a compute-heavy environment, the nodes are going to be running at 100% CPU anyway.

Also keep in mind that you're only going to have "waste" spinning in MPI if you have a loosely/poorly synchronized application. Granted, some applications are this way by nature, but we have not chosen to optimize spare CPU cycles for them. As I said in a prior mail, adding a blocking strategy is on the to-do list, but it's fairly low in priority right now. Someone may eventually improve the message passing engine to include blocking, but it hasn't happened yet. Want to work on it? :-)

And for reference: almost all MPIs do busy polling to minimize latency. Some of them will shift to blocking if nothing happens for a "long" time. This second piece is what OMPI is lacking.

> Why don't you use blocking and/or signals instead?

FWIW: I mentioned this in my other mail -- latency is quite definitely negatively impacted when you use such mechanisms. Blocking and signals are "slow" (in comparison to polling).

> I think the priority of this task is very high, because polling just wastes the resources of the system.

In production HPC environments, the entire resource is dedicated to the MPI app anyway, so there's nothing else that really needs it. So we allow them to busy-spin.

There is a mode to call yield() in the middle of every OMPI progress loop, but it's only helpful for loosely/poorly synchronized MPI apps and ones that use TCP or shared memory. Low-latency networks such as IB or Myrinet won't be as friendly to this setting because they're busy polling (i.e., they call yield() much less frequently, if at all).

> On the other hand, what Alberto claims is not reasonable to me.
> [rest of the quoted thread snipped]
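The yield() mode mentioned above amounts to a progress loop of roughly this shape. This is a sketch of the general idea only, not Open MPI's implementation; the flag is a hypothetical stand-in for the real transport checks. In the 1.2 series the behavior is exposed as the mpi_yield_when_idle MCA parameter, and Open MPI turns it on automatically when it detects an oversubscribed node.

#include <sched.h>

volatile int message_arrived = 0;   /* hypothetical: set when a transport has something for us */

/* Poll, but give up the rest of the time slice on every empty pass so other
 * runnable processes on an oversubscribed node can make progress. The poller
 * still wakes up constantly, which is why this helps loosely synchronized
 * apps rather than reducing latency. */
void progress_with_yield(void)
{
    while (!message_arrived)
        sched_yield();
}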
Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
Josh Hursey wrote:
> This error is coming from BLCR. A few things to check. First take a look at /var/log/messages on the machine(s) you are trying to restart on. Per: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm
> Next check to make sure prelinking is turned off on the two machines you are using. Per: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
> [rest of the quoted reply snipped]

Thanks much.

vmadump: open('/var/run/nscd/passwd', 0x0) failed: -13
vmadump: mmap failed: /var/run/nscd/passwd

is indeed the problem, as shown by dmesg.
Re: [OMPI users] Processor affinitiy
On Apr 23, 2008, at 3:01 PM, Alberto Giannetti wrote:
> I would like to bind one of my MPI processes to a single core on my iMac Intel Core Duo system. I'm using release 1.2.4 on Darwin 8.11.1. It looks like processor affinity is not supported for this kind of configuration:

I'm afraid that OS X doesn't have an API for processor affinity (so there's no API for OMPI to call). :-(

You might want to file a feature enhancement with Apple to have them add one. :-)

-- Jeff Squyres Cisco Systems
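For comparison, Linux does expose such an API: a process can pin itself to a core with sched_setaffinity(). The sketch below is Linux-only and purely illustrative of what such a call looks like; nothing equivalent is available on Tiger or Leopard, which is why Open MPI's paffinity framework has nothing to hook into on OS X.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single core (Linux only; illustrative). */
int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}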
Re: [OMPI users] (no subject)
I can think of several advantages that using blocking or signals to reduce the CPU load would have:

- Reduced energy consumption
- Running additional background programs could be done far more efficiently
- It would be much simpler to examine the load balance

It may depend on the type of program and the computational environment, but there are certainly many cases in which putting the system in idle mode would be advantageous. This is especially true for programs with low network traffic and/or high load imbalances.

The "spin for a while and then block" method that you mentioned earlier seems to be a good compromise: just poll for some time that is long compared to the corresponding system call, and then go to sleep if nothing happens. In this way, the latency would be only marginally increased, while less CPU time is wasted in the polling loops, and I would be much happier.

Jeff Squyres wrote:
> It depends on what you're optimizing for. :-) We're optimizing for minimum message passing latency on hosts that are not oversubscribed; polling is very good at that.
> [rest of the quoted message snipped]
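The spin-then-block compromise described in this thread can be illustrated on a single file descriptor. This is a generic sketch of the technique, not Open MPI code; an MPI library would apply the same idea to its shared-memory queues and network endpoints, and the spin limit would need tuning.

#include <poll.h>
#include <unistd.h>

/* Busy-poll a descriptor for a while to keep latency low, then fall back
 * to a blocking wait in the kernel so an idle process stops burning CPU. */
ssize_t spin_then_block_read(int fd, void *buf, size_t len, long spin_limit)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    long spins;

    for (spins = 0; spins < spin_limit; spins++)
        if (poll(&pfd, 1, 0) > 0)        /* timeout 0: just check, don't sleep */
            return read(fd, buf, len);   /* fast path: data found while spinning */

    if (poll(&pfd, 1, -1) > 0)           /* timeout -1: block until readable */
        return read(fd, buf, len);       /* slow path: woken by the kernel */
    return -1;
}

The extra wakeup latency only appears on the slow path, which is exactly the trade-off discussed in this thread.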
Re: [OMPI users] (no subject)
Contributions are always welcome. :-) http://www.open-mpi.org/community/contribute/

To be less glib: Open MPI represents the union of the interests of its members. So far, we've *talked* internally about adding a spin-then-block mechanism, but there's a non-trivial amount of work to make that happen. Shared memory is the sticking point -- we have some good ideas how to make it work, but no one's had the time / resources to do it.

To be absolutely clear: no one's debating the value of adding blocking progress (as long as it's implemented in a way that absolutely does not affect the performance-critical code path). It's just that so far, it has not been important to any current member to add it (weighed against all the other features that we're working on). If you (or anyone) find blocking progress important, we'd love for you to join Open MPI and contribute the work necessary to make it happen.

On Apr 23, 2008, at 5:38 PM, Ingo Josopait wrote:
> I can think of several advantages that using blocking or signals to reduce the CPU load would have:
> - Reduced energy consumption
> - Running additional background programs could be done far more efficiently
> - It would be much simpler to examine the load balance
> [rest of the quoted message snipped]
[OMPI users] Busy waiting [was Re: (no subject)]
On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
> I can think of several advantages that using blocking or signals to reduce the CPU load would have:
>
> - Reduced energy consumption

Not necessarily. Any time the program ends up running longer, the cluster is up and running (and wasting electricity) for that amount of time. In the case where lots of tiny messages are being sent, you could easily end up using more energy.

> - Running additional background programs could be done far more efficiently

It's usually more efficient -- especially in terms of cache -- to batch up programs to run one after the other instead of running them simultaneously.

> - It would be much simpler to examine the load balance.

This is true, but it's still pretty trivial to measure load imbalance. MPI allows you to write a wrapper library that intercepts any MPI_* call. You can instrument the code however you like, then call PMPI_*, then catch the return value, finish your instrumentation, and return control to your program. Here's some pseudocode:

int MPI_Barrier(MPI_Comm comm)
{
    gettimeofday(&start, NULL);
    rc = PMPI_Barrier(comm);
    gettimeofday(&stop, NULL);
    fprintf(logfile, "Barrier on node %d took %lf seconds\n",
            rank, delta(&stop, &start));
    return rc;
}

I've got some code that does this for all of the MPI calls in OpenMPI (ah, the joys of writing C code using python scripts). Let me know if you'd find it useful.

> It may depend on the type of program and the computational environment, but there are certainly many cases in which putting the system in idle mode would be advantageous. This is especially true for programs with low network traffic and/or high load imbalances.

I could use a few more benchmarks like that. Seriously, if you're mostly concerned about saving energy, a quick hack is to set a timer as soon as you enter an MPI call (say for 100ms), and if the timer goes off while you're still in the call, use DVS to drop your CPU frequency to the lowest value it has. Then, when you exit the MPI call, pop it back up to the highest frequency. This can save a significant amount of energy, but even here there can be a performance penalty. For example, UMT2K schleps around very large messages, and you really need to be running as fast as possible during the MPI_Waitall calls or the program will slow down by 1% or so (thus using more energy). Doing this just for Barriers and Allreduces seems to speed up the program a tiny bit, but I haven't done enough runs to make sure this isn't an artifact. (This is my dissertation topic, so before asking any question be advised that I WILL talk your ear off.)

> The "spin for a while and then block" method that you mentioned earlier seems to be a good compromise. Just do polling for some time that is long compared to the corresponding system call, and then go to sleep if nothing happens. In this way, the latency would be only marginally increased, while less CPU time is wasted in the polling loops, and I would be much happier.

I'm interested in seeing what this does for energy savings. Are you volunteering to test a patch? (I've got four other papers I need to get finished up, so it'll be a few weeks before I start coding.)

Barry Rountree
Ph.D. Candidate, Computer Science
University of Georgia

> Jeff Squyres wrote:
> [rest of the quoted thread snipped]
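Filled out, the wrapper pseudocode above becomes a small self-contained profiling shim. This is an illustration of the standard PMPI interposition technique, not Barry's actual library; the delta() helper, the per-call rank lookup, and logging to stderr are choices made for this example.

#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

/* Seconds elapsed between two gettimeofday() samples. */
static double delta(struct timeval *stop, struct timeval *start)
{
    return (stop->tv_sec - start->tv_sec) +
           (stop->tv_usec - start->tv_usec) / 1.0e6;
}

/* Intercept MPI_Barrier: time the real call via PMPI_Barrier and log it. */
int MPI_Barrier(MPI_Comm comm)
{
    struct timeval start, stop;
    int rank, rc;

    PMPI_Comm_rank(comm, &rank);
    gettimeofday(&start, NULL);
    rc = PMPI_Barrier(comm);
    gettimeofday(&stop, NULL);
    fprintf(stderr, "Barrier on rank %d took %lf seconds\n",
            rank, delta(&stop, &start));
    return rc;
}

Compiled into a library and linked ahead of the MPI library, this intercepts every MPI_Barrier call without changing the application source; the same pattern extends to any other MPI_* routine.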
Re: [OMPI users] Processor affinitiy
Note: I'm running Tiger (Darwin 8.11.1). Things might have changed with Leopard.

On Apr 23, 2008, at 5:30 PM, Jeff Squyres wrote:
> I'm afraid that OS X doesn't have an API for processor affinity (so there's no API for OMPI to call). :-(
> You might want to file a feature enhancement with Apple to have them add one. :-)
> [rest of the quoted thread snipped]
Re: [OMPI users] (no subject)
No oversubscription. I did not recompile OMPI, nor did I install it from an RPM.

On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
> Do you really mean that Open MPI uses a busy loop in order to handle incoming calls? That seems incorrect, since spinning is a very bad and inefficient technique for this purpose. Why don't you use blocking and/or signals instead?
> [rest of the quoted thread snipped]
[OMPI users] Problem compiling open MPI on cygwin on windows
Hi,

New to Open MPI, but have used MPI before. I am trying to compile Open MPI under Cygwin on Windows XP. From what I have read this should work? Initially I hit a problem with the 1.2.6 standard download in that a time-related header file was incorrect, and the mailing list pointed me to the trunk build to solve that problem. Now when I try to compile I am getting the error at the bottom of this mail. My question is: am I wasting my time trying to use Cygwin, or are there people out there using Open MPI on Cygwin? If so, is there a solution to the problem below?

Thanks in advance, Michael.

mv -f $depbase.Tpo $depbase.Plo
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT paffinity_windows_module.lo -MD -MP -MF .deps/paffinity_windows_module.Tpo -c paffinity_windows_module.c -DDLL_EXPORT -DPIC -o .libs/paffinity_windows_module.o
paffinity_windows_module.c:41: error: parse error before "sys_info"
paffinity_windows_module.c:41: warning: data definition has no type or storage class
paffinity_windows_module.c: In function `windows_module_get_num_procs':
paffinity_windows_module.c:90: error: request for member `dwNumberOfProcessors' in something not a structure or union
paffinity_windows_module.c: In function `windows_module_set':
paffinity_windows_module.c:96: error: `HANDLE' undeclared (first use in this function)
paffinity_windows_module.c:96: error: (Each undeclared identifier is reported only once
paffinity_windows_module.c:96: error: for each function it appears in.)
paffinity_windows_module.c:96: error: parse error before "threadid"
paffinity_windows_module.c:97: error: `DWORD_PTR' undeclared (first use in this function)
paffinity_windows_module.c:99: error: `threadid' undeclared (first use in this function)
paffinity_windows_module.c:99: error: `process_mask' undeclared (first use in this function)
paffinity_windows_module.c:99: error: `system_mask' undeclared (first use in this function)
paffinity_windows_module.c: In function `windows_module_get':
paffinity_windows_module.c:116: error: `HANDLE' undeclared (first use in this function)
paffinity_windows_module.c:116: error: parse error before "threadid"
paffinity_windows_module.c:117: error: `DWORD_PTR' undeclared (first use in this function)
paffinity_windows_module.c:119: error: `threadid' undeclared (first use in this function)
paffinity_windows_module.c:119: error: `process_mask' undeclared (first use in this function)
paffinity_windows_module.c:119: error: `system_mask' undeclared (first use in this function)
make[2]: *** [paffinity_windows_module.lo] Error 1
make[2]: Leaving directory `/home/Michael/mpi/openmpi-1.3a1r18208/opal/mca/paffinity/windows'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/Michael/mpi/openmpi-1.3a1r18208/opal'
make: *** [all-recursive] Error 1
Re: [OMPI users] Processor affinitiy
Things have not changed with Leopard. :-(

On Apr 23, 2008, at 6:26 PM, Alberto Giannetti wrote:
> Note: I'm running Tiger (Darwin 8.11.1). Things might have changed with Leopard.
> [rest of the quoted thread snipped]

-- Jeff Squyres Cisco Systems
Re: [OMPI users] Problem compiling open MPI on cygwin on windows
This component is not supposed to get included in a Cygwin build. The name containing "windows" indicates it is for native Windows compilation, not for Cygwin or SUA. Last time I checked, I managed to compile everything statically. Unfortunately, I never tried to do a dynamic build ... I'll give it a try tomorrow.

Thanks,
george.

On Apr 23, 2008, at 7:51 PM, Michael wrote:
> New to Open MPI, but have used MPI before. I am trying to compile Open MPI under Cygwin on Windows XP. From what I have read this should work?
> [rest of the quoted message and build log snipped]