Re: [OMPI users] Problem running on multiple nodes with Java bindings
(Correction; I mixed up the output of the two first examples in my first
mail, so it fails on the first one)

ubuntu@node0:~$ mpirun --leave-session-attached -mca plm_base_verbose 5 -np 4 -host node0,node1,node2,node3 hostname
[node0:01486] mca:base:select:( plm) Querying component [slurm]
[node0:01486] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[node0:01486] mca:base:select:( plm) Querying component [rsh]
[node0:01486] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node0:01486] mca:base:select:( plm) Selected component [rsh]
[node2:26962] mca:base:select:( plm) Querying component [rsh]
[node2:26962] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node2:26962] mca:base:select:( plm) Selected component [rsh]
[node1:11477] mca:base:select:( plm) Querying component [rsh]
[node1:11477] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node1:11477] mca:base:select:( plm) Selected component [rsh]
Host key verification failed.

ubuntu@node0:~$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -host node0,node1,node2,node3 hostname
node0
node1
node2
node3

So it definitely looks like a problem with the tree spawn. Any clue how I
could proceed?

/Christoffer

2013/11/11 Ralph Castain:
> Add --enable-debug to your configure and run it with the following
> additional options:
>
>     --leave-session-attached -mca plm_base_verbose 5
>
> Let's see where it fails during the launch phase. Offhand, the only thing
> that message means to me is that the ssh keys are botched on at least one
> node. Keep in mind that we use a tree-based launch, and so when you have
> more than two nodes, one or more of the intermediate nodes are executing
> an ssh.
>
> One way to see if that's the problem is to launch without the tree spawn:
> add
>
>     -mca plm_rsh_no_tree_spawn 1
>
> to your cmd line and see if it works.
>
> On Nov 10, 2013, at 9:24 AM, Christoffer Hamberg
> <christoffer.hamb...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm having some strange problems running Open MPI (1.9a1r29559) with
> > Java bindings on a Calxeda Highbank ARM server running Ubuntu 12.10
> > (GNU/Linux 3.5.0-43-highbank armv7l).
> >
> > The problem arises when I try to run a job on more than 3 nodes (I have
> > a total of 8).
> > Note: It's the same error for any of the node[0-7].
> >
> > ubuntu@node0:~$ mpirun -np 4 -host node0,node1,node2 hostname
> > Host key verification failed.
> >
> > ubuntu@node0:~$ mpirun -np 4 -host node0,node1,node2,node3 hostname
> > node0
> > node0
> > node1
> > node2
> >
> > and not running the job on the current node also gives Host key
> > verification failed for only 3 nodes.
> >
> > ubuntu@node0:~$ mpirun -np 4 -host node1,node3,node5 hostname
> > Host key verification failed.
> >
> > But not on 2 nodes:
> > ubuntu@node0:~$ mpirun -np 4 -host node1,node3 hostname
> > node1
> > node1
> > node3
> > node3
> >
> > I've configured it with the following:
> > ./configure --prefix=/opt/openmpi-1.9-java --without-openib
> > --enable-static --with-threads=posix --enable-mpi-thread-multiple
> > --enable-mpi-java
> > --with-jdk-bindir=/usr/lib/jvm/java-7-openjdk-armhf/bin
> > --with-jdk-headers=/usr/lib/jvm/java-7-openjdk-armhf/include
> >
> > I have Open MPI 1.6.5 (without Java bindings) installed and it runs
> > without any problems on all nodes, so there should be no problem with
> > SSH that the error points to.
> >
> > Any ideas?
> >
> > Regards,
> > Christoffer

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Problem running on multiple nodes with Java bindings
On 11.11.2013, at 10:04, Christoffer Hamberg wrote:

> [verbose plm output trimmed]
> Host key verification failed.
>
> ubuntu@node0:~$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -host node0,node1,node2,node3 hostname
> node0
> node1
> node2
> node3
>
> So it definitely looks like a problem with the tree spawn. Any clue how I
> could proceed?

Is passphraseless ssh also possible between the nodes, not only from the
head node? Using hostbased authentication it's also possible to enable
this for all users, without the need to prepare ssh keys per user.

-- Reuti
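For reference, Reuti's hostbased-authentication suggestion boils down to
OpenSSH settings roughly like the following (a sketch only, not from the
thread: file locations are stock Debian/Ubuntu defaults, the host list is
this thread's node[0-7]; see sshd_config(5) and ssh_config(5) for details):

```
# /etc/ssh/sshd_config on every node
HostbasedAuthentication yes

# /etc/ssh/ssh_config on every node
HostbasedAuthentication yes
EnableSSHKeysign yes

# /etc/ssh/shosts.equiv -- hosts trusted without per-user keys
node0
node1
node2
node3
node4
node5
node6
node7

# Every node's host key must also be in /etc/ssh/ssh_known_hosts, e.g.:
#   ssh-keyscan node0 node1 node2 node3 node4 node5 node6 node7 \
#     >> /etc/ssh/ssh_known_hosts
```

The ssh_known_hosts step is the part that addresses the "Host key
verification failed" message itself, since that error comes from an unknown
host key rather than from the authentication method.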
Re: [OMPI users] Problem running on multiple nodes with Java bindings
I re-configured the ssh keys now and for some reason it seems to work. But
what baffles me is that the same ssh configuration worked for the other
installation (1.6.5) but not for this one.

Thanks for the help!

2013/11/11 Reuti:
> The passphraseless ssh is also possible between the nodes? Using
> hostbased authentication it's also possible to enable it for all users
> without the necessity to prepare the ssh keys.
>
> -- Reuti
Re: [OMPI users] Problem running on multiple nodes with Java bindings
IIRC, 1.6.5 defaults to *not* using the tree spawn. We changed it in the
1.7 series because the launch performance is so much better.

On Nov 11, 2013, at 8:22 AM, Christoffer Hamberg wrote:

> I re-configured the ssh keys now and for some reason it seems to work.
> But what baffles me is that the same ssh configuration worked for the
> other installation (1.6.5) but not for this one.
>
> Thanks for the help!
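The operative detail in Ralph's explanation is that with a tree-based
launch, the ssh connections no longer all originate from mpirun's node.
As an illustration only (Open MPI's real launch routing lives in its
plm/routed components and may differ in detail), the shape of a binomial
spawn tree over this thread's four nodes can be sketched as:

```python
def children(rank, size):
    """Ranks that `rank` would launch directly in a binomial tree.

    Illustrative sketch only -- not Open MPI's actual code. The point:
    for more than two nodes, non-root ranks also initiate launches
    (i.e., they run ssh themselves).
    """
    limit = (rank & -rank) if rank else size  # lowest set bit; root spans all
    out, step = [], 1
    while step < limit:
        if rank + step < size:
            out.append(rank + step)
        step *= 2
    return out

for rank in range(4):  # node0..node3 from this thread
    targets = ", ".join("node%d" % c for c in children(rank, 4)) or "(none)"
    print("node%d sshes to: %s" % (rank, targets))
```

In this layout node2 opens an ssh connection to node3, so node2's
known_hosts and keys matter, not just node0's, which is consistent with a
per-node ssh misconfiguration only surfacing once the tree spawn kicks in.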
Re: [OMPI users] Problem running on multiple nodes with Java bindings
That explains it, thank you for the quick answer.

2013/11/11 Ralph Castain:
> IIRC, 1.6.5 defaults to *not* using the tree spawn. We changed it in 1.7
> series because the launch performance is so much better.
Re: [OMPI users] proper use of MPI_Abort
On Nov 7, 2013, at 2:13 PM, "Andrus, Brian Contractor" wrote:

> Good to know. Thanks!

(sorry for the delay in replying; I was traveling last week, which always
makes a disaster of my INBOX)

> Seems really like MPI_ABORT should only be used within error traps after
> MPI functions have been started.

Correct. We treat MPI_Abort to mean that something Bad has happened, and
we should be noisy about it.

> Code-wise, the sample I got was not the best. Usage should be checked
> before MPI_Initialize, I think :)
>
> It seems the expectation is that MPI_ABORT is only called when the user
> should be notified something went haywire.

Yes. Another way you might consider handling this is to have one process
do the error checking (e.g., MCW rank 0) and then broadcast out a flag
result indicating "time to MPI_Finalize/exit" or "everything looks good;
let's proceed." I.e., something like this (pseudocode typed off the top
of my head; excuse errors):

-
    int rank, flag = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
        check_for_badness();
        if (bad) {
            fprintf(stderr, "Badness...\n");
            flag = 1;
        }
    }
    /* every rank learns rank 0's verdict */
    MPI_Bcast(&flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (flag) {
        MPI_Finalize();
        exit(1);
    }
-

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
FWIW, I tried this sample program on OMPI 1.6.x and 1.7.x, and it worked
for me on NFS.

On Nov 7, 2013, at 12:45 PM, Jeff Hammond wrote:

> That's a relatively old version of OMPI. Maybe try the latest release?
> That's always the safe bet, since the issue might have been fixed
> already.
>
> I recall that OMPI uses ROMIO, so you might try to reproduce with MPICH
> so you can report it to the people that wrote the MPI-IO code. Of
> course, this might not be an issue with ROMIO itself. Trying with MPICH
> is a good way to verify that.
>
> Best,
>
> Jeff
>
> Sent from my iPhone
>
> On Nov 7, 2013, at 10:55 AM, Steven G Johnson wrote:
>
>> The simple C program attached below hangs on MPI_File_write when I am
>> using an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS
>> filesystems?
>>
>> I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU,
>> Linux 3.2.51. I was surprised by this because the problems only started
>> occurring recently when I upgraded my Debian system to wheezy; with
>> OpenMPI in the previous Debian release, output to NFS-mounted
>> filesystems worked fine.
>>
>> Is there any easy way to get this working? Any tips are appreciated.
>>
>> Regards,
>> Steven G. Johnson
>>
>> ---
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <string.h>
>>
>> void perr(const char *label, int err)
>> {
>>     char s[MPI_MAX_ERROR_STRING];
>>     int len;
>>     MPI_Error_string(err, s, &len);
>>     printf("%s: %d = %s\n", label, err, s);
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>
>>     MPI_File fh;
>>     int err;
>>     err = MPI_File_open(MPI_COMM_WORLD, "tstmpiio.dat",
>>                         MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>                         MPI_INFO_NULL, &fh);
>>     perr("open", err);
>>
>>     const char s[] = "Hello world!\n";
>>     MPI_Status status;
>>     err = MPI_File_write(fh, (void*) s, strlen(s), MPI_CHAR, &status);
>>     perr("write", err);
>>
>>     err = MPI_File_close(&fh);
>>     perr("close", err);
>>
>>     MPI_Finalize();
>>     return 0;
>> }

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
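One configuration detail worth checking, not raised in this thread (so
treat it as an editorial assumption rather than the list's diagnosis):
ROMIO's documentation requires that NFS file systems used for MPI-IO be
mounted with attribute caching disabled, since client-side caching can
cause hangs or corrupted output. That means a mount option like `noac`,
e.g. an /etc/fstab entry along these lines (server name and mount point
are placeholders):

```
fileserver:/export/data  /data  nfs  noac  0  0
```

Note that `noac` hurts general NFS performance, so it is often applied
only to the mounts used for parallel I/O.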
Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing
Jim --

This has been bugging me for a while. I finally got to check today: it
looks like we compute MPI_STATUS_SIZE correctly in both the 32- and 64-bit
cases. That is, MPI_STATUS_SIZE exactly reflects the size of the C
MPI_Status (4 int's and a size_t, or 4 * 4 + 8 = 24 bytes), regardless of
whether Fortran INTEGERs are 32 or 64 bits.

There are two implications here:

1. OMPI should not be overwriting your status array if you're using
   MPI_STATUS_SIZE as its length. Meaning: in the 32-bit case,
   MPI_STATUS_SIZE=6, and in the 64-bit case, MPI_STATUS_SIZE=3.

2. In the 64-bit case, you'll have a difficult time extracting the MPI
   status values from the 8-byte INTEGERs in the status array in Fortran
   (because the first 2 of the 3 will each really be 2 4-byte integers).

So while #2 is a little weird (and probably should be fixed), a
properly-sized MPI_STATUS_SIZE array shouldn't be causing any problems. So
I'm still a little befuddled as to why you're seeing an error. :-\

On Nov 1, 2013, at 4:51 PM, Jim Parker wrote:

> @Jeff,
>
> Well, it may have been "just for giggles", but it Worked!! My helloWorld
> program ran as expected. My original code ran through the initialization
> parts without overwriting data. It will take a few days to finish
> computation and analysis to ensure it ran as expected. I'll report back
> when I get done.
>
> It looks like a good Friday!
>
> Cheers,
> --Jim
>
> On Thu, Oct 31, 2013 at 4:06 PM, Jeff Squyres (jsquyres) wrote:
>
>> For giggles, try using MPI_STATUS_IGNORE (assuming you don't need to
>> look at the status at all). See if that works for you.
>>
>> Meaning: I wonder if we're computing the status size for Fortran
>> incorrectly in the -i8 case...
>>
>> On Oct 31, 2013, at 1:58 PM, Jim Parker wrote:
>>
>>> Some additional info that may jog some solutions. Calls to MPI_SEND do
>>> not cause memory corruption. Only calls to MPI_RECV. Since the main
>>> difference is the fact that MPI_RECV needs a "status" array and SEND
>>> does not, this seems to indicate to me that something is wrong with
>>> status.
>>>
>>> Also, I can run a C version of the helloWorld program with no errors.
>>> However, int types are only 4-byte. To send 8-byte integers, I define
>>> tempInt as long int and pass MPI_LONG as a type.
>>>
>>> @Jeff,
>>> I got a copy of the openmpi config.log. See attached.
>>>
>>> Cheers,
>>> --Jim
>>>
>>> On Wed, Oct 30, 2013 at 10:55 PM, Jim Parker wrote:
>>>
>>>> Ok, all, where to begin...
>>>>
>>>> Perhaps I should start with the most pressing issue for me. I need
>>>> 64-bit indexing.
>>>>
>>>> @Martin,
>>>> you indicated that even if I get this up and running, the MPI library
>>>> still uses signed 32-bit ints to count (your term), or index (my
>>>> term), the recvbuffer lengths. More concretely, in a call to
>>>> MPI_Allgatherv(buffer, count, MPI_INTEGER, recvbuf, recvcounts,
>>>> displs, MPI_INTEGER, MPI_COMM_WORLD, mpierr): count, recvcounts, and
>>>> displs must be 32-bit integers, not 64-bit. Actually, all I need is
>>>> displs to hold 64-bit values...
>>>> If this is true, then compiling OpenMPI this way is not a solution.
>>>> I'll have to restructure my code to collect 31-bit chunks...
>>>> Not that it matters, but I'm not using DIRAC, but a custom code to
>>>> compute circuit analyses.
>>>>
>>>> @Jeff,
>>>> Interesting, your runtime behavior has a different error than mine.
>>>> You have problems with the passed variable tempInt, which would make
>>>> sense for the reasons you gave. However, my problem involves the fact
>>>> that the local variable "rank" gets overwritten by a memory
>>>> corruption after MPI_RECV is called.
>>>>
>>>> Re: config.log. I will try to have the admin guy recompile tomorrow
>>>> and see if I can get the log for you.
>>>>
>>>> BTW, I'm using the gcc 4.7.2 compiler suite on a Rocks 5.4 HPC
>>>> cluster. I use the options -m64 and -fdefault-integer-8
>>>>
>>>> Cheers,
>>>> --Jim
>>>>
>>>> On Wed, Oct 30, 2013 at 7:36 PM, Martin Siegert wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> I have quite a bit of experience with compiling openmpi for dirac.
>>>>> Here is what I use to configure openmpi:
>>>>>
>>>>> ./configure --prefix=$instdir \
>>>>>     --disable-silent-rules \
>>>>>     --enable-mpirun-prefix-by-default \
>>>>>     --with-threads=posix \
>>>>>     --enable-cxx-exceptions \
>>>>>     --with-tm=$torquedir \
>>>>>     --with-wrapper-ldflags="-Wl,-rpath,${instdir}/lib" \
>>>>>     --with-openib \
>>>>>     --with-hwloc=$hwlocdir \
>>>>>     CC=gcc \
>>>>>     CXX=g++ \
>>>>>     FC="$FC" \
>>>>>     F77="$FC" \
>>>>>     CFLAGS="-O3" \
>>>>>     CXXFLAGS="-O3" \
>>>>>     FFLAGS="-O3 $I8FLAG" \
>>>>>     FCFLAGS="-O3 $I8FLAG"
>>>>>
>>>>> You need to set FC to either ifort or gfortran (those are the two
>>>>> compilers that I have used) and set I8FLAG to -fdefault-integer-8
>>>>> for gfortran or -i8 for ifort.
>>>>> Set torquedir to the directo
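Jeff's MPI_STATUS_SIZE arithmetic from earlier in this thread can be
checked with a quick sketch (Python's ctypes is used purely for
illustration; StatusLike is a hypothetical stand-in mirroring the "4 int's
and a size_t" description, not the real MPI_Status definition, and the
byte counts assume an LP64 platform):

```python
import ctypes

class StatusLike(ctypes.Structure):
    # Hypothetical mirror of Open MPI's C MPI_Status: 4 ints + a size_t.
    _fields_ = [("fields", ctypes.c_int * 4),
                ("count", ctypes.c_size_t)]

c_bytes = ctypes.sizeof(StatusLike)  # 4*4 + 8 = 24 on LP64
print("C MPI_Status:", c_bytes, "bytes")
print("MPI_STATUS_SIZE with 4-byte INTEGER:", c_bytes // 4)  # 6 when c_bytes == 24
print("MPI_STATUS_SIZE with 8-byte INTEGER:", c_bytes // 8)  # 3 when c_bytes == 24
```

The division by the Fortran INTEGER width is exactly why -i8/-fdefault-
integer-8 builds see MPI_STATUS_SIZE=3, and why the first two 8-byte
INTEGERs of the status each straddle two 4-byte C fields.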