[O-MPI users] Problems running svn7132 with more than one proc per node
Hello,

I'm playing with a copy of svn7132 that built and installed just fine. At first everything seemed OK; unlike earlier builds it now runs on mvapi automagically :-) But then a small test program failed, and then another. After scratching my head a while I realised the pattern: as soon as two ranks shared one node and I used "mpi_leave_pinned 1", it broke (segfaulted).

Here is a bidirectional point-to-point test running two ranks on the same host (this one actually starts but segfaults halfway through). NODEFILE is "n50 n50":

[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 --np 2 mpibibench.ompi7132
Using Zero pattern.
starting _bidirect_ lat-bw test.
Latency: 1.8 µsec (total)  Bandwidth: 0.0 bytes/s (0 x 1)
Latency: 2.0 µsec (total)  Bandwidth: 1.0 Mbytes/s (1 x 1)
Latency: 2.0 µsec (total)  Bandwidth: 2.0 Mbytes/s (2 x 1)
Latency: 1.9 µsec (total)  Bandwidth: 4.2 Mbytes/s (4 x 1)
Latency: 2.0 µsec (total)  Bandwidth: 8.1 Mbytes/s (8 x 1)
Latency: 2.2 µsec (total)  Bandwidth: 14.8 Mbytes/s (16 x 1)
Latency: 2.0 µsec (total)  Bandwidth: 31.7 Mbytes/s (32 x 1)
Latency: 2.2 µsec (total)  Bandwidth: 57.3 Mbytes/s (64 x 1)
Latency: 2.2 µsec (total)  Bandwidth: 114.3 Mbytes/s (128 x 1)
Latency: 2.3 µsec (total)  Bandwidth: 224.8 Mbytes/s (256 x 1)
Latency: 2.8 µsec (total)  Bandwidth: 369.8 Mbytes/s (512 x 1)
mpirun noticed that job rank 0 with PID 5879 on node "n50" exited on signal 11.
1 additional process aborted (not shown)

From dmesg:
mpibibench.ompi[5879]: segfault at rip rsp 007fbfffe8e8 error 14

Running on more than one node seems to die instantly (simple all-to-all app). NODEFILE is "n50 n50 n49 n49":

[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 --np 4 alltoall.ompi7132
mpirun noticed that job rank 3 with PID 27857 on node "n49" exited on signal 11.
3 additional processes aborted (not shown)

...with a similar segfault in dmesg.

Either running with one proc per node or skipping mpi_leave_pinned makes it work 100%. Is this expected?

tia,
Peter

System config:
OS: centos-4.1 x86_64 2.6.9-11smp (el4u1)
ompi: svn7132 vpath build with recommended libtool/autoconf/automake
compilers: 64-bit icc/ifort 8.1-029
configure: ./configure --prefix=xxx --with-btl-mvapi=yyy --disable-cxx --disable-f90 --disable-io-romio

--
Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
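[Editor's note: the source of alltoall.ompi7132 and mpibibench is not included in the thread. As a rough, hypothetical sketch of the kind of program that hits this code path, a minimal all-to-all test that reuses the same buffers every iteration (so that the registration cache kept by mpi_leave_pinned comes into play) might look like the following. Names and sizes are made up for illustration.]

/* Hypothetical minimal reproducer, not the actual alltoall.ompi7132 source.
 * Build: mpicc -o alltoall_repro alltoall_repro.c
 * Run:   mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 --np 4 ./alltoall_repro
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, iter;
    int count = 64 * 1024;                 /* ints sent to each peer */
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = (int *)malloc((size_t)count * size * sizeof(int));
    recvbuf = (int *)malloc((size_t)count * size * sizeof(int));
    for (i = 0; i < count * size; i++)
        sendbuf[i] = rank;

    /* Re-use the same buffers on every iteration so that the pinned-memory
     * registrations cached by mpi_leave_pinned are exercised. */
    for (iter = 0; iter < 100; iter++)
        MPI_Alltoall(sendbuf, count, MPI_INT,
                     recvbuf, count, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("all-to-all iterations completed\n");

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}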
Re: [O-MPI users] open-mpi team - good job and a question
On Sep 1, 2005, at 9:11 PM, Borenstein, Bernard S wrote:

> I tried a recent tarball of open-mpi and it worked well on my 64-bit linux cluster running the NASA Overflow 1.8ab CFD code. I'm looking forward to the 1.0 release of the product. The only strange thing I noticed is that the output file is written in DOS format (with CR and LF) on my unix system. Did I miss something, or is this the way it is supposed to work?

I'm not sure that I understand your question -- Open MPI shouldn't really have anything to do with the output file of your application unless it is written with MPI-2 I/O functions. Is this a file that Overflow itself generates?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
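[Editor's note: "written with MPI-2 I/O functions" refers to files produced through calls such as MPI_File_open/MPI_File_write, i.e. the only application file output that goes through the MPI library itself rather than the ordinary C/Fortran runtime. A minimal sketch, with a made-up file name and contents:]

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    char line[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sprintf(line, "rank %d reporting\n", rank);   /* plain '\n', no CR added here */

    /* Each rank writes its line at a fixed offset in a shared file. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(line), line,
                      (int)strlen(line), MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}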
Re: [O-MPI users] Problems running svn7132 with more than one proc per node
Thanks Peter,

We'll look into this...

Tim

Peter Kjellström wrote:
> [Peter's report quoted in full; see the original message above]
Re: [O-MPI users] open-mpi team - good job and a question
On Sep 1, 2005, at 8:11 PM, Borenstein, Bernard S wrote:

> [same question quoted in the previous reply]

I'm not familiar with the NASA Overflow code. I'm assuming that you are talking about output written via stdout/stderr? If so, we play some small games with ptys to do I/O redirection, but I haven't seen anything like this before. Can you provide any more information (where the output is going, how the DOS/UNIX format is selected, etc.) that might help pin the issue down?

Brian
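[Editor's note: as a hedged aside on the pty point, one quick way to tell whether a rank's stdout is being forwarded through a pty by mpirun (as opposed to the application writing its own data file directly) is an isatty() check inside the job. This is a hypothetical diagnostic, not part of the Overflow code:]

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* If stdout is a tty/pty here, the launcher's I/O forwarding path is
     * in play for this rank's standard output. */
    if (isatty(STDOUT_FILENO))
        printf("rank %d: stdout is a tty (%s)\n", rank, ttyname(STDOUT_FILENO));
    else
        printf("rank %d: stdout is not a tty\n", rank);

    MPI_Finalize();
    return 0;
}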