Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit
Hi all,

Now LAM-MPI is also installed and the Fortran application has been tested by running it with LAM-MPI. But LAM-MPI is still performing worse than Open MPI.

No of nodes:3 cores per node:8 total core: 3*8=24
CPU TIME :1 HOURS 51 MINUTES 23.49 SECONDS
ELAPSED TIME :7 HOURS 28 MINUTES 2.23 SECONDS

No of nodes:6 cores used per node:4 total core: 6*4=24
CPU TIME :0 HOURS 51 MINUTES 50.41 SECONDS
ELAPSED TIME :6 HOURS 6 MINUTES 38.67 SECONDS

Any help/suggestions to diagnose this problem would be appreciated.

Thanks,
Sangamesh

On Wed, Feb 25, 2009 at 12:51 PM, Sangamesh B wrote:
> Dear All,
>
> A fortran application is installed with Open MPI-1.3 + Intel
> compilers on a Rocks-4.3 cluster with Intel Xeon Dual socket Quad core
> processor @ 3GHz (8cores/node).
>
> The time consumed for different tests over a Gigabit connected
> nodes are as follows: (Each node has 8 GB memory).
>
> No of Nodes used:6 No of cores used/node:4 total mpi processes:24
> CPU TIME : 1 HOURS 19 MINUTES 14.39 SECONDS
> ELAPSED TIME : 2 HOURS 41 MINUTES 8.55 SECONDS
>
> No of Nodes used:6 No of cores used/node:8 total mpi processes:48
> CPU TIME : 4 HOURS 19 MINUTES 19.29 SECONDS
> ELAPSED TIME : 9 HOURS 15 MINUTES 46.39 SECONDS
>
> No of Nodes used:3 No of cores used/node:8 total mpi processes:24
> CPU TIME : 2 HOURS 41 MINUTES 27.98 SECONDS
> ELAPSED TIME : 4 HOURS 21 MINUTES 0.24 SECONDS
>
> But the same application performs well on another Linux cluster with
> LAM-MPI-7.1.3
>
> No of Nodes used:6 No of cores used/node:4 total mpi processes:24
> CPU TIME : 1hours:30min:37.25s
> ELAPSED TIME 1hours:51min:10.00S
>
> No of Nodes used:12 No of cores used/node:4 total mpi processes:48
> CPU TIME : 0hours:46min:13.98s
> ELAPSED TIME 1hours:02min:26.11s
>
> No of Nodes used:6 No of cores used/node:8 total mpi processes:48
> CPU TIME : 1hours:13min:09.17s
> ELAPSED TIME 1hours:47min:14.04s
>
> So there is a huge difference between CPU TIME & ELAPSED TIME for Open MPI
> jobs.
>
> Note: On the same cluster Open MPI gives better performance for
> infiniband nodes.
>
> What could be the problem for Open MPI over Gigabit?
> Any flags need to be used?
> Or is it not that good to use Open MPI on Gigabit?
>
> Thanks,
> Sangamesh
[OMPI users] metahosts (like in MP-MPICH)
Can't find this in FAQ... Can I create the metahost in OpenMPI (a la MP-MPICH), to execute the MPI application simultaneously on several physically different machines connected by TCP/IP? --
Re: [OMPI users] metahosts (like in MP-MPICH)
I'm not quite sure what an MP-MPICH meta host is. Open MPI allows you to specify multiple hosts in a hostfile and run a single MPI job across all of them, assuming they're connected by at least some common TCP network. On Mar 4, 2009, at 4:42 AM, Yury Tarasievich wrote: Can't find this in FAQ... Can I create the metahost in OpenMPI (a la MP-MPICH), to execute the MPI application simultaneously on several physically different machines connected by TCP/IP? -- ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
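For illustration, a minimal sketch of the hostfile approach described above; the hostnames, slot counts and application name are placeholders, not taken from this thread:

  # hostfile: one line per machine, slots = processes allowed on that machine
  node1 slots=8
  node2 slots=8

  # run a single MPI job spanning both machines over TCP
  mpirun --hostfile hostfile -np 16 ./my_mpi_app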
Re: [OMPI users] libnuma under ompi 1.3
Hmm; that's odd. Is icc / icpc able to find libnuma with no -L, but ifort is unable to find it without a -L? On Mar 3, 2009, at 10:00 PM, Terry Frankcombe wrote: Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I merrily went off to compile my application. In my final link with mpif90 I get the error: /usr/bin/ld: cannot find -lnuma Adding --showme reveals that -I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib is added to the compile early in the aggregated ifort command, and -L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm - ldl is added to the end. I note than when compiling Open MPI -lnuma was visible in the gcc arguments, with no added -L. On this system libnuma.so exists in /usr/lib64. My (somewhat long!) configure command was ./configure --enable-static --disable-shared --prefix=/home/terry/bin/Local --enable-picky --disable-heterogeneous --without-slurm --without-alps --without-xgrid --without-sge --without-loadleveler --without-lsf F77=ifort Should mpif90 have bundled a -L/usr/lib64 in there somewhere? Regards Terry -- Dr. Terry Frankcombe Research School of Chemistry, Australian National University Ph: (+61) 0417 163 509Skype: terry.frankcombe ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
Re: [OMPI users] MPI-IO Inconsistency over Lustre using OMPI 1.3
Unfortunately, we don't have a whole lot of insight into how the internals of the IO support work -- we mainly bundle the ROMIO package from MPICH2 into Open MPI. Our latest integration was the ROMIO from MPICH2 v1.0.7. Do you see the same behavior if you run your application under MPICH2 compiled with Lustre ROMIO support?

On Mar 3, 2009, at 12:51 PM, Nathan Baca wrote:

Hello,

I am seeing inconsistent mpi-io behavior when writing to a Lustre file system using open mpi 1.3 with romio. What follows is a simple reproducer and output. Essentially one or more of the running processes does not read or write the correct amount of data to its part of a file residing on a Lustre (parallel) file system.

Any help figuring out what is happening is greatly appreciated. Thanks, Nate

program gcrm_test_io
  implicit none
  include "mpif.h"

  integer X_SIZE
  integer w_me, w_nprocs
  integer my_info
  integer i
  integer (kind=4) :: ierr
  integer (kind=4) :: fileID
  integer (kind=MPI_OFFSET_KIND) :: mylen
  integer (kind=MPI_OFFSET_KIND) :: offset
  integer status(MPI_STATUS_SIZE)
  integer count
  integer ncells
  real (kind=4), allocatable, dimension (:) :: array2
  logical sync

  call mpi_init(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, w_nprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, w_me, ierr)

  call mpi_info_create(my_info, ierr)
  ! optional ways to set things in mpi-io
  ! call mpi_info_set (my_info, "romio_ds_read" , "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_ds_write", "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_cb_write", "enable", ierr)

  x_size = 410011   ! A 'big' number, with bigger numbers it is more likely to fail
  sync = .true.     ! Extra file synchronization

  ncells = (X_SIZE * w_nprocs)

  ! Use node zero to fill it with nines
  if (w_me .eq. 0) then
     call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
     allocate (array2(ncells))
     array2(:) = 9.0
     mylen = ncells
     offset = 0 * 4
     call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, "native", MPI_INFO_NULL, ierr)
     call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
     call MPI_Get_count(status, MPI_INTEGER, count, ierr)
     if (count .ne. mylen) print*, "Wrong initial write count:", count, mylen
     deallocate(array2)
     if (sync) call MPI_FILE_SYNC (fileID, ierr)
     call MPI_FILE_CLOSE (fileID, ierr)
  endif

  ! All nodes now fill their area with ones
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  allocate (array2(X_SIZE))
  array2(:) = 1.0
  offset = (w_me * X_SIZE) * 4   ! multiply by four, since it is real*4
  mylen = X_SIZE
  call MPI_FILE_OPEN (MPI_COMM_WORLD, "output.dat", MPI_MODE_WRONLY, my_info, fileID, ierr)
  print*, "node", w_me, "starting", (offset/4) + 1, "ending", (offset/4) + mylen
  call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, "native", MPI_INFO_NULL, ierr)
  call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
  call MPI_Get_count(status, MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong write count:", count, mylen, w_me
  deallocate(array2)
  if (sync) call MPI_FILE_SYNC (fileID, ierr)
  call MPI_FILE_CLOSE (fileID, ierr)

  ! Read it back on node zero to see if it is ok data
  if (w_me .eq. 0) then
     call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, my_info, fileID, ierr)
     mylen = ncells
     allocate (array2(ncells))
     call MPI_File_read(fileID, array2, mylen, MPI_REAL, status, ierr)
     call MPI_Get_count(status, MPI_INTEGER, count, ierr)
     if (count .ne. mylen) print*, "Wrong read count:", count, mylen
     do i = 1, ncells
        if (array2(i) .ne. 1) then
           print*, "ERROR", i, array2(i), ((i-1)*4), ((i-1)*4)/(1024d0*1024d0)   ! Index, value, # of good bytes, MB
           goto 999
        end if
     end do
     print*, "All done with nothing wrong"

999  deallocate(array2)
     call MPI_FILE_CLOSE (fileID, ierr)
     call MPI_file_delete ("output.dat", MPI_INFO_NULL, ierr)
  endif

  call mpi_finalize(ierr)
end program gcrm_test_io

1.3 Open MPI
node 0 starting 1 ending 410011
node 1 starting 410012 ending 820022
node 2 starting 820023 ending 1230033
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 5 starting 2050056 ending 2460066
All done with nothing wrong
Re: [OMPI users] Calculation stuck in MPI
No, it is not obvious, unfortunately. Can you send all the information listed here: http://www.open-mpi.org/community/help/

On Mar 3, 2009, at 5:22 AM, Ondrej Marsalek wrote:

Dear everyone,

I have a calculation (the CP2K program) using MPI over Infiniband and it is stuck. All processes (16 on 4 nodes) are running, taking 100% CPU. Attaching a debugger reveals this (only the end of the stack shown here):

(gdb) backtrace
#0 0x2b3460916dbf in btl_openib_component_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1 0x2b345c22c778 in opal_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2 0x2b345bd2d66d in ompi_request_default_wait_any () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3 0x2b345bd6021a in PMPI_Waitany () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4 0x2b345bae77f1 in pmpi_waitany__ () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0

It has survived a restart of the IB switch, unlike "healthy" runs. My question is - is it obvious at what level the problem is? IB, Open MPI, application? I would be glad to provide detailed information, if anyone was willing to help. I want to work on this, but unfortunately I am not sure where to begin.

Best regards,
Ondrej Marsalek

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users

-- Jeff Squyres
Cisco Systems
Re: [OMPI users] Lahey 64 bit and openmpi 1.3?
On Mar 2, 2009, at 10:17 AM, Tiago Silva wrote:

Has anyone had success building openmpi with the 64 bit Lahey fortran compiler? I have seen a previous thread about the problems with 1.2.6 and am wondering if any progress has been made. I can build individual libraries by removing -rpath and -soname, and by compiling the respective objects with -KPIC. Nevertheless I couldn't come up with FCFLAGS and LDFLAGS that would both pass the makefile tests and build successfully.

Unfortunately, I don't think any of us test with the Lahey compiler. So it's quite possible that there may be some issues there. Do you know if GNU Libtool supports the Lahey compiler? We basically support what Libtool supports because Libtool essentially *is* our building process. So if Libtool doesn't support it, then we likely don't either.

How do I find the libtool generated script, as suggested in the previous thread?

I'm not sure which specific script you're referring to. The "libtool" script itself should be generated after you run "configure" -- it should be in the top-level Open MPI directory.

openmpi 1.3
Lahey Linux64 8.10a
CentOS 5.2
Rocks 5.1
libtool 1.5.22

FWIW, the version of Libtool that you have installed on your system is likely not too important here. Open MPI tarballs come bootstrapped with the Libtool that we used to build the tarball -- *that* included Libtool is used to build Open MPI, not the one installed on your system. We use Libtool 2.2.6a to build Open MPI v1.3.

-- Jeff Squyres
Cisco Systems
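As an illustrative check of which Libtool a given tarball was bootstrapped with, something like this should work from the top-level build directory after configure has run (the directory name and configure options are placeholders):

  cd openmpi-1.3
  ./configure [your usual options]
  ./libtool --version    # reports the bundled Libtool, e.g. 2.2.6a for the 1.3 tarball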
Re: [OMPI users] libnuma under ompi 1.3
Terry Frankcombe wrote:
> Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I
> merrily went off to compile my application.
>
> In my final link with mpif90 I get the error:
>
> /usr/bin/ld: cannot find -lnuma
>
> Adding --showme reveals that
>
> -I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib
>
> is added to the compile early in the aggregated ifort command, and
>
> -L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte
> -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
>
> is added to the end.
>
> I note than when compiling Open MPI -lnuma was visible in the gcc
> arguments, with no added -L.
>
> On this system libnuma.so exists in /usr/lib64. My (somewhat long!)
> configure command was
>
> ./configure --enable-static --disable-shared
> --prefix=/home/terry/bin/Local --enable-picky --disable-heterogeneous
> --without-slurm --without-alps --without-xgrid --without-sge
> --without-loadleveler --without-lsf F77=ifort
>
> Should mpif90 have bundled a -L/usr/lib64 in there somewhere?
>
> Regards
> Terry

I had the same exact problem with my PGI compilers (no problems reported yet with my Intel compilers). I have a fix for you.

You would think that the compiler would automatically look in /usr/lib64, since that's one of the system's default lib directories, but the PGI compilers don't for some reason. A quick fix is to do

OMPI_LDFLAGS="-L/usr/lib64"

or

OMPI_MPIF90_LDFLAGS="-L/usr/lib64"

A more permanent fix is to edit INSTALL_DIR/share/openmpi/mpif90-wrapper-data.txt and change

linker_flags=

to

linker_flags=-L/usr/lib64

In my case, I also had to add the OpenMPI lib directory for the PGI compilers, too. You may or may not need to add them, too:

linker_flags=-L/usr/lib64 -L/usr/local/openmpi/pgi/x86_64/lib

You may want to test all your compilers and make a similar change to all your *wrapper-data.txt files.

Not sure if this is a problem with the compilers not picking up the system's lib dirs, or an OpenMPI configuration/build problem.

-- Prentice
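A small sketch of verifying the wrapper change described above (the install path is illustrative):

  # in INSTALL_DIR/share/openmpi/mpif90-wrapper-data.txt, change
  #   linker_flags=
  # to
  #   linker_flags=-L/usr/lib64

  # then confirm the flag now appears in what the wrapper passes to the underlying compiler
  mpif90 --showme:link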
Re: [OMPI users] libnuma under ompi 1.3
Jeff, See my reply to Dr. Frankcombe's original e-mail. I've experienced this same problem with the PGI compilers, so this isn't limited to just the Intel compilers. I provided a fix, but I think OpenMPI should be able to figure out and add the correct linker flags during the configuration/build stage. Jeff Squyres wrote: > Hmm; that's odd. > > Is icc / icpc able to find libnuma with no -L, but ifort is unable to > find it without a -L? > > On Mar 3, 2009, at 10:00 PM, Terry Frankcombe wrote: > >> Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I >> merrily went off to compile my application. >> >> In my final link with mpif90 I get the error: >> >> /usr/bin/ld: cannot find -lnuma >> >> Adding --showme reveals that >> >> -I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib >> >> is added to the compile early in the aggregated ifort command, and >> >> -L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte >> -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl >> >> is added to the end. >> >> I note than when compiling Open MPI -lnuma was visible in the gcc >> arguments, with no added -L. >> >> On this system libnuma.so exists in /usr/lib64. My (somewhat long!) >> configure command was >> >> ./configure --enable-static --disable-shared >> --prefix=/home/terry/bin/Local --enable-picky --disable-heterogeneous >> --without-slurm --without-alps --without-xgrid --without-sge >> --without-loadleveler --without-lsf F77=ifort >> >> >> Should mpif90 have bundled a -L/usr/lib64 in there somewhere? >> >> Regards >> Terry >> >> >> -- >> Dr. Terry Frankcombe >> Research School of Chemistry, Australian National University >> Ph: (+61) 0417 163 509Skype: terry.frankcombe > -- Prentice
Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit
Your Intel processors are I assume not the new Nehalem/I7 ones? The older quad-core ones are seriously memory bandwidth limited when running a memory intensive application. That might explain why using all 8 cores per node slows down your calculation. Why do you get such a difference between cpu time and elapsed time? Is your code doing any file IO or maybe waiting for one of the processors? Do you use non-blocking communication wherever possible? Regards, Mattijs On Wednesday 04 March 2009 05:46, Sangamesh B wrote: > Hi all, > > Now LAM-MPI is also installed and tested the fortran application by > running with LAM-MPI. > > But LAM-MPI is performing still worse than Open MPI > > No of nodes:3 cores per node:8 total core: 3*8=24 > >CPU TIME :1 HOURS 51 MINUTES 23.49 SECONDS >ELAPSED TIME :7 HOURS 28 MINUTES 2.23 SECONDS > > No of nodes:6 cores used per node:4 total core: 6*4=24 > >CPU TIME :0 HOURS 51 MINUTES 50.41 SECONDS >ELAPSED TIME :6 HOURS 6 MINUTES 38.67 SECONDS > > Any help/suggetsions to diagnose this problem. > > Thanks, > Sangamesh > > On Wed, Feb 25, 2009 at 12:51 PM, Sangamesh B wrote: > > Dear All, > > > > A fortran application is installed with Open MPI-1.3 + Intel > > compilers on a Rocks-4.3 cluster with Intel Xeon Dual socket Quad core > > processor @ 3GHz (8cores/node). > > > > The time consumed for different tests over a Gigabit connected > > nodes are as follows: (Each node has 8 GB memory). > > > > No of Nodes used:6 No of cores used/node:4 total mpi processes:24 > > CPU TIME : 1 HOURS 19 MINUTES 14.39 SECONDS > > ELAPSED TIME : 2 HOURS 41 MINUTES 8.55 SECONDS > > > > No of Nodes used:6 No of cores used/node:8 total mpi processes:48 > > CPU TIME : 4 HOURS 19 MINUTES 19.29 SECONDS > > ELAPSED TIME : 9 HOURS 15 MINUTES 46.39 SECONDS > > > > No of Nodes used:3 No of cores used/node:8 total mpi processes:24 > > CPU TIME : 2 HOURS 41 MINUTES 27.98 SECONDS > > ELAPSED TIME : 4 HOURS 21 MINUTES 0.24 SECONDS > > > > But the same application performs well on another Linux cluster with > > LAM-MPI-7.1.3 > > > > No of Nodes used:6 No of cores used/node:4 total mpi processes:24 > > CPU TIME : 1hours:30min:37.25s > > ELAPSED TIME 1hours:51min:10.00S > > > > No of Nodes used:12 No of cores used/node:4 total mpi processes:48 > > CPU TIME : 0hours:46min:13.98s > > ELAPSED TIME 1hours:02min:26.11s > > > > No of Nodes used:6 No of cores used/node:8 total mpi processes:48 > > CPU TIME : 1hours:13min:09.17s > > ELAPSED TIME 1hours:47min:14.04s > > > > So there is a huge difference between CPU TIME & ELAPSED TIME for Open > > MPI jobs. > > > > Note: On the same cluster Open MPI gives better performance for > > inifiniband nodes. > > > > What could be the problem for Open MPI over Gigabit? > > Any flags need to be used? > > Or is it not that good to use Open MPI on Gigabit? > > > > Thanks, > > Sangamesh > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Mattijs Janssens OpenCFD Ltd. 9 Albert Road, Caversham, Reading RG4 7AN. Tel: +44 (0)118 9471030 Email: m.janss...@opencfd.co.uk URL: http://www.OpenCFD.co.uk
Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit
It would also help to have some idea how you installed and ran this - e.g., did you set mpi_paffinity_alone so that the processes would bind to their processors? That could explain the cpu vs. elapsed time since it helps the processes from being swapped out as much. Ralph > Your Intel processors are I assume not the new Nehalem/I7 ones? The older > quad-core ones are seriously memory bandwidth limited when running a > memory > intensive application. That might explain why using all 8 cores per node > slows down your calculation. > > Why do you get such a difference between cpu time and elapsed time? Is > your > code doing any file IO or maybe waiting for one of the processors? Do you > use > non-blocking communication wherever possible? > > Regards, > > Mattijs > > On Wednesday 04 March 2009 05:46, Sangamesh B wrote: >> Hi all, >> >> Now LAM-MPI is also installed and tested the fortran application by >> running with LAM-MPI. >> >> But LAM-MPI is performing still worse than Open MPI >> >> No of nodes:3 cores per node:8 total core: 3*8=24 >> >>CPU TIME :1 HOURS 51 MINUTES 23.49 SECONDS >>ELAPSED TIME :7 HOURS 28 MINUTES 2.23 SECONDS >> >> No of nodes:6 cores used per node:4 total core: 6*4=24 >> >>CPU TIME :0 HOURS 51 MINUTES 50.41 SECONDS >>ELAPSED TIME :6 HOURS 6 MINUTES 38.67 SECONDS >> >> Any help/suggetsions to diagnose this problem. >> >> Thanks, >> Sangamesh >> >> On Wed, Feb 25, 2009 at 12:51 PM, Sangamesh B >> wrote: >> > Dear All, >> > >> > A fortran application is installed with Open MPI-1.3 + Intel >> > compilers on a Rocks-4.3 cluster with Intel Xeon Dual socket Quad core >> > processor @ 3GHz (8cores/node). >> > >> > The time consumed for different tests over a Gigabit connected >> > nodes are as follows: (Each node has 8 GB memory). >> > >> > No of Nodes used:6 No of cores used/node:4 total mpi processes:24 >> > CPU TIME : 1 HOURS 19 MINUTES 14.39 SECONDS >> > ELAPSED TIME : 2 HOURS 41 MINUTES 8.55 SECONDS >> > >> > No of Nodes used:6 No of cores used/node:8 total mpi processes:48 >> > CPU TIME : 4 HOURS 19 MINUTES 19.29 SECONDS >> > ELAPSED TIME : 9 HOURS 15 MINUTES 46.39 SECONDS >> > >> > No of Nodes used:3 No of cores used/node:8 total mpi processes:24 >> > CPU TIME : 2 HOURS 41 MINUTES 27.98 SECONDS >> > ELAPSED TIME : 4 HOURS 21 MINUTES 0.24 SECONDS >> > >> > But the same application performs well on another Linux cluster with >> > LAM-MPI-7.1.3 >> > >> > No of Nodes used:6 No of cores used/node:4 total mpi processes:24 >> > CPU TIME : 1hours:30min:37.25s >> > ELAPSED TIME 1hours:51min:10.00S >> > >> > No of Nodes used:12 No of cores used/node:4 total mpi processes:48 >> > CPU TIME : 0hours:46min:13.98s >> > ELAPSED TIME 1hours:02min:26.11s >> > >> > No of Nodes used:6 No of cores used/node:8 total mpi processes:48 >> > CPU TIME : 1hours:13min:09.17s >> > ELAPSED TIME 1hours:47min:14.04s >> > >> > So there is a huge difference between CPU TIME & ELAPSED TIME for Open >> > MPI jobs. >> > >> > Note: On the same cluster Open MPI gives better performance for >> > inifiniband nodes. >> > >> > What could be the problem for Open MPI over Gigabit? >> > Any flags need to be used? >> > Or is it not that good to use Open MPI on Gigabit? >> > >> > Thanks, >> > Sangamesh >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > > Mattijs Janssens > > OpenCFD Ltd. > 9 Albert Road, > Caversham, > Reading RG4 7AN. 
> Tel: +44 (0)118 9471030 > Email: m.janss...@opencfd.co.uk > URL: http://www.OpenCFD.co.uk > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
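A minimal sketch of the binding suggestion above, with the process count, hostfile and program name as placeholders:

  # bind each MPI process to a processor for the whole run
  mpirun --mca mpi_paffinity_alone 1 -np 24 --hostfile hosts ./app

  # or set it once for all jobs in ~/.openmpi/mca-params.conf:
  #   mpi_paffinity_alone = 1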
Re: [OMPI users] metahosts (like in MP-MPICH)
Jeff Squyres wrote:

I'm not quite sure what an MP-MPICH meta host is. Open MPI allows you to specify multiple hosts in a hostfile and run a single MPI job across all of them, assuming they're connected by at least some common TCP network.

What I need is one MPI job put for distributed computation on several actual machines, connected by TCP/IP (so, kind of cluster computation). Machines may have heterogeneous OSes on them (MP-MPICH accounts for that with its HETERO option).

I'm somewhat new to MPI. It's possible that what I describe is an inherent option of MPI implementations. Please advise.

--
Re: [OMPI users] libnuma under ompi 1.3
Problem is that some systems install both 32 and 64 bit support, and build OMPI both ways. So we really can't just figure it out without some help. At our location, we simply take care to specify the -L flag to point to the correct version so we avoid any confusion. On Mar 4, 2009, at 8:37 AM, Prentice Bisbal wrote: Jeff, See my reply to Dr. Frankcombe's original e-mail. I've experienced this same problem with the PGI compilers, so this isn't limited to just the Intel compilers. I provided a fix, but I think OpenMPI should be able to figure out and add the correct linker flags during the configuration/build stage. Jeff Squyres wrote: Hmm; that's odd. Is icc / icpc able to find libnuma with no -L, but ifort is unable to find it without a -L? On Mar 3, 2009, at 10:00 PM, Terry Frankcombe wrote: Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I merrily went off to compile my application. In my final link with mpif90 I get the error: /usr/bin/ld: cannot find -lnuma Adding --showme reveals that -I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib is added to the compile early in the aggregated ifort command, and -L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl is added to the end. I note than when compiling Open MPI -lnuma was visible in the gcc arguments, with no added -L. On this system libnuma.so exists in /usr/lib64. My (somewhat long!) configure command was ./configure --enable-static --disable-shared --prefix=/home/terry/bin/Local --enable-picky --disable- heterogeneous --without-slurm --without-alps --without-xgrid --without-sge --without-loadleveler --without-lsf F77=ifort Should mpif90 have bundled a -L/usr/lib64 in there somewhere? Regards Terry -- Dr. Terry Frankcombe Research School of Chemistry, Australian National University Ph: (+61) 0417 163 509Skype: terry.frankcombe -- Prentice ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] metahosts (like in MP-MPICH)
On Mar 4, 2009, at 11:38 AM, Yury Tarasievich wrote:

I'm not quite sure what an MP-MPICH meta host is. Open MPI allows you to specify multiple hosts in a hostfile and run a single MPI job across all of them, assuming they're connected by at least some common TCP network.

What I need is one MPI job put for distributed computation on several actual machines, connected by TCP/IP (so, kind of cluster computation). Machines may have heterogeneous OSes on them (MP-MPICH accounts for that with its HETERO option). I'm somewhat new to MPI. It's possible that what I describe is an inherent option of MPI implementations. Please advise.

Yes, pretty much all MPI implementations support a single job spanning multiple hosts. Open MPI also supports heterogeneity of data representation if you use the --enable-heterogeneous flag to OMPI's configure.

In general, you need both OMPI and your application compiled natively for each platform. One easy way to do this is to install Open MPI locally on each node in the same filesystem location (e.g., /opt/openmpi-). You also want exactly the same version of Open MPI on all nodes.

-- Jeff Squyres
Cisco Systems
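An illustrative sketch of that recipe; the version number, prefix and application name are placeholders, and the same steps would be repeated natively on every machine:

  ./configure --prefix=/opt/openmpi-1.3 --enable-heterogeneous
  make all install

  # build the application with the matching wrapper on each platform
  /opt/openmpi-1.3/bin/mpicc -o my_app my_app.c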
Re: [OMPI users] metahosts (like in MP-MPICH)
Jeff Squyres wrote: ... In general, you need both OMPI and your application compiled natively for each platform. One easy way to do this is to install Open MPI locally on each node in the same filesystem location (e.g., /opt/openmpi-). You also want exactly the same version of Open MPI on all nodes. Thanks for the tip, I'll try this! --
Re: [OMPI users] libnuma under ompi 1.3
Terry Frankcombe wrote:

Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I merrily went off to compile my application. In my final link with mpif90 I get the error:

/usr/bin/ld: cannot find -lnuma

Adding --showme reveals that

-I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib

is added to the compile early in the aggregated ifort command, and

-L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

is added to the end. I note than when compiling Open MPI -lnuma was visible in the gcc arguments, with no added -L. On this system libnuma.so exists in /usr/lib64. My (somewhat long!) configure command was

You shouldn't have to. The runtime loader should look inside of /usr/lib64 by itself. Unless of course, you've built either your application or OpenMPI using a 32-bit Intel compiler instead (say fc instead of fce). In that case the runtime loader would look inside of /usr/lib to find libnuma, rather than /usr/lib64. Are you sure you are using the 64-bit version of the Intel compiler?

If you intend to use the 32-bit version of the compiler, and OpenMPI is 32-bit, you may just need to install the numactl.i386 and numactl.x86_64 RPMS.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing
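One quick way to check which word size is actually in play, assuming the usual CentOS/SUSE library locations (the libmpi.a path below is taken from the configure prefix quoted earlier in this thread, and a.out stands for the linked executable):

  file /usr/lib64/libnuma.so.1             # is a 64-bit libnuma installed?
  file /home/terry/bin/Local/lib/libmpi.a  # was Open MPI itself built 64-bit?
  file ./a.out                             # did the link produce a 64-bit executable?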
Re: [OMPI users] openib RETRY EXCEEDED ERROR
On Mar 1, 2009, at 7:24 PM, Brett Pemberton wrote: I'd appreciate some advice on if I'm using OFED correctly. I'm running OFED 1.4, however not the kernel modules, just userland. Is this a bad idea? I believe so. I'm not a kernel guy, but I've always used the userland bits matched with the corresponding kernel bits. If nothing else, getting them to match would eliminate one possible source of errors. Basically, I recompile the ofed src.rpms for: dapl, libibcm, libibcommon, libibmad, libibumad, libibverbs, libmthca, librdmacm, libsdp, mstflint And install onto CentOS, upgrading the in-distro versions. Should I also be compiling ofa_kernel ? Could this be causing problems ? ...could be? I don't really know. That would be a better question for the gene...@lists.openfabrics.org list. As explained off-list, I'm running the most recent firmware for my cards, although the release is quite old: hca_id: mthca0 fw_ver: 1.2.0 I *believe* that's fairly ancient. You might want to check the support Mellanox web site and see if there's anything more recent for your HCA. -- Jeff Squyres Cisco Systems
Re: [OMPI users] threading bug?
On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:

I am using intel lc_prof-11 (and its own mkl) and have built openmpi-1.3.1 with configure options: "FC=ifort F77=ifort CC=icc CXX=icpc". Then I have built my application. The linux box is 2Xamd64 quad. In the middle of running my application (after some 15 iterations), I receive the message and it stops. I tried to configure openmpi using "--disable-mpi-threads" but it automatically assumes "posix".

This doesn't sound like a threading problem, thankfully. Open MPI has two levels of threading issues:

- whether MPI_THREAD_MULTIPLE is supported or not (which is what --enable|disable-mpi-threads does)
- whether thread support is present at all on the system (e.g., solaris or posix threads)

You see "posix" in the configure output mainly because OMPI still detects that posix threads are available on the system. It doesn't necessarily mean that threads will be used in your application's run.

This problem does not happen in openmpi-1.2.9. Any comment is highly appreciated. Best regards, mahmoud payami

[hpc1:25353] *** Process received signal ***
[hpc1:25353] Signal: Segmentation fault (11)
[hpc1:25353] Signal code: Address not mapped (1)
[hpc1:25353] Failing at address: 0x51
[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e350d96]
[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e3514a8]
[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2eb7c72a]
[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2b42b7d9]
[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e34d27c]
[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2af46010]
[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2acd6af4]
[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
[hpc1:25353] *** End of error message ***
--
mpirun noticed that process rank 6 with PID 25353 on node hpc1 exited on signal 11 (Segmentation fault).
--

What this stack trace tells us is that Open MPI crashed somewhere while trying to use shared memory for message passing, but it doesn't really tell us much else. It's not clear, either, whether this is OMPI's fault or your app's fault (or something else). Can you run your application through a memory-checking debugger to see if anything obvious pops out?

-- Jeff Squyres
Cisco Systems
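A minimal sketch of the memory-checker suggestion, assuming Valgrind is available; the rank count is a placeholder, pw.x is the binary from the trace above, and the run will be far slower than normal:

  # %p puts each process's report in its own file (Valgrind 3.3 or later)
  mpirun -np 8 valgrind --log-file=vg.out.%p /opt/QE131_cc/bin/pw.x   # plus the input arguments the job normally uses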
Re: [OMPI users] mpirun problem
Sorry for the delay in replying; the usual INBOX deluge keeps me from being timely in replying to all mails... More below.

On Feb 24, 2009, at 6:52 AM, Jovana Knezevic wrote:

I'm new to MPI, so I'm going to explain my problem in detail. I'm trying to compile a simple application using mpicc (on SUSE 10.0) and run it - compilation passes well, but mpirun is the problem. So, let's say the program is called 1.c, I tried the following:

mpicc -o 1 1.c

(and, just for the case, after problems with mpirun, I tried the following, too)

mpicc --showme:compile
mpicc --showme:link
mpicc -I/include -pthread 1.c -pthread -I/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl -o 1

Both versions (with or without flags) produced executables as expected (so, when I write ./1 it executes in the expected manner),

Good.

but when I try this: mpirun -np 4 ./1, it terminates giving the following message:

ssh: (none): Name or service not known
--
A daemon (pid 6877) died unexpectedly with status 255 while attempting to launch so we are aborting.

That's fun; it seems like OMPI is not recognizing localhost properly. Can you use the --debug-daemons and --leave-session-attached options to mpirun and see what output you get?

-- Jeff Squyres
Cisco Systems
Re: [OMPI users] mpirun problem
I suppose one initial question is: what version of Open MPI are you running? OMPI 1.3 should not be attempting to ssh a daemon on a local job like this - OMPI 1.2 -will-, so it is important to know which one we are talking about. Just do "mpirun --version" and it should tell you. Ralph On Mar 4, 2009, at 1:09 PM, Jeff Squyres wrote: Sorry for the delay in replying; the usual INBOX deluge keeps me from being timely in replying to all mails... More below. On Feb 24, 2009, at 6:52 AM, Jovana Knezevic wrote: I'm new to MPI, so I'm going to explain my problem in detail I'm trying to compile a simple application using mpicc (on SUSE 10.0) and run it - compilation passes well, but mpirun is the problem. So, let's say the program is called 1.c, I tried the following: mpicc -o 1 1.c (and, just for the case, after problems with mpirun, I tried the following, too) mpicc --showme:compile mpicc --showme:link mpicc -I/include -pthread 1.c -pthread -I/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl -o 1 Both versions (wih or without flags) produced executables as expected (so, when I write: ./1 it executes in expected manner), Good. but when I try this: mpirun -np 4 ./1, it terminates giving the following message: ssh: (none): Name or service not known -- A daemon (pid 6877) died unexpectedly with status 255 while attempting to launch so we are aborting. That's fun; it seems like OMPI is not recognizing localhost properly. Can you use the --debug-daemons and --leave-session-attached options to mpirun and see what output you get? -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
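For reference, the two checks suggested in this thread look like the following; ./1 is the poster's example binary:

  mpirun --version
  mpirun --debug-daemons --leave-session-attached -np 4 ./1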
[OMPI users] RETRY EXCEEDED ERROR
I found several reports on the openmpi users mailing list from users who need to bump up the default value for btl_openib_ib_timeout. We also have some applications on our cluster that have problems unless we raise this value from the default 10 to 15:

[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 250450816 opcode 11048 qp_idx 3

This is seen with OpenMPI 1.3 and OpenFabrics 1.4. Is this normal or is it an indicator of other problems, maybe related to hardware? Are there other parameters that need to be looked at too?

Thanks for any insight on this!

Regards,
Jan Lindheim
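For anyone searching the archives, a sketch of how such a value is typically raised; 15 is the value mentioned above, and the process count and program name are placeholders:

  # per job, on the mpirun command line
  mpirun --mca btl_openib_ib_timeout 15 -np 64 ./app

  # or cluster-wide in <prefix>/etc/openmpi-mca-params.conf:
  #   btl_openib_ib_timeout = 15

  # check the current value and its description
  ompi_info --param btl openib | grep ib_timeout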
Re: [OMPI users] Gamess with openmpi
Sorry for the delay in replying -- INBOX deluge makes me miss emails on the users list sometimes. I'm unfortunately not familiar with gamess -- have you checked with their support lists or documentation? Note that Open MPI's IB progression engine will spin hard to make progress for message passing. Specifically, if you have processes that are "blocking" in message passing calls, those processes will actually be spinning trying to make progress (vs. actually blocking in the kernel). So if you overload your hosts -- meaning that you run more Open MPI jobs than there are cores -- you could well experience dramatic slowdown in overall performance because every MPI job will be competing for CPU cycles. On Feb 24, 2009, at 4:57 AM, Thomas Exner wrote: Dear all: Because I am new to this list, I would like to introduce myself as Thomas Exner and please excuse silly questions, because I am only a chemist. And now my problem, with which I am fiddling around for almost a week: I try to use gamess with openmpi on infiniband. There is a good description on how to compile it with mpi and it can be done, even if it is not easy. But then on run time everything gets weird. The specialty of gamess is that it runs twice as much mpi jobs than used for the computation. The second half is used as data server, requiring data but with very little cpu load. Each one of these data servers is connected to a specific compute job. Therefore, these two corresponding jobs have to be run on the same node. On one node everything is fine (2x4core machines in my case), because all the jobs are guarantied to run on this node. If I try two nodes, at the beginning also everything is fine. 8 compute jobs and 8 data server are running on each machine. But after a short while, the entire set of processes (16) on the first node start to accumulate CPU time, with nothing useful happening. The second node's processes go entirely to sleep. Is it possible that all the compute jobs are for some reason been transfered to the first node? This would explain the load of 16 on the first and 0 on the second node, because 16 compute jobs (100 % cpu load) and 16 data servers (almost 0% load) are running, respectively. Strange thing is also that the same version runs on gigabit and myrinet fine. It would be great if somebody could help me on that. If you need more information, I will be happy to share them with you. Thanks very much. Thomas ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
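The oversubscription point above can be sketched with a hostfile that declares no more slots than physical cores; the hostnames are placeholders, and GAMESS's compute/data-server pairing adds its own placement constraints on top of this:

  # hostfile: 8-core nodes, so at most 8 slots each
  node01 slots=8
  node02 slots=8

  # 16 slots total, so do not start more than 16 processes in total
  mpirun --hostfile hostfile -np 16 ./a.out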
Re: [OMPI users] Bug reporting [was: OpenMPI 1.3]
Sorry for the delay; a bunch of higher priority stuff got in the way of finishing this thread. Anyhoo... On Feb 24, 2009, at 4:24 AM, Olaf Lenz wrote: I think it would be also sufficient to place a short text and link to the Trac page, so that the developers that want to use the "Bug Tracking" link to get to Trac do not have to click once more, but if this is OK for you, then it's fine. This is ok for us. I think most of us have trac either bookmarked or sufficiently active in our history such that Firefox's awesome bar (or equivalent in other browsers) just pick it up and go directly there without going through the "Bug tracking" link on the OMPI web site. Another option would be to simply give the link a less alluring name, like "Trac Bug Tracking System" or "Issue Tracker", or just "Trac". Should bugs really be reported directly to the developer's list (as stated in 1. on the new page)? Or to the user's mailing list if they are not sure that it is a bug? Good point; I softened the language in that first bullet. -- Jeff Squyres Cisco Systems
Re: [OMPI users] RETRY EXCEEDED ERROR
This *usually* indicates a physical / layer 0 problem in your IB fabric. You should do a diagnostic on your HCAs, cables, and switches. Increasing the timeout value should only be necessary on very large IB fabrics and/or very congested networks. On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote: I found several reports on the openmpi users mailing list from users, who need to bump up the default value for btl_openib_ib_timeout. We also have some applications on our cluster, that have problems, unless we set this value from the default 10 to 15: [24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 250450816 opcode 11048 qp_idx 3 This is seen with OpenMPI 1.3 and OpenFabrics 1.4. Is this normal or is it an indicator of other problems, maybe related to hardware? Are there other parameters that need to be looked at too? Thanks for any insight on this! Regards, Jan Lindheim ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
Re: [OMPI users] RETRY EXCEEDED ERROR
On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > This *usually* indicates a physical / layer 0 problem in your IB > fabric. You should do a diagnostic on your HCAs, cables, and switches. > > Increasing the timeout value should only be necessary on very large IB > fabrics and/or very congested networks. Thanks Jeff! What is considered to be very large IB fabrics? I assume that with just over 180 compute nodes, our cluster does not fall into this category. Jan > > > On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote: > > >I found several reports on the openmpi users mailing list from users, > >who need to bump up the default value for btl_openib_ib_timeout. > >We also have some applications on our cluster, that have problems, > >unless we set this value from the default 10 to 15: > > > >[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 > >to: shc175 > >error polling LP CQ with status RETRY EXCEEDED ERROR status number > >12 for > >wr_id 250450816 opcode 11048 qp_idx 3 > > > >This is seen with OpenMPI 1.3 and OpenFabrics 1.4. > > > >Is this normal or is it an indicator of other problems, maybe > >related to > >hardware? > >Are there other parameters that need to be looked at too? > > > >Thanks for any insight on this! > > > >Regards, > >Jan Lindheim > >___ > >users mailing list > >us...@open-mpi.org > >http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > Cisco Systems > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] RETRY EXCEEDED ERROR
On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote: On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > This *usually* indicates a physical / layer 0 problem in your IB > fabric. You should do a diagnostic on your HCAs, cables, and switches. > > Increasing the timeout value should only be necessary on very large IB > fabrics and/or very congested networks. Thanks Jeff! What is considered to be very large IB fabrics? I assume that with just over 180 compute nodes, our cluster does not fall into this category. I was a little misleading in my note -- I should clarify. It's really congestion that matters, not the size of the fabric. Congestion is potentially more likely to happen in larger fabrics, since packets may have to flow through more switches, there's likely more apps running on the cluster, etc. But it's all very application/cluster-specific; only you can know if your fabric is heavily congested based on what you run on it, etc. -- Jeff Squyres Cisco Systems
Re: [OMPI users] RETRY EXCEEDED ERROR
On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote: > On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote: > > >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > >> This *usually* indicates a physical / layer 0 problem in your IB > >> fabric. You should do a diagnostic on your HCAs, cables, and > >switches. > >> > >> Increasing the timeout value should only be necessary on very > >large IB > >> fabrics and/or very congested networks. > > > >Thanks Jeff! > >What is considered to be very large IB fabrics? > >I assume that with just over 180 compute nodes, > >our cluster does not fall into this category. > > > > I was a little misleading in my note -- I should clarify. It's really > congestion that matters, not the size of the fabric. Congestion is > potentially more likely to happen in larger fabrics, since packets may > have to flow through more switches, there's likely more apps running > on the cluster, etc. But it's all very application/cluster-specific; > only you can know if your fabric is heavily congested based on what > you run on it, etc. > > -- > Jeff Squyres > Cisco Systems > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > Thanks again Jeff! Time to dig up diagnostics tools and look at port statistics. Jan
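The kind of port-statistics check being referred to is usually done with the standard OFED diagnostics, for example the tools below; exact availability depends on which OFED packages are installed:

  ibdiagnet       # fabric-wide sweep that reports bad links and ports
  perfquery       # error and traffic counters for the local HCA port
  ibcheckerrors   # walks the fabric and flags ports whose error counters exceed thresholds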
Re: [OMPI users] libnuma under ompi 1.3
Thanks to everyone who contributed. I no longer think this is Open MPI's problem. This system is just stupid. Everything's 64 bit (which various probes with file confirm). There's no icc, so I can't test with that. gcc finds libnuma without -L. (Though a simple gcc -lnuma -Wl,-t reports that libnuma is found through the rather convoluted path /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.4/../../../../lib64/libnuma.so.) ifort -lnuma can't find libnuma.so, but then ifort -L/usr/lib64 -lnuma can't find it either! While everything else points to some mix up with linking search paths, that last result confuses me greatly. (Unless there's some subtlety with libnuma.so being a link to libnuma.so.1.) I can compile my app by replicating mpif90's --showme output directly on the command line, with -lnuma replaced explicitly with /usr/lib64/libnuma.so. Then, even though I've told ifort -static, ldd gives the three lines: libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x2b3f58a3c000) libc.so.6 => /lib64/tls/libc.so.6 (0x2b3f58b42000) /lib/ld64.so.1 => /lib/ld64.so.1 (0x2b3f58925000) While I don't understand what's going on here, I now have a working binary. It's the only app I use on this machine, so I'm no longer concerned. All other machines on which I use Open MPI work as expected out of the box. My workaround here is sufficient. Once more, thanks for the suggestions. I think this machine is just pathological. Ciao Terry
Re: [OMPI users] libnuma under ompi 1.3
Terry,

Is there a libnuma.a on your system? If not, the -static flag to ifort won't do anything because there isn't a static library for it to link against.

Doug Reeder

On Mar 4, 2009, at 6:06 PM, Terry Frankcombe wrote:

Thanks to everyone who contributed. I no longer think this is Open MPI's problem. This system is just stupid.

Everything's 64 bit (which various probes with file confirm). There's no icc, so I can't test with that. gcc finds libnuma without -L. (Though a simple gcc -lnuma -Wl,-t reports that libnuma is found through the rather convoluted path /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.4/../../../../lib64/libnuma.so.) ifort -lnuma can't find libnuma.so, but then ifort -L/usr/lib64 -lnuma can't find it either! While everything else points to some mix up with linking search paths, that last result confuses me greatly. (Unless there's some subtlety with libnuma.so being a link to libnuma.so.1.)

I can compile my app by replicating mpif90's --showme output directly on the command line, with -lnuma replaced explicitly with /usr/lib64/libnuma.so. Then, even though I've told ifort -static, ldd gives the three lines:

libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x2b3f58a3c000)
libc.so.6 => /lib64/tls/libc.so.6 (0x2b3f58b42000)
/lib/ld64.so.1 => /lib/ld64.so.1 (0x2b3f58925000)

While I don't understand what's going on here, I now have a working binary. It's the only app I use on this machine, so I'm no longer concerned. All other machines on which I use Open MPI work as expected out of the box. My workaround here is sufficient.

Once more, thanks for the suggestions. I think this machine is just pathological.

Ciao
Terry

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] mlx4 error - looking for guidance
Evening everyone,

I'm running a CFD code on IB and I've encountered an error I'm not sure about, and I'm looking for some guidance on where to start looking. Here's the error:

mlx4: local QP operation err (QPN 260092, WQE index 9a9e, vendor syndrome 6f, opcode = 5e)
[0,1,6][btl_openib_component.c:1392:btl_openib_component_progress] from compute-2-0.local to: compute-2-0.local error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 37742320 opcode 0
mpirun noticed that job rank 0 with PID 21220 on node compute-2-0.local exited on signal 15 (Terminated).
78 additional processes aborted (not shown)

This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The code works correctly for smaller cases, but when I run larger cases I get this error. I'm heading to bed but I'll check email tomorrow (sorry to sleep and run, but it's been a long day).

TIA!
Jeff