Help : unsubscribe
-------------------------------------------- On Fri, 20/9/13, users-requ...@open-mpi.org <users-requ...@open-mpi.org> wrote:

Subject: users Digest, Vol 2685, Issue 2
To: us...@open-mpi.org
Date: Friday, 20 September, 2013, 11:00 AM

Send users mailing list submissions to
	us...@open-mpi.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.open-mpi.org/mailman/listinfo.cgi/users
or, via email, send a message with subject or body 'help' to
	users-requ...@open-mpi.org

You can reach the person managing the list at
	users-ow...@open-mpi.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."

Today's Topics:

   1. error building openmpi-1.7.3a1r29213 on Solaris (Siegmar Gross)
   2. intermittent node file error running with torque/maui integration (Noam Bernstein)
   3. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
   4. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
   5. Re: intermittent node file error running with torque/maui integration (Reuti)
   6. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
   7. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
   8. Debugging Runtime/Ethernet Problems (Lloyd Brown)
   9. Re: Debugging Runtime/Ethernet Problems (Elken, Tom)
  10. Re: Debugging Runtime/Ethernet Problems (Ralph Castain)
  11. Re: compilation aborted for Handler.cpp (code 2) (Jeff Squyres (jsquyres))
  12. Re: Debugging Runtime/Ethernet Problems (Jeff Squyres (jsquyres))
  13. Re: compilation aborted for Handler.cpp (code 2) (Jeff Squyres (jsquyres))
  14. Re: intermittent node file error running with torque/maui integration (Gus Correa)
  15. Re: error building openmpi-1.7.3a1r29213 on Solaris (Jeff Squyres (jsquyres))

----------------------------------------------------------------------

Message: 1
Date: Fri, 20 Sep 2013 13:00:41 +0200 (CEST)
From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
To: us...@open-mpi.org
Subject: [OMPI users] error building openmpi-1.7.3a1r29213 on Solaris
Message-ID: <201309201100.r8kb0ftr022...@tyr.informatik.hs-fulda.de>
Content-Type: TEXT/plain; charset=us-ascii

Hi,

I tried to install openmpi-1.7.3a1r29213 on "openSuSE Linux 12.1",
"Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" and
gcc-4.8.0 in 64-bit mode. Unfortunately "make" breaks with the same
error for both compilers on both Solaris platforms.

tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_cc 126 tail -10 \
  log.make.SunOS.sparc.64_cc
Making all in mca/if/posix_ipv4
make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
  CC       if_posix.lo
"../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c",
line 277: undefined struct/union member: ifr_mtu
cc: acomp failed for ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `.../opal'
make: *** [all-recursive] Error 1

tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_gcc 131 tail -12 \
  log.make.SunOS.sparc.64_gcc
Making all in mca/if/posix_ipv4
make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
  CC       if_posix.lo
../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:
In function 'if_posix_open':
../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:277:31:
error: 'struct ifreq' has no member named 'ifr_mtu'
     intf->if_mtu = ifr->ifr_mtu;
                         ^
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `.../opal'
make: *** [all-recursive] Error 1

I have had this problem before and Jeff solved it. Here is my old e-mail.

Date: Tue, 7 May 2013 19:38:11 +0200 (CEST)
From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
Subject: Re: commit/ompi-java: jsquyres: Up to SVN r28392
To: jsquy...@cisco.com
Cc: siegmar.gr...@informatik.hs-fulda.de
MIME-Version: 1.0
Content-MD5: O1pjPK/1JiMXXZ/EHyMU0Q==
X-HRZ-JLUG-MailScanner-Information: Passed JLUG virus check
X-HRZ-JLUG-MailScanner: No virus found
X-Envelope-From: fd1...@tyr.informatik.hs-fulda.de
X-Spam-Status: No

Hello Jeff

> Ok, I made a change in the OMPI trunk that should fix this:
>
>     https://svn.open-mpi.org/trac/ompi/changeset/28460
>
> And I pulled it into the ompi-java hg repo. Could you give
> it a whirl and let me know if this works for you?

Perfect :-)))). Now I can build Open MPI on Solaris without "#if 0" :-).
Thank you very much for your help. "make check" still produces the old
bus error on Solaris Sparc. All checks are fine on Linux and Solaris
x86_64.

  ...
  PASS: ddt_test
  /bin/bash: line 5: 12453 Bus Error               ${dir}$tst
  FAIL: ddt_raw
  ========================================================
  1 of 5 tests failed
  Please report to http://www.open-mpi.org/community/help/
  ========================================================
  make[3]: *** [check-TESTS] Error 1
  ...

Kind regards

Siegmar

> On May 6, 2013, at 7:20 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hello Jeff
> >
> >>>> "../../../../../ompi-java/opal/mca/if/posix_ipv4/if_posix.c",
> >>>> line 279: undefined struct/union member: ifr_mtu
> >>>>
> >>>> Sigh. Solaris kills me. :-\
> >>>>
> >>>> Just so I understand -- Solaris has SIOCGIFMTU, but doesn't
> >>>> have struct ifreq.ifr_mtu?
> >>>
> >>> I found SIOCGIFMTU in sys/sockio.h with the following comment.
> >>
> >> Is there a Solaris-defined constant we can use here to know
> >> that we're on Solaris? If so, I can effectively make that code
> >> only be there if SIOCGIFMTU exists and we're not on Solaris.
> >
> > I searched our header files for "sunos" and "solaris" with
> > "-ignore-case", but didn't find anything useful. You have a very
> > minimal environment if you use "sh", and you would have a useful
> > environment variable if you use "tcsh".
> >
> > tyr java 321 su -
> > ...
> > # env
> > HOME=/root
> > HZ=
> > LANG=C
> > LC_ALL=C
> > LOGNAME=root
> > MAIL=/var/mail/root
> > PATH=/usr/sbin:/usr/bin
> > SHELL=/sbin/sh
> > TERM=dtterm
> > TZ=Europe/Berlin
> > # tcsh
> > # env | grep TYPE
> > HOSTTYPE=sun4
> > OSTYPE=solaris
> > MACHTYPE=sparc
> > #
> >
> > The best solution would be "uname -s", if that is possible.
> >
> > # /usr/bin/uname -s
> > SunOS

I would be grateful if somebody could solve the problem once more.

Kind regards

Siegmar

------------------------------

Message: 2
Date: Fri, 20 Sep 2013 09:55:44 -0400
From: Noam Bernstein <noam.bernst...@nrl.navy.mil>
To: Open MPI Users <us...@open-mpi.org>
Subject: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <b695c61e-461c-47e1-8634-fb492ca04...@nrl.navy.mil>
Content-Type: text/plain; charset=us-ascii

Hi - we've been using openmpi for a while, but only for the last few
months with torque/maui. Intermittently (maybe 1 in 10 jobs), we get MPI
jobs that fail with the error:

[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194

This is completely unrepeatable - resubmitting the same job almost
always works the second time around. The failing line appears to be
associated with looking for the torque/maui generated node file, and
when I do something like
  echo $PBS_NODEFILE
  cat $PBS_NODEFILE
it appears that the file is present and correct.

We're running OpenMPI 1.6.4, configured with
  ./configure \
    --prefix=${DEST} \
    --with-tm=/usr/local/torque \
    --enable-mpirun-prefix-by-default \
    --with-openib=/usr \
    --with-openib-libdir=/usr/lib64

Has anyone seen anything like this before, or have any ideas of what
might be happening? It appears to be a line where openmpi looks for the
PBS node file, which is on a local filesystem
(e.g. PBS_NODEFILE=/var/spool/torque/aux//4600.tin).

thanks,
Noam

Noam Bernstein
Center for Computational Materials Science
NRL Code 6390
noam.bernst...@nrl.navy.mil

------------------------------

Message: 3
Date: Fri, 20 Sep 2013 10:04:43 -0400
From: Noam Bernstein <noam.bernst...@nrl.navy.mil>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <75e58dcb-47a5-45ac-a9fb-35c0478c2...@nrl.navy.mil>
Content-Type: text/plain; charset=us-ascii

On Sep 20, 2013, at 9:55 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around. The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
>   echo $PBS_NODEFILE
>   cat $PBS_NODEFILE
> it appears that the file is present and correct.

Never mind - I was sure that my earlier tests showed that the
$PBS_NODEFILE was there, but now it seems like every time the job fails
it's because this file really is missing. Time to check why torque isn't
always creating the nodefile.
Noam

------------------------------

Message: 4
Date: Fri, 20 Sep 2013 10:12:39 -0400
From: Noam Bernstein <noam.bernst...@nrl.navy.mil>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <a3a2843b-6af1-4e0d-aac2-df0b55a6a...@nrl.navy.mil>
Content-Type: text/plain; charset=us-ascii

On Sep 20, 2013, at 10:04 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
> was there, but now it seems like every time the job fails it's because this
> file really is missing. Time to check why torque isn't always creating
> the nodefile.

Even weirder now - most of the time jobs fail it's because the
PBS_NODEFILE really is missing. But a small fraction of the time (< 1%)
the PBS_NODEFILE is there, but mpirun still fails in the way my original
message specified.

Has anyone ever seen anything like this before?

Noam

------------------------------

Message: 5
Date: Fri, 20 Sep 2013 16:22:31 +0200
From: Reuti <re...@staff.uni-marburg.de>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <fe881348-7073-4a81-86ab-de1968a01...@staff.uni-marburg.de>
Content-Type: text/plain; charset=us-ascii

Hi,

On 20.09.2013, at 16:12, Noam Bernstein wrote:

> On Sep 20, 2013, at 10:04 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
>> was there, but now it seems like every time the job fails it's because this
>> file really is missing. Time to check why torque isn't always creating
>> the nodefile.
>
> Even weirder now - most of the time jobs fail it's because the PBS_NODEFILE
> is really missing. But a small fraction of the time (< 1%) the PBS_NODEFILE
> is there, but mpirun still fails in the way my original message specified.
>
> Has anyone ever seen anything like this before?

Is the location for the spool directory local or shared by NFS? Disk full?

-- Reuti

------------------------------

Message: 6
Date: Fri, 20 Sep 2013 10:36:21 -0400
From: Noam Bernstein <noam.bernst...@nrl.navy.mil>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <ab5dde1f-887a-45cc-b0fe-0f8bf4d11...@nrl.navy.mil>
Content-Type: text/plain; charset=us-ascii

On Sep 20, 2013, at 10:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
> Is the location for the spool directory local or shared by NFS? Disk full?

No - locally mounted, and far from full on all the nodes.

Noam

------------------------------

Message: 7
Date: Fri, 20 Sep 2013 10:40:58 -0400
From: Noam Bernstein <noam.bernst...@nrl.navy.mil>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <cbe710d4-2d84-4392-bd1e-85e8de5d5...@nrl.navy.mil>
Content-Type: text/plain; charset=us-ascii

On Sep 20, 2013, at 10:36 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> On Sep 20, 2013, at 10:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Is the location for the spool directory local or shared by NFS? Disk full?
>
> No - locally mounted, and far from full on all the nodes.

Another new observation, which may shift the focus to torque. I just
rebooted some of the nodes that were showing this behavior. So far, none
of them have shown it in a few hundred test jobs, while before at least
1-5 of each set of 100 had failures.

Noam

------------------------------

Message: 8
Date: Fri, 20 Sep 2013 08:49:20 -0600
From: Lloyd Brown <lloyd_br...@byu.edu>
To: Open MPI Users <us...@open-mpi.org>
Subject: [OMPI users] Debugging Runtime/Ethernet Problems
Message-ID: <523c6070.7000...@byu.edu>
Content-Type: text/plain; charset=ISO-8859-1

Hi, all.

We've got a couple of clusters running RHEL 6.2, and have several
centrally-installed versions/compilations of OpenMPI. Some of the nodes
have 4x QDR InfiniBand, and all the nodes have 1 Gigabit Ethernet.

I was gathering some bandwidth and latency numbers using the OSU/OMB
tests, and noticed some weird behavior. When I run a simple
"mpirun ./osu_bw" on a couple of IB-enabled nodes, I get numbers
consistent with our IB speed (up to about 3800 MB/s), and when I run the
same thing on two nodes with only Ethernet, I get speeds consistent with
that (up to about 120 MB/s). So far, so good.

The trouble is when I try to add some "--mca" parameters to force it to
use TCP/Ethernet: the program seems to hang. I get the headers of the
"osu_bw" output, but no results, even for the first case (1-byte payload
per packet). This occurs on both the IB-enabled nodes and the
Ethernet-only nodes. The specific syntax I was using was:
  "mpirun --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"

The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
1.6.5 compiled with Intel 13.0.1 compilers. I haven't tested any other
combinations yet.

Any ideas here? It's very possible this is a system configuration
problem, but I don't know where to look. At this point, any ideas would
be welcome, either about the specific situation or general pointers on
mpirun debugging flags to use. I can't find much in the docs yet on
run-time debugging for OpenMPI, as opposed to debugging the application.
Maybe I'm just looking in the wrong place.
Thanks,

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

------------------------------

Message: 9
Date: Fri, 20 Sep 2013 15:05:14 +0000
From: "Elken, Tom" <tom.el...@intel.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
Message-ID: <1182fb2b5679ce4b8bad97725f014bb73284e...@fmsmsx104.amr.corp.intel.com>
Content-Type: text/plain; charset="us-ascii"

> The trouble is when I try to add some "--mca" parameters to force it to
> use TCP/Ethernet, the program seems to hang. I get the headers of the
> "osu_bw" output, but no results, even on the first case (1 byte payload
> per packet). This is occurring on both the IB-enabled nodes, and on the
> Ethernet-only nodes. The specific syntax I was using was: "mpirun
> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"

When we want to run over TCP and IPoIB on an IB/PSM-equipped cluster, we use:
  --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm

Based on this, it looks like the following might work for you:
  --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib

If you don't have ib0 ports configured on the IB nodes, you probably
don't need the "--mca btl_tcp_if_exclude ib0".

-Tom

> The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
> compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
> 1.6.5 compiled with Intel 13.0.1 compilers. I haven't tested any other
> combinations yet.
>
> Any ideas here? It's very possible this is a system configuration
> problem, but I don't know where to look. At this point, any ideas would
> be welcome, either about the specific situation, or general pointers on
> mpirun debugging flags to use. I can't find much in the docs yet on
> run-time debugging for OpenMPI, as opposed to debugging the application.
> Maybe I'm just looking in the wrong place.
>
> Thanks,
>
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

Message: 10
Date: Fri, 20 Sep 2013 08:17:43 -0700
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
Message-ID: <917b367f-1687-4a91-b173-de1bba7c7...@open-mpi.org>
Content-Type: text/plain; charset=us-ascii

I don't think you are allowed to specify both include and exclude
options at the same time, as they conflict - you should either exclude
ib0 or include eth0 (or whatever).

My guess is that the various nodes are trying to communicate across
disjoint networks. We've seen that before when, for example, eth0 on one
node is on one subnet, and eth0 on another node is on a different
subnet. You might look for that kind of arrangement.

On Sep 20, 2013, at 8:05 AM, "Elken, Tom" <tom.el...@intel.com> wrote:

>> The trouble is when I try to add some "--mca" parameters to force it to
>> use TCP/Ethernet, the program seems to hang. I get the headers of the
>> "osu_bw" output, but no results, even on the first case (1 byte payload
>> per packet). This is occurring on both the IB-enabled nodes, and on the
>> Ethernet-only nodes. The specific syntax I was using was: "mpirun
>> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
>
> When we want to run over TCP and IPoIB on an IB/PSM equipped cluster, we use:
> --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm
>
> based on this, it looks like the following might work for you:
> --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib
>
> If you don't have ib0 ports configured on the IB nodes, probably you don't need the "--mca btl_tcp_if_exclude ib0".
>
> -Tom
>
>> The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
>> compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
>> 1.6.5 compiled with Intel 13.0.1 compilers. I haven't tested any other
>> combinations yet.
>>
>> Any ideas here? It's very possible this is a system configuration
>> problem, but I don't know where to look. At this point, any ideas would
>> be welcome, either about the specific situation, or general pointers on
>> mpirun debugging flags to use. I can't find much in the docs yet on
>> run-time debugging for OpenMPI, as opposed to debugging the application.
>> Maybe I'm just looking in the wrong place.
>>
>> Thanks,
>>
>> --
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

Message: 11
Date: Fri, 20 Sep 2013 15:28:48 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] compilation aborted for Handler.cpp (code 2)
Message-ID: <ef66bbeb19badc41ac8ccf5f684f07fc4f8bc...@xmb-rcd-x01.cisco.com>
Content-Type: text/plain; charset="iso-8859-1"

I can't tell if this is a busted compiler installation or not. The first error is:

-----
/usr/include/c++/4.6.3/bits/stl_algobase.h(573): error: type name is not allowed
      const bool __simple = (__is_trivial(_ValueType1)
                             ^
          detected during:
            instantiation of "_BI2 std::__copy_move_backward_a2<_IsMove,_BI1,_BI2>(_BI1, _BI1, _BI2) [with _IsMove=false, _BI1=uint32_t={unsigned int} *, _BI2=uint32_t={unsigned int} *]" at line 625
            instantiation of "_BI2 std::copy_backward(_BI1, _BI1, _BI2) [with _BI1=uint32_t={unsigned int} *, _BI2=uint32_t={unsigned int} *]" at line 315 of "/usr/include/c++/4.6.3/bits/vector.tcc"
            instantiation of "void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc>>, const _Tp &) [with _Tp=uint32_t={unsigned int}, _Alloc=std::allocator<uint32_t={unsigned int}>]" at line 834 of "/usr/include/c++/4.6.3/bits/stl_vector.h"
            instantiation of "void std::vector<_Tp, _Alloc>::push_back(const _Tp &) [with _Tp=uint32_t={unsigned int}, _Alloc=std::allocator<uint32_t={unsigned int}>]" at line 42 of "Handler.cpp"
-----

I verified that OMPI 1.6.5 builds fine for me for icpc/13.1.0.146 Build
20130121 on RHEL 6. Perhaps you have some kind of bad interaction
between your icpc installation and your local g++ installation...?

On Sep 18, 2013, at 12:58 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

> Please find attached again.
>
> On Tue, Sep 17, 2013 at 11:35 AM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> On Sep 16, 2013, at 9:00 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>
>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>> but getting the subject error. config.out and make.out is attached.
>>> Following command was used for configure
>>>
>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>> tee config.out
>>
>> I'm sorry; I can't open a .rar file. Can you send the logs compressed
>> with a conventional compression program like gzip, bzip2, or zip?
>>
>> Thanks.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014
> <logs.zip>_______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------

Message: 12
Date: Fri, 20 Sep 2013 15:33:43 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
Message-ID: <ef66bbeb19badc41ac8ccf5f684f07fc4f8bc...@xmb-rcd-x01.cisco.com>
Content-Type: text/plain; charset="us-ascii"

Correct -- it doesn't make sense to specify both include *and* exclude:
by specifying one, you're implicitly (but exactly/precisely) specifying
the other.

My suggestion would be to use positive notation, not negative notation.
For example:

  mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 ...

That way, you *know* you're only getting the TCP and self BTLs, and you
*know* you're only getting eth0. If that works, then spread out from
there, e.g.:

  mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth1 ...

I.e., also include the "sm" BTL (which is only used for shared-memory
communication between 2 procs on the same server, and is therefore
useless for a 2-proc-across-2-server run of osu_bw, but you get the
idea), but also use eth0 and eth1. And so on.

The problem with using ^openib and/or btl_tcp_if_exclude is that you
might end up using some BTLs and/or TCP interfaces that you don't
expect, and therefore can run into problems.

Make sense?

On Sep 20, 2013, at 11:17 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I don't think you are allowed to specify both include and exclude options at the same time as they conflict - you should either exclude ib0 or include eth0 (or whatever).
>
> My guess is that the various nodes are trying to communicate across disjoint networks. We've seen that before when, for example, eth0 on one node is on one subnet, and eth0 on another node is on a different subnet. You might look for that kind of arrangement.
>
> On Sep 20, 2013, at 8:05 AM, "Elken, Tom" <tom.el...@intel.com> wrote:
>
>>> The trouble is when I try to add some "--mca" parameters to force it to
>>> use TCP/Ethernet, the program seems to hang. I get the headers of the
>>> "osu_bw" output, but no results, even on the first case (1 byte payload
>>> per packet). This is occurring on both the IB-enabled nodes, and on the
>>> Ethernet-only nodes. The specific syntax I was using was: "mpirun
>>> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
>>
>> When we want to run over TCP and IPoIB on an IB/PSM equipped cluster, we use:
>> --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm
>>
>> based on this, it looks like the following might work for you:
>> --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib
>>
>> If you don't have ib0 ports configured on the IB nodes, probably you don't need the "--mca btl_tcp_if_exclude ib0".
>>
>> -Tom
>>
>>> The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
>>> compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
>>> 1.6.5 compiled with Intel 13.0.1 compilers. I haven't tested any other
>>> combinations yet.
>>>
>>> Any ideas here? It's very possible this is a system configuration
>>> problem, but I don't know where to look. At this point, any ideas would
>>> be welcome, either about the specific situation, or general pointers on
>>> mpirun debugging flags to use. I can't find much in the docs yet on
>>> run-time debugging for OpenMPI, as opposed to debugging the application.
>>> Maybe I'm just looking in the wrong place.
>>>
>>> Thanks,
>>>
>>> --
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------

Message: 13
Date: Fri, 20 Sep 2013 15:35:59 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] compilation aborted for Handler.cpp (code 2)
Message-ID: <ef66bbeb19badc41ac8ccf5f684f07fc4f8bc...@xmb-rcd-x01.cisco.com>
Content-Type: text/plain; charset="iso-8859-1"

Sorry for the delay replying -- I actually replied on the original
thread yesterday, but it got hung up in my outbox and I didn't notice
that it didn't actually go out until a few moments ago. :-(

I'm *guessing* that this is a problem with your local icpc installation.
Can you compile / run other C++ codes that use the STL with icpc?

On Sep 20, 2013, at 6:59 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

> Output of make V=1 is attached. Again same error. If the Intel compiler
> is using C++ headers from GCC, then how can we avoid this?
>
> On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg
> <bert.wes...@googlemail.com> wrote:
>> Hi,
>>
>> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>> but getting the subject error. config.out and make.out is attached.
>>> Following command was used for configure
>>>
>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>> tee config.out
>>
>> could you also run make with 'make V=1' and send the output. Anyway it
>> looks like the intel compiler uses the C++ headers from GCC 4.6.3 and
>> I don't know if this is supported.
>>
>> Bert
>>
>>> Please help/advise.
>>> Thank you and best regards
>>> Ahsan
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014
> <makeV.zip>_______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------

Message: 14
Date: Fri, 20 Sep 2013 11:52:11 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
Message-ID: <523c6f2b.60...@ldeo.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Noam

Could it be that Torque, or probably more likely NFS, is too slow to
create/make available the PBS_NODEFILE?

What if you insert a "sleep 2", or whatever number of seconds you want,
before the mpiexec command line? Or maybe better, a
"ls -l $PBS_NODEFILE; cat $PBS_NODEFILE", just to make sure the file is
available and filled with the node list before mpiexec takes over?

My two cents,
Gus Correa

On 09/20/2013 09:55 AM, Noam Bernstein wrote:
> Hi - we've been using openmpi for a while, but only for the last few months
> with torque/maui.
Intermittently (maybe 1/10 jobs), we get mpi jobs that fail with the error: > > [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142 > [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82 > [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149 > [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99 > [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194 > > This is completely unrepeatable - resubmitting the same job almost > always works the second time around. The line appears to be > associated with looking for the torque/maui generated node file, > and when I do something like > echo $PBS_NODEFILE > cat $PBS_NODEFILE > it appears that the file is present and correct. > > We're running OpenMPI 1.6.4, configured with > ./configure \ > --prefix=${DEST} \ > --with-tm=/usr/local/torque \ > --enable-mpirun-prefix-by-default \ > --with-openib=/usr \ > --with-openib-libdir=/usr/lib64 > > Has anyone seen anything like this before, or has any ideas of what might > be happening? It appears to be a line where openmpi looks for > the PBS node file, which is on a local filesystem (e.g. PBS_NODEFILE=/var/spool/torque/aux//4600.tin). 
>
> thanks,
> Noam
>
>
> Noam Bernstein
> Center for Computational Materials Science
> NRL Code 6390
> noam.bernst...@nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

Message: 15
Date: Fri, 20 Sep 2013 15:52:43 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>, Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] error building openmpi-1.7.3a1r29213 on Solaris
Message-ID: <ef66bbeb19badc41ac8ccf5f684f07fc4f8bc...@xmb-rcd-x01.cisco.com>
Content-Type: text/plain; charset="us-ascii"

Looks like Ralph noticed that we fixed this on the trunk and forgot to bring it over to v1.7. I just committed it on v1.7 in r29215. Give it a whirl in tonight's v1.7 nightly tarball.

On Sep 20, 2013, at 7:00 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I tried to install openmpi-1.7.3a1r29213 on "openSuSE Linux 12.1",
> "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" and
> gcc-4.8.0 in 64-bit mode. Unfortunately "make" breaks with the same
> error for both compilers on both Solaris platforms.
>
>
> tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_cc 126 tail -10 \
>   log.make.SunOS.sparc.64_cc
> Making all in mca/if/posix_ipv4
> make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
>   CC       if_posix.lo
> "../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c",
> line 277: undefined struct/union member: ifr_mtu
> cc: acomp failed for
> ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `.../opal'
> make: *** [all-recursive] Error 1
>
>
> tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_gcc 131 tail -12 \
>   log.make.SunOS.sparc.64_gcc
> Making all in mca/if/posix_ipv4
> make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
>   CC       if_posix.lo
> ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:
> In function 'if_posix_open':
> ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:277:31:
> error: 'struct ifreq' has no member named 'ifr_mtu'
>    intf->if_mtu = ifr->ifr_mtu;
>                        ^
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `.../opal'
> make: *** [all-recursive] Error 1
>
>
> I have had this problem before and Jeff solved it. Here is my
> old e-mail.
>
> Date: Tue, 7 May 2013 19:38:11 +0200 (CEST)
> From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> Subject: Re: commit/ompi-java: jsquyres: Up to SVN r28392
> To: jsquy...@cisco.com
> Cc: siegmar.gr...@informatik.hs-fulda.de
>
> Hello Jeff
>
>> Ok, I made a change in the OMPI trunk that should fix this:
>>
>> https://svn.open-mpi.org/trac/ompi/changeset/28460
>>
>> And I pulled it into the ompi-java hg repo. Could you give
>> it a whirl and let me know if this works for you?
>
> Perfect :-)))). Now I can build Open MPI on Solaris without
> "#if 0" :-). Thank you very much for your help.
>
> "make check" still produces the old bus error on Solaris Sparc.
> All checks are fine on Linux and Solaris x86_64.
>
> ...
> PASS: ddt_test
> /bin/bash: line 5: 12453 Bus Error               ${dir}$tst
> FAIL: ddt_raw
> ========================================================
> 1 of 5 tests failed
> Please report to http://www.open-mpi.org/community/help/
> ========================================================
> make[3]: *** [check-TESTS] Error 1
> ...
>
> Kind regards
>
> Siegmar
>
>
>> On May 6, 2013, at 7:20 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hello Jeff
>>>
>>>>>> "../../../../../ompi-java/opal/mca/if/posix_ipv4/if_posix.c",
>>>>>> line 279: undefined struct/union member: ifr_mtu
>>>>>>
>>>>>> Sigh. Solaris kills me. :-\
>>>>>>
>>>>>> Just so I understand -- Solaris has SIOCGIFMTU, but doesn't
>>>>>> have struct ifreq.ifr_mtu?
>>>>>
>>>>> I found SIOCGIFMTU in sys/sockio.h with the following comment.
>>>>
>>>> Is there a Solaris-defined constant we can use here to know
>>>> that we're on Solaris? If so, I can effectively make that code
>>>> only be there if SIOCGIFMTU exists and we're not on Solaris.
>>>
>>> I searched our header files for "sunos" and "solaris" with
>>> "-ignore-case", but didn't find anything useful. You have a very
>>> minimal environment if you use "sh", and you would have a useful
>>> environment variable if you use "tcsh".
>>>
>>> tyr java 321 su -
>>> ...
>>> # env
>>> HOME=/root
>>> HZ=
>>> LANG=C
>>> LC_ALL=C
>>> LOGNAME=root
>>> MAIL=/var/mail/root
>>> PATH=/usr/sbin:/usr/bin
>>> SHELL=/sbin/sh
>>> TERM=dtterm
>>> TZ=Europe/Berlin
>>> # tcsh
>>> # env | grep TYPE
>>> HOSTTYPE=sun4
>>> OSTYPE=solaris
>>> MACHTYPE=sparc
>>> #
>>>
>>> The best solution would be "uname -s", if that is possible.
>>>
>>> # /usr/bin/uname -s
>>> SunOS
>
>
> I would be grateful if somebody could solve the problem once more.
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------

Subject: Digest Footer

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

End of users Digest, Vol 2685, Issue 2
**************************************