Okay. I’ll keep my responses limited to the other thread. Thanks,
Karl

On Dec 3, 2013, at 9:54 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Ok, I think we're chasing the same thing in multiple threads -- this looks like a similar result to what you replied to Ralph with.
>
> Let's keep the other thread (with Ralph) going; this looks like some kind of networking issue that we haven't seen before (e.g., unable to open ports to the local host), which is a little odd, but let's run it down over in the other thread.
>
>
> On Dec 3, 2013, at 7:44 AM, "Meredith, Karl" <karl.mered...@fmglobal.com> wrote:
>
>> Using the latest nightly snapshot (1.7.4) and only Apple compilers/tools (no MacPorts), I configure/build with the following:
>>
>> ./configure --prefix=/opt/trunk/apple-only-1.7.4 --enable-shared --disable-static --enable-debug --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
>> make all
>> make install
>> export PATH=/opt/trunk/apple-only-1.7.4/bin/:$PATH
>> export LD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$LD_LIBRARY_PATH
>> export DYLD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$DYLD_LIBRARY_PATH
>> cd examples
>> make all
>> mpirun -v -np 2 ./hello_cxx
>>
>> Here’s the stack trace for one of the hanging processes:
>>
>> (lldb) bt
>> * thread #1: tid = 0x57052, 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread, stop reason = signal SIGSTOP
>>   frame #0: 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10
>>   frame #1: 0x00007fff8ade4e60 libsystem_c.dylib`nanosleep + 200
>>   frame #2: 0x0000000100be98e3 libopen-rte.6.dylib`orte_routed_base_register_sync(setup=true) + 2435 at routed_base_fns.c:344
>>   frame #3: 0x0000000100ecc3a7 mca_routed_binomial.so`init_routes(job=1305542657, ndat=0x0000000000000000) + 2759 at routed_binomial.c:708
>>   frame #4: 0x0000000100b9e84d libopen-rte.6.dylib`orte_ess_base_app_setup(db_restrict_local=true) + 2109 at ess_base_std_app.c:233
>>   frame #5: 0x0000000100e3a442 mca_ess_env.so`rte_init + 418 at ess_env_module.c:146
>>   frame #6: 0x0000000100b59cfe libopen-rte.6.dylib`orte_init(pargc=0x0000000000000000, pargv=0x0000000000000000, flags=32) + 718 at orte_init.c:158
>>   frame #7: 0x00000001008bd3c8 libmpi.1.dylib`ompi_mpi_init(argc=0, argv=0x0000000000000000, requested=0, provided=0x00007fff5f3cd370) + 616 at ompi_mpi_init.c:451
>>   frame #8: 0x000000010090b5c3 libmpi.1.dylib`MPI_Init(argc=0x0000000000000000, argv=0x0000000000000000) + 515 at init.c:86
>>   frame #9: 0x0000000100833a1d hello_cxx`MPI::Init() + 29 at functions_inln.h:128
>>   frame #10: 0x00000001008332ac hello_cxx`main(argc=1, argv=0x00007fff5f3cd550) + 44 at hello_cxx.cc:18
>>   frame #11: 0x00007fff8d5df5fd libdyld.dylib`start + 1
>>
>> Karl
>>
>>
>> On Dec 2, 2013, at 2:33 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>>> Ah -- sorry, I missed this mail before I replied to the other thread (OS X Mail threaded them separately somehow...).
>>>
>>> Sorry to ask you to dive deeper, but can you find out where in orte_ess.init() it's failing? orte_ess.init is actually a function pointer; it's a jump-off point into a dlopen'ed plugin.
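As an aside on Jeff's last point above: the "function pointer into a dlopen'ed plugin" structure can be sketched in a few lines of C++. The struct, symbol, and file names below are purely illustrative and are not the real ORTE/MCA code; the point is only that a call such as orte_ess.init() dispatches into whichever component was loaded at runtime, so a debugger has to step into it to see where execution really goes.

    #include <dlfcn.h>    // dlopen, dlsym, dlerror, dlclose
    #include <cstdio>

    // Hypothetical stand-in for a component framework: a struct of function
    // pointers that the selected plugin fills in after being dlopen'ed.
    struct ess_module {
        int (*init)(void);
    };

    static ess_module orte_ess = { nullptr };

    int main() {
        void *handle = dlopen("./mca_ess_env.so", RTLD_NOW);  // plugin path is illustrative
        if (handle == nullptr) {
            std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }
        // The plugin exports a module struct; the framework copies its pointers.
        ess_module *mod = static_cast<ess_module *>(dlsym(handle, "ess_env_module"));
        if (mod == nullptr || mod->init == nullptr) {
            std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
            dlclose(handle);
            return 1;
        }
        orte_ess = *mod;
        // This indirect call is the "jump-off point": which code runs depends
        // entirely on which plugin was opened above.
        return orte_ess.init();
    }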
>>>
>>>
>>> On Nov 25, 2013, at 11:53 AM, "Meredith, Karl" <karl.mered...@fmglobal.com> wrote:
>>>
>>>> Digging a little deeper by running the code in the lldb debugger, I found that the stall occurs in the call to orte_init from ompi_mpi_init.c:
>>>> 356    /* Setup ORTE - note that we are an MPI process */
>>>> 357    if (ORTE_SUCCESS != (ret = orte_init(NULL, NULL, ORTE_PROC_MPI))) {
>>>> 358        error = "ompi_mpi_init: orte_init failed";
>>>> 359        goto error;
>>>> 360    }
>>>>
>>>> The code never returns from orte_init.
>>>>
>>>> It gets stuck in orte_ess.init(), called from orte_init.c:
>>>> 126    /* initialize the RTE for this environment */
>>>> 127    if (ORTE_SUCCESS != (ret = orte_ess.init())) {
>>>>
>>>> When I step through orte_ess.init in the lldb debugger, I actually get some output from the code (there is no output if I run without the debugger):
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> ompi_mpi_init: orte_init failed
>>>> --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0)
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Nov 25, 2013, at 9:20 AM, Meredith, Karl <karl.mered...@fmglobal.com> wrote:
>>>>
>>>>> Here’s the back trace from lldb:
>>>>> $ )ps -elf | grep hello
>>>>> 1042653210 45231 45230 4006 0 31 0 2448976 2148 - S+ 0 ttys002 0:00.01 hello_cxx 9:07AM
>>>>> 1042653210 45232 45230 4006 0 31 0 2457168 2156 - S+ 0 ttys002 0:00.04 hello_cxx 9:07AM
>>>>>
>>>>> (meredithk@meredithk-mac)-(09:15 AM Mon Nov 25)-(~/tools/openmpi-1.6.5/examples)
>>>>> $ )lldb -p 45231
>>>>> Attaching to process with:
>>>>> process attach -p 45231
>>>>> Process 45231 stopped
>>>>> Executable module set to "/Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx".
>>>>> Architecture set to: x86_64-apple-macosx.
>>>>> (lldb) bt
>>>>> * thread #1: tid = 0x168535, 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10, queue = 'com.apple.main-thread, stop reason = signal SIGSTOP
>>>>>   frame #0: 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10
>>>>>   frame #1: 0x0000000106b73ea0 libmpi.1.dylib`select_dispatch(base=0x00007f84c3c0b430, arg=0x00007f84c3c0b3e0, tv=0x00007fff5924ca70) + 80 at select.c:174
>>>>>   frame #2: 0x0000000106b3eb0f libmpi.1.dylib`opal_event_base_loop(base=0x00007f84c3c0b430, flags=5) + 415 at event.c:838
>>>>>
>>>>> Both processes are at this state.
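For anyone trying to reproduce the hang, the hello_cxx program that both backtraces end in is essentially the minimal MPI C++ example below. This is a sketch along the lines of the file shipped in the Open MPI examples directory, not a verbatim copy of it.

    #include <iostream>
    #include "mpi.h"

    int main()
    {
        // The hang reported in this thread occurs inside this call.
        MPI::Init();

        int rank = MPI::COMM_WORLD.Get_rank();
        int size = MPI::COMM_WORLD.Get_size();
        std::cout << "Hello, world! I am " << rank << " of " << size << std::endl;

        MPI::Finalize();
        return 0;
    }

Built and launched the same way as elsewhere in the thread: mpic++ -g hello_cxx.cc -o hello_cxx, then mpirun -np 2 ./hello_cxx.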
>>>>>
>>>>> Here’s the output from otool -L ./hello_cxx:
>>>>>
>>>>> $ )otool -L ./hello_cxx
>>>>> ./hello_cxx:
>>>>>     /Users/meredithk/tools/openmpi/lib/libmpi_cxx.1.dylib (compatibility version 2.0.0, current version 2.2.0)
>>>>>     /Users/meredithk/tools/openmpi/lib/libmpi.1.dylib (compatibility version 2.0.0, current version 2.8.0)
>>>>>     /opt/local/lib/libgcc/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.18.0)
>>>>>     /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
>>>>>     /opt/local/lib/libgcc/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
>>>>>
>>>>>
>>>>> On Nov 25, 2013, at 9:14 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> Mac OS X 10.9 dropped support for gdb. Please report the output of lldb instead.
>>>>>>
>>>>>> Also, can you run “otool -L ./hello_cxx” and report the output?
>>>>>>
>>>>>> Thanks,
>>>>>> George.
>>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 15:09, Meredith, Karl <karl.mered...@fmglobal.com> wrote:
>>>>>>
>>>>>>> I do have DYLD_LIBRARY_PATH set to the same paths as LD_LIBRARY_PATH. This does not resolve the problem. The code still hangs on MPI::Init().
>>>>>>>
>>>>>>> Another thing I tried: I recompiled Open MPI with the debug flags enabled:
>>>>>>> ./configure --prefix=$HOME/tools/openmpi --enable-debug
>>>>>>> make
>>>>>>> make install
>>>>>>>
>>>>>>> Then I attached to the running process with gdb and tried a back trace to see where it was hanging, but all I got was this:
>>>>>>> Attaching to process 45231
>>>>>>> Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx...Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx.dSYM/Contents/Resources/DWARF/hello_cxx...done.
>>>>>>> done.
>>>>>>> 0x00007fff8c1859aa in ?? ()
>>>>>>> (gdb) bt
>>>>>>> #0  0x00007fff8c1859aa in ?? ()
>>>>>>> #1  0x0000000106b73ea0 in ?? ()
>>>>>>> #2  0x706d6e65706f2f2f in ?? ()
>>>>>>> #3  0x0000000000000001 in ?? ()
>>>>>>> #4  0x0000000000000000 in ?? ()
>>>>>>>
>>>>>>> This output from gdb was not terribly helpful to me.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Nov 25, 2013, at 8:30 AM, Hammond, Simon David (-EXP) <sdha...@sandia.gov> wrote:
>>>>>>>
>>>>>>> We have occasionally had a problem like this when we set LD_LIBRARY_PATH only. On OS X you may need to set DYLD_LIBRARY_PATH instead (set it to the same lib directory).
>>>>>>>
>>>>>>> Can you try that and see if it resolves the problem?
>>>>>>>
>>>>>>>
>>>>>>> Si Hammond
>>>>>>> Sandia National Laboratories
>>>>>>> Remote Connection
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Meredith, Karl [karl.mered...@fmglobal.com]
>>>>>>> Sent: Monday, November 25, 2013 06:25 AM Mountain Standard Time
>>>>>>> To: Open MPI Users
>>>>>>> Subject: [EXTERNAL] Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)
>>>>>>>
>>>>>>>
>>>>>>> I do have these two environment variables set:
>>>>>>>
>>>>>>> LD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib
>>>>>>> PATH=/Users/meredithk/tools/openmpi/bin
>>>>>>>
>>>>>>> Running mpirun seems to work fine with a simple command like hostname:
>>>>>>>
>>>>>>> $ )mpirun -n 2 hostname
>>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>>>
>>>>>>> I am trying to run the simple hello_cxx example from the Open MPI distribution, compiled as follows:
>>>>>>> mpic++ -g hello_cxx.cc -o hello_cxx
>>>>>>>
>>>>>>> It compiles fine, without warning or error. However, when I go to run the example, it stalls on the MPI::Init() call:
>>>>>>> mpirun -np 1 hello_cxx
>>>>>>> It never errors out or crashes. It simply hangs.
>>>>>>>
>>>>>>> I am using mpic++ and mpirun from the same installation:
>>>>>>> $ )which mpirun
>>>>>>> /Users/meredithk/tools/openmpi/bin/mpirun
>>>>>>>
>>>>>>> $ )which mpic++
>>>>>>> /Users/meredithk/tools/openmpi/bin/mpic++
>>>>>>>
>>>>>>> Not quite sure what else to check.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Nov 23, 2013, at 5:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Strange - I run on Mavericks now without problem. Can you run "mpirun -n 1 hostname"?
>>>>>>>>
>>>>>>>> You also might want to check your PATH and LD_LIBRARY_PATH to ensure you have the prefix where you installed OMPI 1.6.5 at the front. Mac distributes a very old version of OMPI with its software, and you don't want to pick it up by mistake.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 22, 2013, at 1:45 PM, Meredith, Karl <karl.mered...@fmglobal.com> wrote:
>>>>>>>>
>>>>>>>>> I recently upgraded my 2013 MacBook Pro (Retina display) from 10.8 to 10.9. I downloaded and installed openmpi-1.6.5 and compiled it with gcc 4.8 (gcc installed from MacPorts).
>>>>>>>>> Open MPI compiled and installed without error.
>>>>>>>>>
>>>>>>>>> However, when I try to run any of the example test cases, the code gets stuck inside the first MPI::Init() call and never returns.
>>>>>>>>>
>>>>>>>>> Any thoughts on what might be going wrong?
>>>>>>>>>
>>>>>>>>> The same install on OS 10.8 works fine and the example test cases run without error.
>>>>>>>>>
>>>>>>>>> Karl
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
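One way to follow up on the DYLD_LIBRARY_PATH suggestion earlier in the thread is to launch a tiny environment checker under mpirun and confirm which variables the child processes actually inherit (on OS X the dynamic linker consults DYLD_LIBRARY_PATH rather than LD_LIBRARY_PATH). The checker below is an illustrative sketch, not something posted in the thread.

    #include <cstdio>
    #include <cstdlib>   // std::getenv

    int main() {
        // Print the path-related variables as seen by this process.
        const char *vars[] = { "PATH", "LD_LIBRARY_PATH", "DYLD_LIBRARY_PATH" };
        for (const char *name : vars) {
            const char *value = std::getenv(name);
            std::printf("%s=%s\n", name, value ? value : "(unset)");
        }
        return 0;
    }

Compile with mpic++ envcheck.cc -o envcheck (the file name is hypothetical) and run with mpirun -np 1 ./envcheck; the output shows the environment as seen by the launched process rather than by the interactive shell.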