Okay.  I’ll keep my responses limited to the other thread.

Thanks,

Karl


On Dec 3, 2013, at 9:54 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Ok, I think we're chasing the same thing in multiple threads -- this looks 
> like a similar result to what you replied to Ralph with.
>
> Let's keep the other thread (with Ralph) going; this looks like some kind of 
> networking issue that we haven't seen before (e.g., unable to open ports to 
> the local host).  Which is a little odd, but let's run it down over in the 
> other thread.
>
>
> On Dec 3, 2013, at 7:44 AM, "Meredith, Karl" <karl.mered...@fmglobal.com> 
> wrote:
>
>> Using the latest nightly snapshot (1.7.4) and only Apple compilers/tools (no 
>> MacPorts), I configure/build with the following:
>>
>> ./configure --prefix=/opt/trunk/apple-only-1.7.4 --enable-shared 
>> --disable-static --enable-debug --disable-io-romio 
>> --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
>> make all
>> make install
>> export PATH=/opt/trunk/apple-only-1.7.4/bin/:$PATH
>> export LD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$LD_LIBRARY_PATH
>> export DYLD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$DYLD_LIBRARY_PATH
>> cd examples
>> make all
>> mpirun -v -np 2 ./hello_cxx
>>
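The exports above can be sanity-checked before launching; a minimal sketch (the prefix matches the configure line above and is an example to adjust):

```shell
# Sanity-check the environment before running mpirun.
# PREFIX is the --prefix from the configure line above; adjust as needed.
PREFIX=/opt/trunk/apple-only-1.7.4
export PATH="$PREFIX/bin:$PATH"
export DYLD_LIBRARY_PATH="$PREFIX/lib:${DYLD_LIBRARY_PATH:-}"

# The freshly installed bin and lib directories should now come first:
echo "$PATH" | cut -d: -f1
echo "$DYLD_LIBRARY_PATH" | cut -d: -f1
```

If the first entries printed are not the new install, an older system-provided Open MPI may be picked up by mistake.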
>> Here’s the stack trace for one of the hanging processes:
>>
>> (lldb) bt
>> * thread #1: tid = 0x57052, 0x00007fff8c991a3a 
>> libsystem_kernel.dylib`__semwait_signal + 10, queue = 
>> 'com.apple.main-thread', stop reason = signal SIGSTOP
>>   frame #0: 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10
>>   frame #1: 0x00007fff8ade4e60 libsystem_c.dylib`nanosleep + 200
>>   frame #2: 0x0000000100be98e3 
>> libopen-rte.6.dylib`orte_routed_base_register_sync(setup=true) + 2435 at 
>> routed_base_fns.c:344
>>   frame #3: 0x0000000100ecc3a7 
>> mca_routed_binomial.so`init_routes(job=1305542657, ndat=0x0000000000000000) 
>> + 2759 at routed_binomial.c:708
>>   frame #4: 0x0000000100b9e84d 
>> libopen-rte.6.dylib`orte_ess_base_app_setup(db_restrict_local=true) + 2109 
>> at ess_base_std_app.c:233
>>   frame #5: 0x0000000100e3a442 mca_ess_env.so`rte_init + 418 at 
>> ess_env_module.c:146
>>   frame #6: 0x0000000100b59cfe 
>> libopen-rte.6.dylib`orte_init(pargc=0x0000000000000000, 
>> pargv=0x0000000000000000, flags=32) + 718 at orte_init.c:158
>>   frame #7: 0x00000001008bd3c8 libmpi.1.dylib`ompi_mpi_init(argc=0, 
>> argv=0x0000000000000000, requested=0, provided=0x00007fff5f3cd370) + 616 at 
>> ompi_mpi_init.c:451
>>   frame #8: 0x000000010090b5c3 
>> libmpi.1.dylib`MPI_Init(argc=0x0000000000000000, argv=0x0000000000000000) + 
>> 515 at init.c:86
>>   frame #9: 0x0000000100833a1d hello_cxx`MPI::Init() + 29 at 
>> functions_inln.h:128
>>   frame #10: 0x00000001008332ac hello_cxx`main(argc=1, 
>> argv=0x00007fff5f3cd550) + 44 at hello_cxx.cc:18
>>   frame #11: 0x00007fff8d5df5fd libdyld.dylib`start + 1
>>
>> Karl
>>
>>
>> On Dec 2, 2013, at 2:33 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>> wrote:
>>
>>> Ah -- sorry, I missed this mail before I replied to the other thread (OS X 
>>> Mail threaded them separately somehow...).
>>>
>>> Sorry to ask you to dive deeper, but can you find out where in 
>>> orte_ess.init() it's failing?  orte_ess.init is actually a function 
>>> pointer; it's a jump-off point into a dlopen'ed plugin.
>>>
>>>
>>> On Nov 25, 2013, at 11:53 AM, "Meredith, Karl" <karl.mered...@fmglobal.com> 
>>> wrote:
>>>
>>>> Digging a little deeper by running the code in the lldb debugger, I found 
>>>> that the stall occurs in a call to orte_init from ompi_mpi_init.c:
>>>> 356     /* Setup ORTE - note that we are an MPI process  */
>>>> 357     if (ORTE_SUCCESS != (ret = orte_init(NULL, NULL, ORTE_PROC_MPI))) {
>>>> 358         error = "ompi_mpi_init: orte_init failed";
>>>> 359         goto error;
>>>> 360     }
>>>>
>>>> The code never returns from orte_init.
>>>>
>>>> It gets stuck in orte_ess.init() called from orte_init.c:
>>>> 126     /* initialize the RTE for this environment */
>>>> 127     if (ORTE_SUCCESS != (ret = orte_ess.init())) {
>>>>
>>>> When I step through orte_ess.init() in the lldb debugger, I actually get 
>>>> some output from the code (there is no output when running outside the 
>>>> debugger):
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems.  This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> ompi_mpi_init: orte_init failed
>>>> --> Returned "Unable to start a daemon on the local node" (-128) instead 
>>>> of "Success" (0)
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Nov 25, 2013, at 9:20 AM, Meredith, Karl <karl.mered...@fmglobal.com> 
>>>> wrote:
>>>>
>>>>> Here’s the back trace from lldb:
>>>>> $ )ps -elf | grep  hello
>>>>> 1042653210 45231 45230     4006   0  31  0  2448976   2148 -      S+      
>>>>>             0 ttys002    0:00.01 hello_cxx         9:07AM
>>>>> 1042653210 45232 45230     4006   0  31  0  2457168   2156 -      S+      
>>>>>             0 ttys002    0:00.04 hello_cxx         9:07AM
>>>>>
>>>>> (meredithk@meredithk-mac)-(09:15 AM Mon Nov 
>>>>> 25)-(~/tools/openmpi-1.6.5/examples)
>>>>> $ )lldb -p 45231
>>>>> Attaching to process with:
>>>>> process attach -p 45231
>>>>> Process 45231 stopped
>>>>> Executable module set to 
>>>>> "/Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx".
>>>>> Architecture set to: x86_64-apple-macosx.
>>>>> (lldb) bt
>>>>> * thread #1: tid = 0x168535, 0x00007fff8c1859aa 
>>>>> libsystem_kernel.dylib`select$DARWIN_EXTSN + 10, queue = 
>>>>> 'com.apple.main-thread', stop reason = signal SIGSTOP
>>>>> frame #0: 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 
>>>>> 10
>>>>> frame #1: 0x0000000106b73ea0 
>>>>> libmpi.1.dylib`select_dispatch(base=0x00007f84c3c0b430, 
>>>>> arg=0x00007f84c3c0b3e0, tv=0x00007fff5924ca70) + 80 at select.c:174
>>>>> frame #2: 0x0000000106b3eb0f 
>>>>> libmpi.1.dylib`opal_event_base_loop(base=0x00007f84c3c0b430, flags=5) + 
>>>>> 415 at event.c:838
>>>>>
>>>>> Both processes are in this state.
>>>>>
>>>>> Here’s the output from otool -L ./hello_cxx:
>>>>>
>>>>> $ )otool -L ./hello_cxx
>>>>> ./hello_cxx:
>>>>>   /Users/meredithk/tools/openmpi/lib/libmpi_cxx.1.dylib (compatibility 
>>>>> version 2.0.0, current version 2.2.0)
>>>>>   /Users/meredithk/tools/openmpi/lib/libmpi.1.dylib (compatibility 
>>>>> version 2.0.0, current version 2.8.0)
>>>>>   /opt/local/lib/libgcc/libstdc++.6.dylib (compatibility version 7.0.0, 
>>>>> current version 7.18.0)
>>>>>   /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
>>>>> version 1197.1.1)
>>>>>   /opt/local/lib/libgcc/libgcc_s.1.dylib (compatibility version 1.0.0, 
>>>>> current version 1.0.0)
>>>>>
>>>>>
>>>>> On Nov 25, 2013, at 9:14 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> Mac OS X 10.9 dropped support for gdb. Please report the output of lldb 
>>>>>> instead.
>>>>>>
>>>>>> Also, can you run “otool -L ./hello_cxx” and report the output?
>>>>>>
>>>>>> Thanks,
>>>>>> George.
>>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 15:09 , Meredith, Karl <karl.mered...@fmglobal.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> I do have DYLD_LIBRARY_PATH set to the same paths as LD_LIBRARY_PATH.  
>>>>>>> This does not resolve the problem.  The code still hangs on MPI::Init().
>>>>>>>
>>>>>>> Another thing I tried is I recompiled openmpi with the debug flags 
>>>>>>> activated:
>>>>>>> ./configure --prefix=$HOME/tools/openmpi --enable-debug
>>>>>>> make
>>>>>>> make install
>>>>>>>
>>>>>>> Then, I attached to the running process using gdb.  I tried to get a 
>>>>>>> backtrace and see where it was hanging, but all I got was this:
>>>>>>> Attaching to process 45231
>>>>>>> Reading symbols from 
>>>>>>> /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx...Reading 
>>>>>>> symbols from 
>>>>>>> /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx.dSYM/Contents/Resources/DWARF/hello_cxx...done.
>>>>>>> done.
>>>>>>> 0x00007fff8c1859aa in ?? ()
>>>>>>> (gdb) bt
>>>>>>> #0  0x00007fff8c1859aa in ?? ()
>>>>>>> #1  0x0000000106b73ea0 in ?? ()
>>>>>>> #2  0x706d6e65706f2f2f in ?? ()
>>>>>>> #3  0x0000000000000001 in ?? ()
>>>>>>> #4  0x0000000000000000 in ?? ()
>>>>>>>
>>>>>>> This output from gdb was not terribly helpful to me.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Nov 25, 2013, at 8:30 AM, Hammond, Simon David (-EXP) 
>>>>>>> <sdha...@sandia.gov> wrote:
>>>>>>>
>>>>>>> We have occasionally had a problem like this when we set 
>>>>>>> LD_LIBRARY_PATH only. On OS X you may need to set DYLD_LIBRARY_PATH 
>>>>>>> instead (set it to the same lib directory).
>>>>>>>
>>>>>>> Can you try that and see if it resolves the problem?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Si Hammond
>>>>>>> Sandia National Laboratories
>>>>>>> Remote Connection
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Meredith, Karl 
>>>>>>> [karl.mered...@fmglobal.com]
>>>>>>> Sent: Monday, November 25, 2013 06:25 AM Mountain Standard Time
>>>>>>> To: Open MPI Users
>>>>>>> Subject: [EXTERNAL] Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)
>>>>>>>
>>>>>>>
>>>>>>> I do have these two environment variables set:
>>>>>>>
>>>>>>> LD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib
>>>>>>> PATH=/Users/meredithk/tools/openmpi/bin
>>>>>>>
>>>>>>> Running mpirun seems to work fine with a simple command, like hostname:
>>>>>>>
>>>>>>> $ )mpirun -n 2 hostname
>>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>>> meredithk-mac.corp.fmglobal.com
>>>>>>>
>>>>>>> I am trying to run the simple hello_cxx example from the openmpi 
>>>>>>> distribution, compiled as such:
>>>>>>> mpic++ -g hello_cxx.cc -o hello_cxx
>>>>>>>
>>>>>>> It compiles fine, without warning or error.  However, when I go to run 
>>>>>>> the example, it stalls on the MPI::Init() command:
>>>>>>> mpirun -np 1 hello_cxx
>>>>>>> It never errors out or crashes.  It simply hangs.
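When it hangs like this, a backtrace from each stuck rank is the quickest way to see where; a small sketch using lldb (backtrace_ranks is a hypothetical helper, and hello_cxx is the example binary from this thread):

```shell
# Attach lldb briefly to every process matching a name, print a
# backtrace, and detach. backtrace_ranks is a hypothetical helper;
# "hello_cxx" is the example binary name from this thread.
backtrace_ranks() {
    for pid in $(pgrep -f "$1" || true); do
        lldb -p "$pid" --batch -o bt -o detach
    done
}

backtrace_ranks hello_cxx
```

If no matching processes are running, the helper is simply a no-op.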
>>>>>>>
>>>>>>> I am using the same mpic++ and mpirun version:
>>>>>>> $ )which mpirun
>>>>>>> /Users/meredithk/tools/openmpi/bin/mpirun
>>>>>>>
>>>>>>> $ )which mpic++
>>>>>>> /Users/meredithk/tools/openmpi/bin/mpic++
>>>>>>>
>>>>>>> Not quite sure what else to check.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Nov 23, 2013, at 5:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Strange - I run on Mavericks now without problem. Can you run "mpirun 
>>>>>>>> -n 1 hostname"?
>>>>>>>>
>>>>>>>> You also might want to check your PATH and LD_LIBRARY_PATH to ensure 
>>>>>>>> you have the prefix where you installed OMPI 1.6.5 at the front. Mac 
>>>>>>>> distributes a very old version of OMPI with its software and you don't 
>>>>>>>> want to pick it up by mistake.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 22, 2013, at 1:45 PM, Meredith, Karl 
>>>>>>>> <karl.mered...@fmglobal.com> wrote:
>>>>>>>>
>>>>>>>>> I recently upgraded my 2013 MacBook Pro (Retina display) from 10.8 to 
>>>>>>>>> 10.9.  I downloaded and installed openmpi-1.6.5 and compiled it with 
>>>>>>>>> gcc 4.8 (gcc installed from MacPorts).
>>>>>>>>> openmpi compiled and installed without error.
>>>>>>>>>
>>>>>>>>> However, when I try to run any of the example test cases, the code 
>>>>>>>>> gets stuck inside the first MPI::Init() call and never returns.
>>>>>>>>>
>>>>>>>>> Any thoughts on what might be going wrong?
>>>>>>>>>
>>>>>>>>> The same install on OS 10.8 works fine and the example test cases run 
>>>>>>>>> without error.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
