Using the latest nightly snapshot (1.7.4) and only Apple compilers/tools (no
macports), I configure/build with the following:
./configure --prefix=/opt/trunk/apple-only-1.7.4 --enable-shared \
    --disable-static --enable-debug --disable-io-romio \
    --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
make all
make install
export PATH=/opt/trunk/apple-only-1.7.4/bin/:$PATH
export LD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=/opt/trunk/apple-only-1.7.4/lib:$DYLD_LIBRARY_PATH
cd examples
make all
mpirun -v -np 2 ./hello_cxx
Here’s the stack trace for one of the hanging processes:
(lldb) bt
* thread #1: tid = 0x57052, 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff8ade4e60 libsystem_c.dylib`nanosleep + 200
    frame #2: 0x0000000100be98e3 libopen-rte.6.dylib`orte_routed_base_register_sync(setup=true) + 2435 at routed_base_fns.c:344
    frame #3: 0x0000000100ecc3a7 mca_routed_binomial.so`init_routes(job=1305542657, ndat=0x0000000000000000) + 2759 at routed_binomial.c:708
    frame #4: 0x0000000100b9e84d libopen-rte.6.dylib`orte_ess_base_app_setup(db_restrict_local=true) + 2109 at ess_base_std_app.c:233
    frame #5: 0x0000000100e3a442 mca_ess_env.so`rte_init + 418 at ess_env_module.c:146
    frame #6: 0x0000000100b59cfe libopen-rte.6.dylib`orte_init(pargc=0x0000000000000000, pargv=0x0000000000000000, flags=32) + 718 at orte_init.c:158
    frame #7: 0x00000001008bd3c8 libmpi.1.dylib`ompi_mpi_init(argc=0, argv=0x0000000000000000, requested=0, provided=0x00007fff5f3cd370) + 616 at ompi_mpi_init.c:451
    frame #8: 0x000000010090b5c3 libmpi.1.dylib`MPI_Init(argc=0x0000000000000000, argv=0x0000000000000000) + 515 at init.c:86
    frame #9: 0x0000000100833a1d hello_cxx`MPI::Init() + 29 at functions_inln.h:128
    frame #10: 0x00000001008332ac hello_cxx`main(argc=1, argv=0x00007fff5f3cd550) + 44 at hello_cxx.cc:18
    frame #11: 0x00007fff8d5df5fd libdyld.dylib`start + 1
Karl
On Dec 2, 2013, at 2:33 PM, Jeff Squyres (jsquyres) <[email protected]> wrote:
> Ah -- sorry, I missed this mail before I replied to the other thread (OS X
> Mail threaded them separately somehow...).
>
> Sorry to ask you to dive deeper, but can you find out where in
> orte_ess.init() it's failing? orte_ess.init is actually a function pointer;
> it's a jump-off point into a dlopen'ed plugin.
>
>
> On Nov 25, 2013, at 11:53 AM, "Meredith, Karl" <[email protected]>
> wrote:
>
>> Digging a little deeper by running the code in the lldb debugger, I found
>> that the stall occurs in a call to orte_init from ompi_mpi_init.c:
>> 356 /* Setup ORTE - note that we are an MPI process */
>> 357 if (ORTE_SUCCESS != (ret = orte_init(NULL, NULL, ORTE_PROC_MPI))) {
>> 358 error = "ompi_mpi_init: orte_init failed";
>> 359 goto error;
>> 360 }
>>
>> The code never returns from orte_init.
>>
>> It gets stuck in orte_ess.init() called from orte_init.c:
>> 126 /* initialize the RTE for this environment */
>> 127 if (ORTE_SUCCESS != (ret = orte_ess.init())) {
>>
>> When I step through orte_ess.init() in the lldb debugger, I actually get
>> some output from the code (there is no output when running normally,
>> outside the debugger):
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> ompi_mpi_init: orte_init failed
>> --> Returned "Unable to start a daemon on the local node" (-128) instead of
>> "Success" (0)
>>
>>
>>
>> Karl
>>
>>
>>
>> On Nov 25, 2013, at 9:20 AM, Meredith, Karl <[email protected]>
>> wrote:
>>
>>> Here’s the back trace from lldb:
>>> $ )ps -elf | grep hello
>>> 1042653210 45231 45230 4006 0 31 0 2448976 2148 - S+ 0 ttys002 0:00.01 hello_cxx 9:07AM
>>> 1042653210 45232 45230 4006 0 31 0 2457168 2156 - S+ 0 ttys002 0:00.04 hello_cxx 9:07AM
>>>
>>> (meredithk@meredithk-mac)-(09:15 AM Mon Nov 25)-(~/tools/openmpi-1.6.5/examples)
>>> $ )lldb -p 45231
>>> Attaching to process with:
>>> process attach -p 45231
>>> Process 45231 stopped
>>> Executable module set to "/Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx".
>>> Architecture set to: x86_64-apple-macosx.
>>> (lldb) bt
>>> * thread #1: tid = 0x168535, 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>>> frame #0: 0x00007fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 10
>>> frame #1: 0x0000000106b73ea0 libmpi.1.dylib`select_dispatch(base=0x00007f84c3c0b430, arg=0x00007f84c3c0b3e0, tv=0x00007fff5924ca70) + 80 at select.c:174
>>> frame #2: 0x0000000106b3eb0f libmpi.1.dylib`opal_event_base_loop(base=0x00007f84c3c0b430, flags=5) + 415 at event.c:838
>>>
>>> Both processes are in this state.
>>>
>>> Here’s the output from otool -L ./hello_cxx:
>>>
>>> $ )otool -L ./hello_cxx
>>> ./hello_cxx:
>>>     /Users/meredithk/tools/openmpi/lib/libmpi_cxx.1.dylib (compatibility version 2.0.0, current version 2.2.0)
>>>     /Users/meredithk/tools/openmpi/lib/libmpi.1.dylib (compatibility version 2.0.0, current version 2.8.0)
>>>     /opt/local/lib/libgcc/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.18.0)
>>>     /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
>>>     /opt/local/lib/libgcc/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
>>>
>>>
>>> On Nov 25, 2013, at 9:14 AM, George Bosilca <[email protected]> wrote:
>>>
>>>> Mac OS X 10.9 dropped support for gdb. Please report the output of lldb
>>>> instead.
>>>>
>>>> Also, can you run “otool -L ./hello_cxx” and report the output.
>>>>
>>>> Thanks,
>>>> George.
>>>>
>>>>
>>>> On Nov 25, 2013, at 15:09 , Meredith, Karl <[email protected]>
>>>> wrote:
>>>>
>>>>> I do have DYLD_LIBRARY_PATH set to the same paths as LD_LIBRARY_PATH.
>>>>> This does not resolve the problem. The code still hangs on MPI::Init().
>>>>>
>>>>> Another thing I tried is I recompiled openmpi with the debug flags
>>>>> activated:
>>>>> ./configure --prefix=$HOME/tools/openmpi --enable-debug
>>>>> make
>>>>> make install
>>>>>
>>>>> Then, I attached to the running process using gdb. I tried to do a back
>>>>> trace to see where it was hanging, but all I got was this:
>>>>> Attaching to process 45231
>>>>> Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx...
>>>>> Reading symbols from /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx.dSYM/Contents/Resources/DWARF/hello_cxx...done.
>>>>> done.
>>>>> 0x00007fff8c1859aa in ?? ()
>>>>> (gdb) bt
>>>>> #0 0x00007fff8c1859aa in ?? ()
>>>>> #1 0x0000000106b73ea0 in ?? ()
>>>>> #2 0x706d6e65706f2f2f in ?? ()
>>>>> #3 0x0000000000000001 in ?? ()
>>>>> #4 0x0000000000000000 in ?? ()
>>>>>
>>>>> This output from gdb was not terribly helpful to me.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Nov 25, 2013, at 8:30 AM, Hammond, Simon David (-EXP)
>>>>> <[email protected]> wrote:
>>>>>
>>>>> We have occasionally had a problem like this when we set LD_LIBRARY_PATH
>>>>> only. On OSX you may need to set DYLD_LIBRARY_PATH instead (set it to
>>>>> the same lib directory).
>>>>>
>>>>> Can you try that and see if it resolves the problem?
>>>>>
>>>>>
>>>>>
>>>>> Si Hammond
>>>>> Sandia National Laboratories
>>>>> Remote Connection
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Meredith, Karl [[email protected]]
>>>>> Sent: Monday, November 25, 2013 06:25 AM Mountain Standard Time
>>>>> To: Open MPI Users
>>>>> Subject: [EXTERNAL] Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)
>>>>>
>>>>>
>>>>> I do have these two environment variables set:
>>>>>
>>>>> LD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib
>>>>> PATH=/Users/meredithk/tools/openmpi/bin
>>>>>
>>>>> Running mpirun seems to work fine with a simple command, like hostname:
>>>>>
>>>>> $ )mpirun -n 2 hostname
>>>>> meredithk-mac.corp.fmglobal.com
>>>>> meredithk-mac.corp.fmglobal.com
>>>>>
>>>>> I am trying to run the simple hello_cxx example from the openmpi
>>>>> distribution, compiled as such:
>>>>> mpic++ -g hello_cxx.cc -o hello_cxx
>>>>>
>>>>> It compiles fine, without warning or error. However, when I go to run
>>>>> the example, it stalls on the MPI::Init() command:
>>>>> mpirun -np 1 hello_cxx
>>>>> It never errors out or crashes. It simply hangs.
>>>>>
>>>>> I am using the same mpic++ and mpirun version:
>>>>> $ )which mpirun
>>>>> /Users/meredithk/tools/openmpi/bin/mpirun
>>>>>
>>>>> $ )which mpic++
>>>>> /Users/meredithk/tools/openmpi/bin/mpic++
>>>>>
>>>>> Not quite sure what else to check.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Nov 23, 2013, at 5:29 PM, Ralph Castain <[email protected]> wrote:
>>>>>
>>>>>> Strange - I run on Mavericks now without problem. Can you run "mpirun -n
>>>>>> 1 hostname"?
>>>>>>
>>>>>> You also might want to check your PATH and LD_LIBRARY_PATH to ensure you
>>>>>> have the prefix where you installed OMPI 1.6.5 at the front. Mac
>>>>>> distributes a very old version of OMPI with its software and you don't
>>>>>> want to pick it up by mistake.
>>>>>>
>>>>>>
>>>>>> On Nov 22, 2013, at 1:45 PM, Meredith, Karl <[email protected]> wrote:
>>>>>>
>>>>>>> I recently upgraded my 2013 Macbook Pro (Retina display) from 10.8 to
>>>>>>> 10.9. I downloaded and installed openmpi-1.6.5 and compiled it with
>>>>>>> gcc 4.8 (gcc installed from macports).
>>>>>>> openmpi compiled and installed without error.
>>>>>>>
>>>>>>> However, when I try to run any of the example test cases, the code gets
>>>>>>> stuck inside the first MPI::Init() call and never returns.
>>>>>>>
>>>>>>> Any thoughts on what might be going wrong?
>>>>>>>
>>>>>>> The same install on OS 10.8 works fine and the example test cases run
>>>>>>> without error.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> [email protected]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>