Jeff Squyres <jsquy...@cisco.com> writes:

> On Oct 11, 2008, at 6:48 AM, Aleksej Saushev wrote:
>
>> The actual message states:
>>
>> [asau.local:25752] [NO-NAME] ORTE_ERROR_LOG: Not found in file
>> runtime/orte_init_stage1.c at line 182
>> --------------------------------------------------------------------------
>
> Hmm.  Even with all your output, I still don't see what could be
> causing this -- the oob rml plugin was compiled and installed
> just fine.  Do you see an oob rml line in the output of
> ompi_info?

$ ompi_info | grep oob
[asau.local:00985] mca: base: components_open: Looking for ras components
[asau.local:00985] mca: base: components_open: distilling ras components
[asau.local:00985] mca: base: components_open: accepting all ras components
[asau.local:00985] mca: base: components_open: opening ras components
[asau.local:00985] mca: base: components_open: found loaded component dash_host
[asau.local:00985] mca: base: components_open: component dash_host open function successful
[asau.local:00985] mca: base: components_open: found loaded component gridengine
[asau.local:00985] mca: base: components_open: component gridengine open function successful
[asau.local:00985] mca: base: components_open: found loaded component localhost
[asau.local:00985] mca: base: components_open: component localhost open function successful
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)

> Is there a chance that there's some dependent library of oob_rml
> that is available on your head/build node, but not available on
> your back-end nodes?  (that would be pretty odd, though)

Very unlikely. Unless you don't install it at "make install" time,
it is there. Host and target are the same (identical).
Any particular library (set of libraries) to check?

> Bummer -- it looks like we have a bug in the debugging output
> for when rml plugins are selected -- so I can't just give you
> an mpirun command line that will output some additional
> diagnostic information.  Do you mind getting your hands dirty
> in a little code?  If so, edit this file:
> orte/mca/rml/base/rml_base_select.c and change all instances of
>
>     opal_output_verbose(xxx, orte_rml_base.rml_output, ...)
> to
>     opal_output(orte_rml_base.rml_output, ...)
>
> And then compile/install that with (this is a shortcut; of
> course, you can do a top-level "make install" to install it,
> but it's a bit overkill for what we need for this bit):
>
> cd orte/rml
> make
> cd ../..
> make install-am
>
> Then run with:
>
> mpirun --mca rml_base_debug 100 ...
>
> And see what the output tells you.  When I do this with a
> successful run, my output looks like this:
>
> ----
> [5:38] svbu-mpi:~/mpi % mpirun -np 1 --mca rml_base_debug 100 hello
> [svbu-mpi.cisco.com:02087] orte_rml_base_select: initializing rml component oob
> [svbu-mpi030:10587] orte_rml_base_select: initializing rml component oob
> stdout: Hello, world!  I am 0 of 1 (svbu-mpi030)
> stderr: Hello, world!  I am 0 of 1 (svbu-mpi030)
> [5:39] svbu-mpi:~/mpi %
> -----
>
> (my "hello" program simply prints out the hello world message on
> both stdout/stderr)

$ mpirun --mca rml_base_debug 100 -np 2 skosfile
[asau.local:09060] mca: base: components_open: Looking for rml components
[asau.local:09060] mca: base: components_open: distilling rml components
[asau.local:09060] mca: base: components_open: accepting all rml components
[asau.local:09060] mca: base: components_open: opening rml components
[asau.local:09060] mca: base: components_open: found loaded component oob
[asau.local:09060] mca: base: components_open: component oob open function successful
[asau.local:09060] orte_rml_base_select: initializing rml component oob
[asau.local:09060] orte_rml_base_select: init returned failure
[asau.local:09060] orte_rml_base_select: module oob unloaded
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.
This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

>> Additional information.
>>
>> pkgsrc framework does work correctly here, it even catches or
>> overrides some incompatibilities, when building OpenMPI from the
>> same tarball without pkgsrc framework, I get this:
>>
>> libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
>> -I../../../../orte/include -I../../../../ompi/include -I../../../..
>> -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread
>> -MT backtrace_none_component.lo -MD -MP
>> -MF .deps/backtrace_none_component.Tpo -c backtrace_none_component.c
>> -fPIC -DPIC -o .libs/backtrace_none_component.o
>> backtrace_none_component.c:41: error: expected expression
>> before ',' token
>> backtrace_none_component.c:51: warning: braces around scalar
>> initializer
>> backtrace_none_component.c:51: warning: (near initialization
>> for 'mca_backtrace_none_component
>> .backtracec_version.mca_component_release_version')
>
> That's also odd.  I don't see any problems in the source code in
> this particular area.  What is the output of this area of the
> code when compiled with -E?  It should show some obvious
> problem.

I'll check this a bit later, if you don't object.


-- 
HE CE3OH...