On Oct 9, 2007, at 3:50 PM, Dirk Eddelbuettel wrote:
edd@ron:~$ orterun -n 2 --mca mca_component_show_load_errors 1 r -e
'library(Rmpi); print(mpi.comm.rank(0))'
[ron:18360] mca: base: component_find: unable to open osc pt2pt:
file not found (ignored)
[ron:18361] mca: base: component_find: unable to open osc pt2pt:
file not found (ignored)
Truly odd. Looking in the code, this error message is displayed when
lt_dlopen() of the component fails for some reason (lt_dlopen() comes
from libltdl, the Libtool portable wrapper library around dlopen()
and friends). We print out the error string that libltdl returns to
us, and it's apparently "file not found". This *usually* means that
a dependency of the DSO that we're trying to open wasn't found (not
that the DSO itself wasn't found).
Your list of ldd dependencies didn't show anything odd, so I can't
imagine why it would get a "file not found" kind of error.
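If it helps, one quick way to see the system loader's own error
message (which tends to be more specific than libltdl's "file not
found") is to dlopen() the component by hand. Here is a minimal Python
sketch, assuming a POSIX system where ctypes.CDLL ends up calling
dlopen(). try_load is a helper name made up for illustration, and the
default component path is only a guess; replace it with wherever your
installation actually puts mca_osc_pt2pt.so:

```python
import ctypes
import sys


def try_load(path):
    """Return None if the loader can open `path`, else its error text."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # The message text comes from the platform's dynamic loader.
        return str(exc)
    return None


if __name__ == "__main__":
    # Hypothetical default path (an assumption, not taken from your
    # system); pass the real component path as a command-line argument.
    default = "/usr/lib/openmpi/mca_osc_pt2pt.so"
    for path in sys.argv[1:] or [default]:
        err = try_load(path)
        print(path, "->", "loaded OK" if err is None else err)
```

Keep in mind that an "undefined symbol" error from this standalone
test may only mean that the component expects symbols from the main
Open MPI libraries, which are not loaded here; a message naming a
missing library would be the more telling result.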
An off-the-wall question: are you compiling / building Open MPI on
one system and running it on another, where perhaps the dependencies
are slightly different and therefore causing a failure? This is a
pretty weak question to ask, because I assume that *many* OMPI
components would fail to open if this were the case, but I thought
I'd ask anyway...
Another wacky question: does the error happen when you start your
test program manually (without mpirun)?
Does this happen for all MPI programs (potentially only those that
use the MPI-2 one-sided stuff), or just your R environment?
At this point, all I can suggest is firing up a debugger and stepping
through the code in lt_dlopenext() to see why exactly it is failing.
Since we call lt_dlopenext() many, many times (and you're only
interested in when we call it for the osc pt2pt component), I'd
suggest something like the following:
- it is easier if the problem also occurs when you run the program
serially (without mpirun) -- just run it in a debugger
- break in ompi_osc_base_open
- set a breakpoint for lt_dlopenext
- continue until you hit the lt_dlopenext function
- print the filename; it will be either the pt2pt or rdma osc component
- step through the lt_dlopenext function and see if you can track
down the exact error
Sorry I don't have a better suggestion than this... :-\
--
Jeff Squyres
Cisco Systems