On Oct 9, 2007, at 3:50 PM, Dirk Eddelbuettel wrote:

edd@ron:~$ orterun -n 2 --mca mca_component_show_load_errors 1 r -e 'library(Rmpi); print(mpi.comm.rank(0))'
[ron:18360] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)
[ron:18361] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)

Truly odd. Looking in the code, this error message is displayed when lt_dlopen() of the component fails for some reason (lt_dlopen() comes from libltdl, the Libtool portable wrapper library around dlopen() and friends). We print out the error string that libltdl returns to us, and it's apparently "file not found". This *usually* means that a dependency of the DSO that we're trying to open wasn't found (not that the DSO itself wasn't found).
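
To see the raw loader error outside of Open MPI, one can try opening the component file directly. What follows is only a minimal sketch in Python, assuming a Linux-like system where ctypes.CDLL goes through dlopen(); it does not use libltdl itself, the helper name try_open is made up for illustration, and you would have to supply the real path to the osc pt2pt component on your system.

```python
import ctypes


def try_open(path):
    """Try to load a shared object.

    Returns None on success, otherwise the error text reported by the
    dynamic loader (on Linux this is the dlerror() string).
    """
    try:
        ctypes.CDLL(path)
        return None
    except OSError as exc:
        # The message usually names the file that could not be loaded,
        # which may be a dependency rather than the component itself.
        return str(exc)
```

Calling try_open() with the component's path and printing the result should show which file the loader is complaining about. Note that ctypes resolves symbols immediately, so an "undefined symbol" message may only mean that the main Open MPI libraries were not loaded first; a missing dependency is typically reported by the name of the library that could not be opened.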

Your list of ldd dependencies didn't show anything odd, so I can't imagine why it would get a "file not found" kind of error.

An off-the-wall question: are you compiling / building Open MPI on one system and running it on another, where the dependencies are perhaps slightly different and therefore cause a failure? This is a pretty weak question to ask, because I assume that *many* OMPI components would fail to open if this were the case, but I thought I'd ask anyway...
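
One way to test the build-versus-run question is to repeat the ldd check on the machine where the job actually runs. The sketch below, in Python, assumes that ldd is available there and that it marks unresolved libraries with "not found"; the function names are hypothetical and the component path is something you would fill in yourself.

```python
import subprocess


def missing_dependencies(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "not found" in line:
            # Such lines look like: "libfoo.so.1 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_component(path):
    """Run ldd on one shared object and list its unresolved dependencies."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_dependencies(result.stdout)
```

An empty list from check_component() on the run-time machine would suggest that the dependencies resolve there as well, at least under the environment in which the check was run.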

Another whacky question: does the error happen when you start your test program manually (without mpirun)?

Does this happen for all MPI programs (potentially only those that use the MPI-2 one-sided stuff), or just your R environment?

At this point, all I can suggest is firing up a debugger and stepping through the code in lt_dlopenext() to see exactly why it is failing. Since we call lt_dlopenext() many, many times (and you're only interested in when we call it for the osc pt2pt component), I'd suggest something like the following:

- it is easier if the problem also occurs when you run the program serially (without mpirun) -- just run it in a debugger
- break in ompi_osc_base_open
- set a breakpoint for lt_dlopenext
- continue until you hit the lt_dlopenext function
- print the filename; it will be either the pt2pt or rdma osc components
- step through the lt_dlopenext function and see if you can track down the exact error
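
The steps above could be captured in a gdb command file so they are easy to repeat. The following is a rough sketch in Python that just writes such a file; it assumes gdb as the debugger, assumes libltdl was built with debug symbols and that the argument of lt_dlopenext is named filename, and the file name osc_debug.gdb is arbitrary.

```python
def gdb_commands():
    """Return gdb commands mirroring the debugging steps listed above."""
    return [
        # The MPI libraries are loaded after startup, so let gdb defer
        # breakpoints on symbols it cannot resolve yet.
        "set breakpoint pending on",
        "break ompi_osc_base_open",
        "run",
        # Only start watching lt_dlopenext once the osc framework opens.
        "break lt_dlopenext",
        "continue",
        # Assumption: the parameter is called 'filename' and has debug info.
        "print filename",
    ]


def write_command_file(path="osc_debug.gdb"):
    """Write the commands to a file usable with 'gdb -x'."""
    with open(path, "w") as handle:
        handle.write("\n".join(gdb_commands()) + "\n")
    return path
```

After write_command_file(), something like "gdb -x osc_debug.gdb --args" followed by your program and its arguments should stop inside lt_dlopenext with the filename printed; from there you would step by hand to find the exact failure.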

Sorry I don't have a better suggestion than this...  :-\

--
Jeff Squyres
Cisco Systems

