On Jan 30, 2009, at 4:54 PM, Dirk Eddelbuettel wrote:

| > where things end in the loop over opal_list() elements. I still see a
| > fprintf() statement just before
| >
| >   if (MCA_SUCCESS == component->mca_register_component_params()) {
| >
| > in the middle of the open_components function in the file
| > mca_base_components_open.c
|
| Do you know if component is non-NULL and has a sensible value (i.e.,
| pointing to a valid component)?

Do not. Everything (in particular below /etc/openmpi/) is at default values
with the sole exception of

# edd 18 Dec 2008
mca_component_show_load_errors = 0

Could that kill it? [ Goes off and tests... ] No, still dies with segfault
in open_components.

FWIW: mca_component_show_load_errors should only affect conditional output of some warning messages.

| Does ompi_info work?  (ompi_info uses this exact same code to find/
| open components)  If ompi_info fails, you should be able to attach a
| debugger to that, since it's a serial and [relatively] straightforward
| app.

Yes, ompi_info happily runs and returns around 111 lines. It seems to loop
over around 25 mca components.

Open MPI is otherwise healthy and happy. It's just that Rmpi does not get along with Open MPI 1.3... but this happens to be my personal use-case :-/

Quite puzzling. This portion of the code has already successfully opened the components and is looping over a list of the components that were found. It *sounds* like that list has somehow gotten corrupted.

Is there any way you can check that the values of component and component->mca_register_component_params are non-NULL / valid?

FWIW, component should be a pointer to the struct that we use to represent plugins; it's a member of the list element from the list of found components. Here's some code from right above the problematic line:

    for (item = opal_list_get_first(src);
         opal_list_get_end(src) != item;
         item = opal_list_get_next(item)) {
        cli = (mca_base_component_list_item_t *) item;
        component = cli->cli_component;

So you might want to examine cli as well and ensure that it has sensible values (the casting trick that we do is fairly common in the OMPI code base -- the list item is the first data member of the mca_base_component_list_item_t, so we can cast to/from it as required).
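
To make the casting idiom and the suggested sanity checks concrete, here is a minimal, self-contained sketch. Every demo_* type and function below is a hypothetical stand-in invented for illustration; none of them are the real Open MPI definitions, and the real list is walked with the opal_list_* calls shown above rather than a bare next pointer. It only shows the general shape: cast the list item to the containing struct, then check each pointer before calling through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a generic list item (NOT the real OPAL type). */
typedef struct demo_list_item {
    struct demo_list_item *next;
} demo_list_item_t;

/* Hypothetical stand-in for a component with a registration hook. */
typedef struct demo_component {
    const char *name;
    int (*register_params)(void);   /* may legitimately be NULL */
} demo_component_t;

/* The list item is the FIRST member, so a pointer to the wrapper and a
 * pointer to its embedded list item are interchangeable via a cast. */
typedef struct demo_component_list_item {
    demo_list_item_t super;
    const demo_component_t *component;
} demo_component_list_item_t;

static int demo_register_ok(void) { return 0; }

/* Walk the list, report suspicious entries on stderr, and return the
 * number of entries whose registration hook ran and returned 0. */
static int demo_count_valid(demo_list_item_t *head)
{
    int valid = 0;
    for (demo_list_item_t *item = head; NULL != item; item = item->next) {
        demo_component_list_item_t *cli = (demo_component_list_item_t *) item;
        const demo_component_t *component = cli->component;

        if (NULL == component) {
            fprintf(stderr, "cli %p: component is NULL\n", (void *) cli);
            continue;
        }
        if (NULL == component->register_params) {
            fprintf(stderr, "component %s: no register function\n",
                    component->name);
            continue;
        }
        if (0 == component->register_params()) {
            ++valid;
        }
    }
    return valid;
}

/* Build a three-element list: one good entry, one with a NULL hook,
 * one with a NULL component. */
int demo_run(void)
{
    static const demo_component_t good  = { "good",  demo_register_ok };
    static const demo_component_t no_fn = { "no_fn", NULL };

    demo_component_list_item_t c = { { NULL },     NULL   };
    demo_component_list_item_t b = { { &c.super }, &no_fn };
    demo_component_list_item_t a = { { &b.super }, &good  };

    /* The cast is only safe because the list item sits at offset 0. */
    assert(offsetof(demo_component_list_item_t, super) == 0);

    return demo_count_valid(&a.super);
}
```

Calling demo_run() from a main() walks the three entries and counts only the first as valid; the other two are reported on stderr instead of being dereferenced. Note that a NULL check only catches NULL: a corrupted but non-NULL pointer would still crash, which is why printing the pointer values (or looking at them in a debugger) is still worthwhile.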

--
Jeff Squyres
Cisco Systems
