[OMPI users] mca_coll_hcoll.so: undefined symbol hcoll_group_destroy_notify

2014-04-08 Thread Anthony Alba
Hi all,

Ran into a problem running the openshmem examples/ using OpenMPI 1.8
compiled with
--with-knem=/opt/knem-1.1.90mlnx2 --with-hcoll=/opt/mellanox/hcoll
--with-mxm=/opt/mellanox/mxm
--with-fca=/opt/mellanox/fca


lib/openmpi/mca_coll_hcoll.so has undefined symbol
hcoll_group_destroy_notify

I can't find this symbol anywhere. The Mellanox libraries
/opt/mellanox/hcoll/lib/*.so don't export
this symbol. hcoll is v2.0.472.1.

It is used in ompi/mca/coll/hcoll/coll_hcoll_module.c:

int hcoll_comm_attr_del_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    mca_coll_hcoll_module_t *hcoll_module;
    hcoll_module = (mca_coll_hcoll_module_t *) attr_val;

    hcoll_group_destroy_notify(hcoll_module->hcoll_context);
    return OMPI_SUCCESS;
}


Re: [OMPI users] mca_coll_hcoll.so: undefined symbol hcoll_group_destroy_notify

2014-04-08 Thread Anthony Alba
Joshua,
I am running MOFED 2.1-1.0.6 and self-compiled openmpi-1.8 using
--with-hcoll.

The symbol is referenced in the 1.8 source but not exported by the MOFED
libraries under /opt/mellanox/hcoll/lib*.
On 8 Apr 2014 21:47, "Joshua Ladd" wrote:

> Hi,
>
> What MOFED version are you running?
>
> Best,
>
> Josh
> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Anthony Alba
> *Sent:* Tuesday, April 08, 2014 4:53 AM
> *To:* us...@open-mpi.org
> *Subject:* [OMPI users] mca_coll_hcoll.so: undefined symbol hcoll_group_destroy_notify
>
> [original message quoted in full; trimmed]


Re: [OMPI users] mca_coll_hcoll.so: undefined symbol hcoll_group_destroy_notify

2014-04-08 Thread Anthony Alba
This is a change from OMPI 1.7.4 to 1.7.5/1.8: the symbol is not used in
MOFED 2.1-1.0.6's openmpi-1.7.4 (I rebuilt the MOFED RPM to enable hcoll).

- Anthony



[OMPI users] [SOLVED] Re: mca_coll_hcoll.so: undefined symbol hcoll_group_destroy_notify

2014-04-08 Thread Anthony Alba
The devel list has responded that this requires a later drop of hcoll than
the one shipped in MOFED 2.1-1.0.6.

- Anthony
On Apr 9, 2014 9:49 AM, "Anthony Alba" wrote:

> [previous message quoted; trimmed]


[OMPI users] Troubleshooting mpirun with tree spawn hang

2014-04-11 Thread Anthony Alba
Is there a way to troubleshoot
plm_rsh_no_tree_spawn=true hang?

I have a set of passwordless-ssh nodes; each node can ssh into any other,
i.e.,

for h1 in A B C D; do for h2 in A B C D; do ssh $h1 ssh $h2 hostname; done; done

works perfectly.

Generally tree spawn works; however, there is one host where launching
mpirun with tree spawn hangs as soon as there are six or more hosts
(with the launch node also in the host list). If the launcher is not in the
host list, the hang happens with five hosts.


- Anthony


Re: [OMPI users] Troubleshooting mpirun with tree spawn hang

2014-04-11 Thread Anthony Alba
Oops, I meant = false.

Thanks for the tip, it turns out the fault lay in a specific node that
required oob_tcp_if_include to be set.

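For reference, a sketch of how that parameter can be set (the interface name, subnet, and host names here are hypothetical; use whichever interface is routable between all nodes):

```shell
# Restrict OOB/TCP wire-up to one interface on the command line...
mpirun --mca oob_tcp_if_include eth0 -np 4 --host A,B,C,D hostname

# ...or by CIDR subnet, or persistently in the MCA parameter file:
echo "oob_tcp_if_include = 192.168.1.0/24" >> ~/.openmpi/mca-params.conf
```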
On Friday, 11 April 2014, Ralph Castain wrote:

> I'm a little confused - the "no_tree_spawn=true" option means that we are
> *not* using tree spawn, and so mpirun is directly launching each daemon
> onto its node. Thus, this requires that the host mpirun is on be able to
> ssh to every other host in the allocation.
>
> You can debug the rsh launcher by setting "-mca plm_base_verbose 5
> --debug-daemons" on the cmd line.
>
>
> On Apr 10, 2014, at 9:50 PM, Anthony Alba wrote:
>
> > [original message quoted; trimmed]
>
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] OpenMPI PMI2 with SLURM 14.03 not working

2014-04-11 Thread Anthony Alba
Not sure if this is a SLURM or OMPI issue so please bear with the
cross-posting...

The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2.
https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue

I have built both 1.7.5/1.8 against slurm 14.03/pmi2.

When I launch openmpi/examples/hello_c on a single node allocation:

srun -N 1 --mpi=pmi2 hello_c
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with
slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


With --slurmd-debug=9 (I'm not sure what "ip 111.110.61.48 sd 14" below
means; is that an IP address? It is not the IP address of any node in my
partition):

slurmstepd: mpi/pmi2: client_resp_send: 26cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14 cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: _tree_listen_read: accepted tree connection: ip 111.110.61.48
sd 14
slurmstepd: _handle_accept_rank: going to read() client rank
slurmstepd: _handle_accept_rank: got client rank 1478164480 on fd 14
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with
slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Launching with salloc/sbatch works.
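For completeness, the working launch path creates the allocation first and lets mpirun, not srun, start the tasks; a minimal sketch, assuming hello_c is on PATH:

```shell
# Allocate one node, then launch with mpirun inside the allocation:
salloc -N 1 mpirun hello_c

# Or as a batch job:
sbatch -N 1 --wrap="mpirun hello_c"
```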

- Anthony


Re: [OMPI users] OpenMPI PMI2 with SLURM 14.03 not working [SOLVED]

2014-04-11 Thread Anthony Alba
Answered on the slurm-devel list: it is a bug in SLURM 14.03.

The fix is already in HEAD and will also be in 14.03.1.

https://groups.google.com/forum/#!topic/slurm-devel/1ctPkEn7TFI

- Anthony