Might be worth trying with --mca btl_openib_cpc_include udcm and seeing if that
works.
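
For instance, something along these lines (just a sketch, reusing the script name and
process count from the command below):
mpirun --mca btl_openib_cpc_include udcm --mca pml ob1 -np 5 myscript.sh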

-Nathan

On Aug 23, 2016, at 02:41 AM, "Juan A. Cordero Varelaq" 
<bioinformatica-i...@us.es> wrote:

Hi Gilles,
If I run it like this:
mpirun --mca btl ^openib,usnic --mca pml ob1 --mca btl_sm_use_knem 0 -np 5 myscript.sh
it works fine. Am I using InfiniBand this way? However, if I remove openib,
I get the "librdmacm: Fatal: unable to open RDMA device" error. So what would be
the best way to run it?

Thanks a lot


On 23/08/16 10:17, Gilles Gouaillardet wrote:
Juan,

if you want to use InfiniBand with the openib btl (I am assuming MXM is not
available on your platform, and you do not want
to use InfiniBand via usnic/libfabric), you can
mpirun --mca pml ob1 --mca btl ^usnic ...
/* I am pretty sure mpirun ... would do the trick too */
If you get the same error message you posted earlier, then there is an issue
with your InfiniBand fabric, and your sysadmin will hopefully solve that.
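
For example, a complete invocation might look like this (just a sketch, reusing the
script name and process count from earlier in the thread):
mpirun --mca pml ob1 --mca btl ^usnic -np 5 myscript.sh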

Cheers,

Gilles

On 8/23/2016 5:02 PM, Juan A. Cordero Varelaq wrote:
Hi Gilles,
so if I use the option --mca pml ob1, I am using InfiniBand and it will be as fast
as usual, right?

Thanks

On 22/08/16 14:22, Gilles Gouaillardet wrote:
Juan,

to keep things simple, --mca pml ob1 ensures you are not using MXM
(yet another way to use InfiniBand)

IPoIB is unlikely to be working on your system right now, so for inter-node
communications you will use TCP over the interconnect you have (GbE, or 10 GbE if you are lucky).
In terms of performance, GbE (Gigabit Ethernet) is an order of magnitude (or two)
slower than InfiniBand, in both bandwidth and latency.

if you still face issues with version mismatch, you can
mpirun --tag-output ldd a.out
in your script, and double check all nodes are using the correct libs
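
For example (a sketch only; I am assuming the actual MPI executable is the "mb"
binary that shows up in your backtraces, adjust the name to whatever your script launches):
mpirun --tag-output -np 5 ldd ./mb
and check that the libmpi.so line points at the same install prefix for every rank.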

Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> 
wrote:
Hi Gilles,
adding ,usnic made it work :) --mca pml ob1 would then not be needed.
Does it make MPI very slow if InfiniBand is disabled (and what does --mca pml
ob1 do)?

Regarding the version mismatch, everything seems to be right. When only one 
version is loaded, I see the PATH and the LD_LIBRARY_PATH for only one version, 
and with strings, everything seems to reference the right version.

Thanks a lot for the quick answers!
On 22/08/16 13:09, Gilles Gouaillardet wrote:
Juan,

can you try to
mpirun --mca btl ^openib,usnic --mca pml ob1 ...

Note this simply disables native InfiniBand. From a performance point of view,
you should have your sysadmin fix the InfiniBand fabric.

About the version mismatch, please double-check your environment
(e.g. $PATH and $LD_LIBRARY_PATH); it is likely v2.0 is in your environment
when you are using v1.8, or the other way around.
Also, make sure orted is launched with the right environment.
If you are using ssh, then
ssh node env
should not contain references to the version you are not using.
(That typically occurs when the environment is set in the .bashrc, directly or
via modules.)

Last but not least, you can run
strings /.../bin/orted
strings /.../lib/libmpi.so
and check that they do not reference the wrong version
(that can happen if a library was built and then moved).
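
Putting those checks together, something like this (a sketch only; the 2.0.0 prefix
and the node name are guesses, adjust them to your actual install paths and hosts):
ssh node01 env | grep -i mpi
strings /opt/openmpi-2.0.0/bin/orted | grep 1.8.1
strings /opt/openmpi-2.0.0/lib/libmpi.so | grep 1.8.1
If either strings command prints anything, the 2.0.0 build still references the 1.8.1 install.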


Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> 
wrote:
Dear Ralph,
The existence of the two versions does not seem to be the source of the problems,
since they are in different locations. I uninstalled the most recent version
and tried again with no luck, getting the same warnings/errors. However, after a
deeper search I found a couple of hints, and executed this:
mpirun -mca btl ^openib -mca btl_sm_use_knem 0 -np 5 myscript
and got only a fraction of the previous errors (before, I had run the same command
but without those extra arguments), related to OpenFabrics:
Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully.  This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.

  Hostname:    MYMACHINE
  Device name: mlx4_0
  Errror (22): Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[52062,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: MYMACHINE

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Do you have any guess as to why this could happen?

Thanks a lot
On 19/08/16 17:11, r...@open-mpi.org wrote:
The rdma error sounds like something isn’t right with your machine’s Infiniband 
installation.

The cross-version problem sounds like you installed both OMPI versions into the 
same location - did you do that?? If so, then that might be the root cause of 
both problems. You need to install them in totally different locations. Then 
you need to _prefix_ your PATH and LD_LIBRARY_PATH with the location of the 
version you want to use.
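
A minimal sketch of what that might look like, assuming the installs live under
/opt/openmpi-1.8.1 (which shows up in your backtraces) and, hypothetically, /opt/openmpi-2.0.0:
export PATH=/opt/openmpi-2.0.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-2.0.0/lib:$LD_LIBRARY_PATH
which mpirun    # should now point at /opt/openmpi-2.0.0/bin/mpirun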

HTH
Ralph

On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq 
<bioinformatica-i...@us.es> wrote:

Dear users,

I am totally stuck using Open MPI. I have two versions on my machine, 1.8.1 and
2.0.0, and neither of them works. When I use the 1.8.1 mpirun, I get the
following error:

librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
--------------------------------------------------------------------------
Open MPI failed to open the /dev/knem device due to a local error.
Please check with your system administrator to get the problem fixed,
or set the btl_sm_use_knem MCA parameter to 0 to run without /dev/knem
support.

  Local host: MYMACHINE
  Errno:      2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully.  This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.

  Hostname:    MYMACHINE
  Device name: mlx4_0
  Errror (22): Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[60527,1],4]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: MYMACHINE

When I use the 2.0.0 version, I get something strange; it seems openmpi-2.0.0
is looking for openmpi-1.8.1 libraries:

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MYMACHINE
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[MYMACHINE:126820] *** Process received signal ***
[MYMACHINE:126820] Signal: Segmentation fault (11)
[MYMACHINE:126820] Signal code: Address not mapped (1)
[MYMACHINE:126820] Failing at address: 0x1c0
[MYMACHINE:126820] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f39b2ec4cb0]
[MYMACHINE:126820] [ 1] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f39b23e7430]
[MYMACHINE:126820] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f39b2676a57]
[MYMACHINE:126820] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f39b2676fb7]
[MYMACHINE:126820] [ 4] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f39b267718f]
[MYMACHINE:126820] [ 5] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f39b23c5f2a]
[MYMACHINE:126820] [ 6] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f39b23c70c3]
[MYMACHINE:126820] [ 7] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f39b23c8278]
[MYMACHINE:126820] [ 8] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f39b23d1e6c]
[MYMACHINE:126820] [ 9] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f39b2666e21]
[MYMACHINE:126820] [10] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f39b3115c92]
[MYMACHINE:126820] [11] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f39b31387bb]
[MYMACHINE:126820] [12] mb[0x402024]
[MYMACHINE:126820] [13] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f39b2b187ed]
[MYMACHINE:126820] [14] mb[0x402111]
[MYMACHINE:126820] *** End of error message ***
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MYMACHINE
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[MYMACHINE:126821] *** Process received signal ***
[MYMACHINE:126821] Signal: Segmentation fault (11)
[MYMACHINE:126821] Signal code: Address not mapped (1)
[MYMACHINE:126821] Failing at address: 0x1c0
[MYMACHINE:126821] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fed834bbcb0]
[MYMACHINE:126821] [ 1] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fed829de430]
[MYMACHINE:126821] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fed82c6da57]
[MYMACHINE:126821] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fed82c6dfb7]
[MYMACHINE:126821] [ 4] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fed82c6e18f]
[MYMACHINE:126821] [ 5] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fed829bcf2a]
[MYMACHINE:126821] [ 6] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fed829be0c3]
[MYMACHINE:126821] [ 7] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fed829bf278]
[MYMACHINE:126821] [ 8] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fed829c8e6c]
[MYMACHINE:126821] [ 9] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fed82c5de21]
[MYMACHINE:126821] [10] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fed8370cc92]
[MYMACHINE:126821] [11] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fed8372f7bb]
[MYMACHINE:126821] [12] mb[0x402024]
[MYMACHINE:126821] [13] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fed8310f7ed]
[MYMACHINE:126821] [14] mb[0x402111]
[MYMACHINE:126821] *** End of error message ***
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MYMACHINE
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[MYMACHINE:126822] *** Process received signal ***
[MYMACHINE:126822] Signal: Segmentation fault (11)
[MYMACHINE:126822] Signal code: Address not mapped (1)
[MYMACHINE:126822] Failing at address: 0x1c0
[MYMACHINE:126822] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f0174bc0cb0]
[MYMACHINE:126822] [ 1] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f01740e3430]
[MYMACHINE:126822] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f0174372a57]
[MYMACHINE:126822] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f0174372fb7]
[MYMACHINE:126822] [ 4] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f017437318f]
[MYMACHINE:126822] [ 5] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f01740c1f2a]
[MYMACHINE:126822] [ 6] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f01740c30c3]
[MYMACHINE:126822] [ 7] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f01740c4278]
[MYMACHINE:126822] [ 8] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f01740cde6c]
[MYMACHINE:126822] [ 9] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f0174362e21]
[MYMACHINE:126822] [10] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f0174e11c92]
[MYMACHINE:126822] [11] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f0174e347bb]
[MYMACHINE:126822] [12] mb[0x402024]
[MYMACHINE:126822] [13] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f01748147ed]
[MYMACHINE:126822] [14] mb[0x402111]
[MYMACHINE:126822] *** End of error message ***
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MYMACHINE
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[MYMACHINE:126823] *** Process received signal ***
[MYMACHINE:126823] Signal: Segmentation fault (11)
[MYMACHINE:126823] Signal code: Address not mapped (1)
[MYMACHINE:126823] Failing at address: 0x1c0
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MYMACHINE
Framework: ess
Component: pmi
--------------------------------------------------------------------------
[MYMACHINE:126823] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcd9cb58cb0]
[MYMACHINE:126823] [ 1] [MYMACHINE:126824] *** Process received signal ***
[MYMACHINE:126824] Signal: Segmentation fault (11)
[MYMACHINE:126824] Signal code: Address not mapped (1)
[MYMACHINE:126824] Failing at address: 0x1c0
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fcd9c07b430]
[MYMACHINE:126823] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fcd9c30aa57]
[MYMACHINE:126823] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fcd9c30afb7]
[MYMACHINE:126823] [ 4] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fcd9c30b18f]
[MYMACHINE:126823] [ 5] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fcd9c059f2a]
[MYMACHINE:126823] [MYMACHINE:126824] [ 0] [ 6] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f2f0c611cb0]
[MYMACHINE:126824] [ 1] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fcd9c05b0c3]
[MYMACHINE:126823] [ 7] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f2f0bb34430]
[MYMACHINE:126824] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f2f0bdc3a57]
[MYMACHINE:126824] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f2f0bdc3fb7]
[MYMACHINE:126824] [ 4] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f2f0bdc418f]
[MYMACHINE:126824] [ 5] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fcd9c05c278]
[MYMACHINE:126823] [ 8] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fcd9c065e6c]
[MYMACHINE:126823] [ 9] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fcd9c2fae21]
[MYMACHINE:126823] [10] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fcd9cda9c92]
[MYMACHINE:126823] [11] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f2f0bb12f2a]
[MYMACHINE:126824] [ 6] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f2f0bb140c3]
[MYMACHINE:126824] [ 7] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fcd9cdcc7bb]
[MYMACHINE:126823] [12] mb[0x402024]
[MYMACHINE:126823] [13] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f2f0bb15278]
[MYMACHINE:126824] [ 8] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f2f0bb1ee6c]
[MYMACHINE:126824] [ 9] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f2f0bdb3e21]
[MYMACHINE:126824] [10] 
/opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f2f0c862c92]
[MYMACHINE:126824] [11] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcd9c7ac7ed]
[MYMACHINE:126823] [14] mb[0x402111]
[MYMACHINE:126823] *** End of error message ***
/opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f2f0c8857bb]
[MYMACHINE:126824] [12] mb[0x402024]
[MYMACHINE:126824] [13] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f2f0c2657ed]
[MYMACHINE:126824] [14] mb[0x402111]
[MYMACHINE:126824] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node MYMACHINE exited on 
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I am running my script with mpirun on a single node of an SGE cluster.

I would be very grateful if somebody could give me some hints to solve this 
issue.

Thanks a lot in advance