I changed my downloaded MOFED version to match the one installed on the node, and now the error goes away and it runs fine. But I still have a question: I get exactly the same performance in all three of the cases below:
1) mpirun --allow-run-as-root --mca mtl mxm -mca mtl_mxm_np 0 -x MXM_TLS=self,shm,rc,ud -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 2) mpirun --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 3) mpirun --allow-run-as-root --mca mtl ^mxm -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 Seems like it doesn't matter if I use mxm, not use mxm or use it with reliable connection (RC). How can I be sure I am indeed using mxm over infiniband? Thanks, Subhra. On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > /usr/bin/ofed_info > > So, the OFED on your system is not MellanoxOFED 2.4.x but smth else. > > try #rpm -qi libibverbs > > > On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar < > subhramazumd...@gmail.com> wrote: > >> Hi, >> >> where is the command ofed_info located? I searched from / but didn't find >> it. >> >> Subhra. >> >> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <mi...@dev.mellanox.co.il> >> wrote: >> >>> cool, progress! >>> >>> >>1429676565.124664] sys.c:719 MXM WARN Conflicting CPU >>> frequencies detected, using: 2601.00 >>> >>> means that cpu governor on your machine is not on "performance" mode >>> >>> >> MXM ERROR ibv_query_device() returned 38: Function not implemented >>> >>> indicates that ofed installed on your nodes is not indeed 2.4.-1.0.0 or >>> there is a mismatch between ofed kernel drivers version and ofed userspace >>> libraries version. >>> or you have multiple ofed libraries installed on your node and use >>> incorrect one. >>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0? >>> >>> >>> >>> >>> >>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar < >>> subhramazumd...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I compiled the openmpi that comes inside the mellanox hpcx package with >>>> mxm support instead of separately downloaded openmpi. I also used the >>>> environment as in the README so that no LD_PRELOAD (except our own library >>>> which is unrelated) is needed. Now it runs fine (no segfault) but we get >>>> same errors as before (saying initialization of MXM library failed). Is it >>>> using MXM successfully? >>>> >>>> [root@JARVICE >>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun >>>> --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost : -x >>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2 >>>> >>>> -------------------------------------------------------------------------- >>>> WARNING: a request was made to bind a process. While the system >>>> supports binding the process itself, at least one node does NOT >>>> support binding memory to the process location. >>>> >>>> Node: JARVICE >>>> >>>> This usually is due to not having the required NUMA support installed >>>> on the node. In some Linux distributions, the required support is >>>> contained in the libnumactl and libnumactl-devel packages. >>>> This is a warning only; your job will continue, though performance may >>>> be degraded. 
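(On the question above of how to be sure mxm is really being used: a minimal check, assuming the standard Open MPI 1.8 framework verbosity parameters and the ompi_info binary from the same build, is

    ompi_info | grep -i mxm
    mpirun --allow-run-as-root --mca pml cm --mca mtl mxm \
           --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
           -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2

The first command confirms the mxm MTL component was built into this Open MPI install; the verbose run shows which pml/mtl components actually win selection. The mxm MTL is only used when the "cm" pml is selected, so if the verbose output reports "ob1" as the chosen pml, traffic is going over the openib/sm BTLs instead of mxm, which could be one reason all three cases perform identically.)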
>>>> >>>> -------------------------------------------------------------------------- >>>> i am backend >>>> [1429676565.121218] sys.c:719 MXM WARN Conflicting CPU >>>> frequencies detected, using: 2601.00 >>>> [1429676565.122937] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>> [1429676565.122950] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>> ibv_query_device() returned 38: Function not implemented >>>> [1429676565.123535] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>> [1429676565.123543] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>> ibv_query_device() returned 38: Function not implemented >>>> [1429676565.124664] sys.c:719 MXM WARN Conflicting CPU >>>> frequencies detected, using: 2601.00 >>>> [1429676565.126264] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>> [1429676565.126276] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>> ibv_query_device() returned 38: Function not implemented >>>> [1429676565.126812] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>> [1429676565.126821] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>> ibv_query_device() returned 38: Function not implemented >>>> >>>> -------------------------------------------------------------------------- >>>> Initialization of MXM library failed. >>>> >>>> Error: Input/output error >>>> >>>> >>>> -------------------------------------------------------------------------- >>>> >>>> <application runs fine> >>>> >>>> >>>> Thanks, >>>> Subhra. >>>> >>>> >>>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman <mi...@dev.mellanox.co.il >>>> > wrote: >>>> >>>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0? >>>>> why LD_PRELOAD needed in your command line? Can you try >>>>> >>>>> module load hpcx >>>>> mpirun -np $np test.exe >>>>> ? >>>>> >>>>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar < >>>>> subhramazumd...@gmail.com> wrote: >>>>> >>>>>> I followed the instructions as in the README, now getting a different >>>>>> error: >>>>>> >>>>>> [root@JARVICE >>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>> ../openmpi-1.8.4/openmpinstall/bin/mpirun --allow-run-as-root --mca mtl >>>>>> mxm >>>>>> -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>> ./mxm/lib/libmxm.so.2" -n 1 ../backend localhost : -x >>>>>> LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>> ./mxm/lib/libmxm.so.2 ../libci.so" -n 1 ../app2 >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> WARNING: a request was made to bind a process. While the system >>>>>> >>>>>> supports binding the process itself, at least one node does NOT >>>>>> >>>>>> support binding memory to the process location. >>>>>> >>>>>> Node: JARVICE >>>>>> >>>>>> This usually is due to not having the required NUMA support installed >>>>>> >>>>>> on the node. In some Linux distributions, the required support is >>>>>> >>>>>> contained in the libnumactl and libnumactl-devel packages. >>>>>> >>>>>> This is a warning only; your job will continue, though performance >>>>>> may be degraded. 
>>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> i am backend >>>>>> >>>>>> [1429334876.139452] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> >>>>>> [1429334876.139464] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> >>>>>> [1429334876.139982] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> >>>>>> [1429334876.139990] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> >>>>>> [1429334876.142649] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> >>>>>> [1429334876.142666] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> >>>>>> [1429334876.143235] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> >>>>>> [1429334876.143243] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> Initialization of MXM library failed. >>>>>> >>>>>> Error: Input/output error >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> [JARVICE:449 :0] Caught signal 11 (Segmentation fault) >>>>>> >>>>>> [JARVICE:450 :0] Caught signal 11 (Segmentation fault) >>>>>> >>>>>> ==== backtrace ==== >>>>>> >>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>> >>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>> >>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>> >>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>> >>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>> >>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>> >>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>> >>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>> >>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>> >>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>> >>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>> >>>>>> 13 0x000000000000d0ca l_getLocalFromConfig() >>>>>> /root/rain_ib/interposer/libciutils.c:83 >>>>>> >>>>>> 14 0x000000000000c7b4 __cudaRegisterFatBinary() >>>>>> /root/rain_ib/interposer/libci.c:4055 >>>>>> >>>>>> 15 0x0000000000402b59 >>>>>> _ZL70__sti____cudaRegisterAll_39_tmpxft_00000703_00000000_6_app2_cpp1_ii_hwv() >>>>>> 
tmpxft_00000703_00000000-3_app2.cudafe1.cpp:0 >>>>>> >>>>>> 16 0x0000000000402dd6 __do_global_ctors_aux() crtstuff.c:0 >>>>>> >>>>>> =================== >>>>>> >>>>>> ==== backtrace ==== >>>>>> >>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>> >>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>> >>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>> >>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>> >>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>> >>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>> >>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>> >>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>> >>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>> >>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>> >>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>> >>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>> >>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>> >>>>>> 14 0x000000000001ed1d __libc_start_main() ??:0 >>>>>> >>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>> >>>>>> =================== >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> mpirun noticed that process rank 1 with PID 450 on node JARVICE >>>>>> exited on signal 11 (Segmentation fault). >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> [JARVICE:00447] 1 more process has sent help message help-mtl-mxm.txt >>>>>> / mxm init >>>>>> >>>>>> [JARVICE:00447] Set MCA parameter "orte_base_help_aggregate" to 0 to >>>>>> see all help / error messages >>>>>> >>>>>> [root@JARVICE >>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>> >>>>>> >>>>>> Subhra. >>>>>> >>>>>> >>>>>> On Mon, Apr 13, 2015 at 10:58 PM, Mike Dubman < >>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>> >>>>>>> Have you followed installation steps from README (Also here for >>>>>>> reference http://bgate.mellanox.com/products/hpcx/README.txt) >>>>>>> >>>>>>> ... >>>>>>> >>>>>>> * Load OpenMPI/OpenSHMEM v1.8 based package: >>>>>>> >>>>>>> % source $HPCX_HOME/hpcx-init.sh >>>>>>> % hpcx_load >>>>>>> % env | grep HPCX >>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_usempi >>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>> % hpcx_unload >>>>>>> >>>>>>> 3. Load HPCX environment from modules >>>>>>> >>>>>>> * Load OpenMPI/OpenSHMEM based package: >>>>>>> >>>>>>> % module use $HPCX_HOME/modulefiles >>>>>>> % module load hpcx >>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c >>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>> % module unload hpcx >>>>>>> >>>>>>> ... 
>>>>>>> >>>>>>> On Tue, Apr 14, 2015 at 5:42 AM, Subhra Mazumdar < >>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>> >>>>>>>> I am using 2.4-1.0.0 mellanox ofed. >>>>>>>> >>>>>>>> I downloaded mofed tarball >>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5.tar and >>>>>>>> extracted >>>>>>>> it. It has mxm directory. >>>>>>>> >>>>>>>> hpcx-v1.2.0-325-[root@JARVICE ~]# ls >>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5 >>>>>>>> archive fca hpcx-init-ompi-mellanox-v1.8.sh ibprof >>>>>>>> modulefiles ompi-mellanox-v1.8 sources VERSION >>>>>>>> bupc-master hcoll hpcx-init.sh knem >>>>>>>> mxm README.txt utils >>>>>>>> >>>>>>>> I tried using LD_PRELOAD for libmxm, but getting a different error >>>>>>>> stack now as following >>>>>>>> >>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>> --allow-run-as-root --mca mtl mxm -x >>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2" >>>>>>>> -n 1 ./backend localhost : -x >>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2 >>>>>>>> ./libci.so" -n 1 ./app2 >>>>>>>> i am backend >>>>>>>> [JARVICE:00564] mca: base: components_open: component pml / cm open >>>>>>>> function failed >>>>>>>> [JARVICE:564 :0] Caught signal 11 (Segmentation fault) >>>>>>>> [JARVICE:00565] mca: base: components_open: component pml / cm open >>>>>>>> function failed >>>>>>>> [JARVICE:565 :0] Caught signal 11 (Segmentation fault) >>>>>>>> ==== backtrace ==== >>>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>>> 5 0x0000000000045491 mca_base_components_close() ??:0 >>>>>>>> 6 0x000000000004e99a mca_base_framework_close() ??:0 >>>>>>>> 7 0x0000000000045431 mca_base_component_close() ??:0 >>>>>>>> 8 0x000000000004515c mca_base_framework_components_open() ??:0 >>>>>>>> 9 0x00000000000a0de9 mca_pml_base_open() pml_base_frame.c:0 >>>>>>>> 10 0x000000000004eb1c mca_base_framework_open() ??:0 >>>>>>>> 11 0x0000000000043eb3 ompi_mpi_init() ??:0 >>>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>>>> 14 0x000000000001ed1d __libc_start_main() ??:0 >>>>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>>>> =================== >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> A requested component was not found, or was unable to be opened. >>>>>>>> This >>>>>>>> means that this component is either not installed or is unable to be >>>>>>>> used on your system (e.g., sometimes this means that shared >>>>>>>> libraries >>>>>>>> that the component requires are unable to be found/loaded). Note >>>>>>>> that >>>>>>>> Open MPI stopped checking at the first component that it did not >>>>>>>> find. 
>>>>>>>> >>>>>>>> Host: JARVICE >>>>>>>> Framework: mtl >>>>>>>> Component: mxm >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> mpirun noticed that process rank 0 with PID 564 on node JARVICE >>>>>>>> exited on signal 11 (Segmentation fault). >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> [JARVICE:00562] 1 more process has sent help message >>>>>>>> help-mca-base.txt / find-available:not-valid >>>>>>>> [JARVICE:00562] Set MCA parameter "orte_base_help_aggregate" to 0 >>>>>>>> to see all help / error messages >>>>>>>> >>>>>>>> >>>>>>>> Subhra >>>>>>>> >>>>>>>> >>>>>>>> On Sun, Apr 12, 2015 at 10:48 PM, Mike Dubman < >>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>> >>>>>>>>> seems like mxm was not found in your ld_library_path. >>>>>>>>> >>>>>>>>> what mofed version do you use? >>>>>>>>> does it have /opt/mellanox/mxm in it? >>>>>>>>> You could just run mpirun from HPCX package which looks for mxm >>>>>>>>> internally and recompile ompi as mentioned in README. >>>>>>>>> >>>>>>>>> On Mon, Apr 13, 2015 at 3:24 AM, Subhra Mazumdar < >>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I used mxm mtl as follows but getting segfault. It says mxm >>>>>>>>>> component not found but I have compiled openmpi with mxm. Any idea >>>>>>>>>> what I >>>>>>>>>> might be missing? >>>>>>>>>> >>>>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>>>> --allow-run-as-root --mca pml cm --mca mtl mxm -n 1 -x >>>>>>>>>> LD_PRELOAD=./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./backend >>>>>>>>>> localhosst : -n 1 -x LD_PRELOAD="./libci.so >>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1" ./app2 >>>>>>>>>> i am backend >>>>>>>>>> [JARVICE:08398] *** Process received signal *** >>>>>>>>>> [JARVICE:08398] Signal: Segmentation fault (11) >>>>>>>>>> [JARVICE:08398] Signal code: Address not mapped (1) >>>>>>>>>> [JARVICE:08398] Failing at address: 0x10 >>>>>>>>>> [JARVICE:08398] [ 0] >>>>>>>>>> /lib64/libpthread.so.0(+0xf710)[0x7ff8d0ddb710] >>>>>>>>>> [JARVICE:08398] [ 1] >>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_components_close+0x21)[0x7ff8cf9ae491] >>>>>>>>>> [JARVICE:08398] [ 2] >>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_close+0x6a)[0x7ff8cf9b799a] >>>>>>>>>> [JARVICE:08398] [ 3] >>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_component_close+0x21)[0x7ff8cf9ae431] >>>>>>>>>> [JARVICE:08398] [ 4] >>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_components_open+0x11c)[0x7ff8cf9ae15c] >>>>>>>>>> [JARVICE:08398] [ 5] >>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(+0xa0de9)[0x7ff8d1089de9] >>>>>>>>>> [JARVICE:08398] [ 6] >>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7ff8cf9b7b1c] >>>>>>>>>> [JARVICE:08398] [ 7] [JARVICE:08398] mca: base: components_open: >>>>>>>>>> component pml / cm open function failed >>>>>>>>>> >>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(ompi_mpi_init+0x4b3)[0x7ff8d102ceb3] >>>>>>>>>> [JARVICE:08398] [ 8] >>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(PMPI_Init_thread+0x100)[0x7ff8d1050cb0] >>>>>>>>>> [JARVICE:08398] [ 9] ./backend[0x404fdf] >>>>>>>>>> [JARVICE:08398] [10] >>>>>>>>>> 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff8cfeded1d] >>>>>>>>>> [JARVICE:08398] [11] ./backend[0x402db9] >>>>>>>>>> [JARVICE:08398] *** End of error message *** >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> A requested component was not found, or was unable to be opened. >>>>>>>>>> This >>>>>>>>>> means that this component is either not installed or is unable to >>>>>>>>>> be >>>>>>>>>> used on your system (e.g., sometimes this means that shared >>>>>>>>>> libraries >>>>>>>>>> that the component requires are unable to be found/loaded). Note >>>>>>>>>> that >>>>>>>>>> Open MPI stopped checking at the first component that it did not >>>>>>>>>> find. >>>>>>>>>> >>>>>>>>>> Host: JARVICE >>>>>>>>>> Framework: mtl >>>>>>>>>> Component: mxm >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun noticed that process rank 0 with PID 8398 on node JARVICE >>>>>>>>>> exited on signal 11 (Segmentation fault). >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Subhra. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Apr 10, 2015 at 12:12 AM, Mike Dubman < >>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>> >>>>>>>>>>> no need IPoIB, mxm uses native IB. >>>>>>>>>>> >>>>>>>>>>> Please see HPCX (pre-compiled ompi, integrated with MXM and FCA) >>>>>>>>>>> README file for details how to compile/select. >>>>>>>>>>> >>>>>>>>>>> The default transport is UD for internode communication and >>>>>>>>>>> shared-memory for intra-node. >>>>>>>>>>> >>>>>>>>>>> http://bgate,mellanox.com/products/hpcx/ >>>>>>>>>>> >>>>>>>>>>> Also, mxm included in the Mellanox OFED. >>>>>>>>>>> >>>>>>>>>>> On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar < >>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Does ipoib need to be configured on the ib cards for mxm (I >>>>>>>>>>>> have a separate ethernet connection too)? Also are there special >>>>>>>>>>>> flags in >>>>>>>>>>>> mpirun to select from UD/RC/DC? What is the default? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Subhra. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman < >>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> mxm uses IB rdma/roce technologies. Once can select UD/RC/DC >>>>>>>>>>>>> transports to be used in mxm. >>>>>>>>>>>>> >>>>>>>>>>>>> By selecting mxm, all MPI p2p routines will be mapped to >>>>>>>>>>>>> appropriate mxm functions. >>>>>>>>>>>>> >>>>>>>>>>>>> M >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar < >>>>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi MIke, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Does the mxm mtl use infiniband rdma? Also from programming >>>>>>>>>>>>>> perspective, do I need to use anything else other than >>>>>>>>>>>>>> MPI_Send/MPI_Recv? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Subhra. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman < >>>>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> openib btl does not support this thread model. 
>>>>>>>>>>>>>>> You can use OMPI w/ mxm (-mca mtl mxm) and multiple thread >>>>>>>>>>>>>>> mode lin 1.8 x series or (-mca pml yalla) in the master branch. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> M >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar < >>>>>>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Can MPI_THREAD_MULTIPLE and openib btl work together in >>>>>>>>>>>>>>>> open mpi 1.8.4? If so are there any command line options >>>>>>>>>>>>>>>> needed during run >>>>>>>>>>>>>>>> time? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Subhra. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Kind Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> M. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26575.php >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26580.php >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> >>>>>>>>>>>>> Kind Regards, >>>>>>>>>>>>> >>>>>>>>>>>>> M. >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> users mailing list >>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>> Subscription: >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26584.php >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> users mailing list >>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>> Subscription: >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>> Link to this post: >>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26663.php >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> >>>>>>>>>>> Kind Regards, >>>>>>>>>>> >>>>>>>>>>> M. 
>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> users mailing list >>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>> Link to this post: >>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26665.php >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> users mailing list >>>>>>>>>> us...@open-mpi.org >>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>> Link to this post: >>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26686.php >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> Kind Regards, >>>>>>>>> >>>>>>>>> M. >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>> Link to this post: >>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26688.php >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> Link to this post: >>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26711.php >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Kind Regards, >>>>>>> >>>>>>> M. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> Link to this post: >>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26712.php >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26752.php >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Kind Regards, >>>>> >>>>> M. >>>>> >>>>> _______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> Link to this post: >>>>> http://www.open-mpi.org/community/lists/users/2015/04/26754.php >>>>> >>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2015/04/26761.php >>>> >>> >>> >>> >>> -- >>> >>> Kind Regards, >>> >>> M. >>> >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2015/04/26762.php >>> >> >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/04/26766.php >> > > > > -- > > Kind Regards, > > M. 
> > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/04/26768.php >