I am a little confused now. I ran the job in 3 different ways and got 3 different levels of performance, from best to worst, in the following order (a sketch for checking which transport is actually selected appears further down):
1) mpirun --allow-run-as-root --mca pml ob1 -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 2) mpirun --allow-run-as-root -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 3) mpirun --allow-run-as-root --mca pml cm --mca mtl mxm -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 Are all of the above using infiniband but in different ways? Thanks, Subhra. On Thu, Apr 23, 2015 at 11:57 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > HPCX package uses pml "yalla" by default (part of ompi master branch, not > in v1.8). > So, "-mca mtl mxm" has no effect, unless "-mca pml cm" specified to > disable "pml yalla" and let mtl layer to play. > > > > On Fri, Apr 24, 2015 at 6:36 AM, Subhra Mazumdar < > subhramazumd...@gmail.com> wrote: > >> I changed my downloaded MOFED version to match the one installed on the >> node and now the error goes away and it runs fine. But I still have a >> question, I get the exact same performance on all the below 3 cases: >> >> 1) mpirun --allow-run-as-root --mca mtl mxm -mca mtl_mxm_np 0 -x >> MXM_TLS=self,shm,rc,ud -n 1 /root/backend localhost : -x >> LD_PRELOAD=/root/libci.so -n 1 /root/app2 >> >> 2) mpirun --allow-run-as-root --mca mtl mxm -n 1 /root/backend >> localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 >> >> 3) mpirun --allow-run-as-root --mca mtl ^mxm -n 1 /root/backend >> localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2 >> >> Seems like it doesn't matter if I use mxm, not use mxm or use it with >> reliable connection (RC). How can I be sure I am indeed using mxm over >> infiniband? >> >> Thanks, >> Subhra. >> >> >> >> >> >> On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il> >> wrote: >> >>> /usr/bin/ofed_info >>> >>> So, the OFED on your system is not MellanoxOFED 2.4.x but smth else. >>> >>> try #rpm -qi libibverbs >>> >>> >>> On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar < >>> subhramazumd...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> where is the command ofed_info located? I searched from / but didn't >>>> find it. >>>> >>>> Subhra. >>>> >>>> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <mi...@dev.mellanox.co.il >>>> > wrote: >>>> >>>>> cool, progress! >>>>> >>>>> >>1429676565.124664] sys.c:719 MXM WARN Conflicting CPU >>>>> frequencies detected, using: 2601.00 >>>>> >>>>> means that cpu governor on your machine is not on "performance" mode >>>>> >>>>> >> MXM ERROR ibv_query_device() returned 38: Function not implemented >>>>> >>>>> indicates that ofed installed on your nodes is not indeed 2.4.-1.0.0 >>>>> or there is a mismatch between ofed kernel drivers version and ofed >>>>> userspace libraries version. >>>>> or you have multiple ofed libraries installed on your node and use >>>>> incorrect one. >>>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar < >>>>> subhramazumd...@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I compiled the openmpi that comes inside the mellanox hpcx package >>>>>> with mxm support instead of separately downloaded openmpi. I also used >>>>>> the >>>>>> environment as in the README so that no LD_PRELOAD (except our own >>>>>> library >>>>>> which is unrelated) is needed. Now it runs fine (no segfault) but we get >>>>>> same errors as before (saying initialization of MXM library failed). Is >>>>>> it >>>>>> using MXM successfully? 
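A minimal sketch for answering both questions above ("are all three using InfiniBand?" and "is it using MXM successfully?"): Open MPI can report which pml/mtl components it selected via its standard framework verbosity parameters. The command below reuses the exact invocation from this thread and only adds the verbose knobs; interpret the output as ob1 = BTL path (openib/tcp/sm), cm + mxm = MXM over native InfiniBand, yalla = MXM via the newer pml.

  # Ask the pml and mtl frameworks to print which component they selected
  mpirun --allow-run-as-root --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
      -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2

  # Confirm the mxm MTL was built into this Open MPI installation at all
  ompi_info | grep -i mxm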
>>>>>> >>>>>> [root@JARVICE >>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun >>>>>> --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost : -x >>>>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2 >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> WARNING: a request was made to bind a process. While the system >>>>>> supports binding the process itself, at least one node does NOT >>>>>> support binding memory to the process location. >>>>>> >>>>>> Node: JARVICE >>>>>> >>>>>> This usually is due to not having the required NUMA support installed >>>>>> on the node. In some Linux distributions, the required support is >>>>>> contained in the libnumactl and libnumactl-devel packages. >>>>>> This is a warning only; your job will continue, though performance >>>>>> may be degraded. >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> i am backend >>>>>> [1429676565.121218] sys.c:719 MXM WARN Conflicting CPU >>>>>> frequencies detected, using: 2601.00 >>>>>> [1429676565.122937] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> [1429676565.122950] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> [1429676565.123535] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> [1429676565.123543] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> [1429676565.124664] sys.c:719 MXM WARN Conflicting CPU >>>>>> frequencies detected, using: 2601.00 >>>>>> [1429676565.126264] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> [1429676565.126276] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> [1429676565.126812] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>> [1429676565.126821] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>>>> ibv_query_device() returned 38: Function not implemented >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> Initialization of MXM library failed. >>>>>> >>>>>> Error: Input/output error >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> <application runs fine> >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Subhra. >>>>>> >>>>>> >>>>>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman < >>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>> >>>>>>> could you please check that ofed_info -s indeed prints mofed >>>>>>> 2.4-1.0.0? >>>>>>> why LD_PRELOAD needed in your command line? Can you try >>>>>>> >>>>>>> module load hpcx >>>>>>> mpirun -np $np test.exe >>>>>>> ? 
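To rule out the two problems flagged in the log above, here is a quick checklist (a sketch; the cpupower utility may be packaged differently on your distribution, and the sysfs path is the usual cpufreq location):

  # Verify the installed OFED matches the HPCX build (expect MLNX_OFED_LINUX-2.4-1.0.0)
  ofed_info -s
  rpm -qi libibverbs

  # MXM shipped with Mellanox OFED normally lives here
  ls /opt/mellanox/mxm

  # Put the CPU frequency governor into performance mode to silence the
  # "Conflicting CPU frequencies detected" warning
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cpupower frequency-set -g performance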
>>>>>>> >>>>>>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar < >>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>> >>>>>>>> I followed the instructions as in the README, now getting a >>>>>>>> different error: >>>>>>>> >>>>>>>> [root@JARVICE >>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>>>> ../openmpi-1.8.4/openmpinstall/bin/mpirun --allow-run-as-root --mca >>>>>>>> mtl mxm >>>>>>>> -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>> ./mxm/lib/libmxm.so.2" -n 1 ../backend localhost : -x >>>>>>>> LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>> ./mxm/lib/libmxm.so.2 ../libci.so" -n 1 ../app2 >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> WARNING: a request was made to bind a process. While the system >>>>>>>> >>>>>>>> supports binding the process itself, at least one node does NOT >>>>>>>> >>>>>>>> support binding memory to the process location. >>>>>>>> >>>>>>>> Node: JARVICE >>>>>>>> >>>>>>>> This usually is due to not having the required NUMA support >>>>>>>> installed >>>>>>>> >>>>>>>> on the node. In some Linux distributions, the required support is >>>>>>>> >>>>>>>> contained in the libnumactl and libnumactl-devel packages. >>>>>>>> >>>>>>>> This is a warning only; your job will continue, though performance >>>>>>>> may be degraded. >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> i am backend >>>>>>>> >>>>>>>> [1429334876.139452] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>>> >>>>>>>> [1429334876.139464] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>>> >>>>>>>> [1429334876.139982] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>>> >>>>>>>> [1429334876.139990] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>>> >>>>>>>> [1429334876.142649] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>>> >>>>>>>> [1429334876.142666] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>>> >>>>>>>> [1429334876.143235] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>>> >>>>>>>> [1429334876.143243] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> Initialization of MXM library failed. 
>>>>>>>> >>>>>>>> Error: Input/output error >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> [JARVICE:449 :0] Caught signal 11 (Segmentation fault) >>>>>>>> >>>>>>>> [JARVICE:450 :0] Caught signal 11 (Segmentation fault) >>>>>>>> >>>>>>>> ==== backtrace ==== >>>>>>>> >>>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>>> >>>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>>> >>>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>>> >>>>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>>>> >>>>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>>>> >>>>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>>>> >>>>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>>>> >>>>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>>>> >>>>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>>>> >>>>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>>>> >>>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>>> >>>>>>>> 13 0x000000000000d0ca l_getLocalFromConfig() >>>>>>>> /root/rain_ib/interposer/libciutils.c:83 >>>>>>>> >>>>>>>> 14 0x000000000000c7b4 __cudaRegisterFatBinary() >>>>>>>> /root/rain_ib/interposer/libci.c:4055 >>>>>>>> >>>>>>>> 15 0x0000000000402b59 >>>>>>>> _ZL70__sti____cudaRegisterAll_39_tmpxft_00000703_00000000_6_app2_cpp1_ii_hwv() >>>>>>>> tmpxft_00000703_00000000-3_app2.cudafe1.cpp:0 >>>>>>>> >>>>>>>> 16 0x0000000000402dd6 __do_global_ctors_aux() crtstuff.c:0 >>>>>>>> >>>>>>>> =================== >>>>>>>> >>>>>>>> ==== backtrace ==== >>>>>>>> >>>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>>> >>>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>>> >>>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>>> >>>>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>>>> >>>>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>>>> >>>>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>>>> >>>>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>>>> >>>>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>>>> 
>>>>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>>>> >>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>>>> >>>>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>>>> >>>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>>> >>>>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>>>> >>>>>>>> 14 0x000000000001ed1d __libc_start_main() ??:0 >>>>>>>> >>>>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>>>> >>>>>>>> =================== >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> mpirun noticed that process rank 1 with PID 450 on node JARVICE >>>>>>>> exited on signal 11 (Segmentation fault). >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> [JARVICE:00447] 1 more process has sent help message >>>>>>>> help-mtl-mxm.txt / mxm init >>>>>>>> >>>>>>>> [JARVICE:00447] Set MCA parameter "orte_base_help_aggregate" to 0 >>>>>>>> to see all help / error messages >>>>>>>> >>>>>>>> [root@JARVICE >>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>>>> >>>>>>>> >>>>>>>> Subhra. >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Apr 13, 2015 at 10:58 PM, Mike Dubman < >>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>> >>>>>>>>> Have you followed installation steps from README (Also here for >>>>>>>>> reference http://bgate.mellanox.com/products/hpcx/README.txt) >>>>>>>>> >>>>>>>>> ... >>>>>>>>> >>>>>>>>> * Load OpenMPI/OpenSHMEM v1.8 based package: >>>>>>>>> >>>>>>>>> % source $HPCX_HOME/hpcx-init.sh >>>>>>>>> % hpcx_load >>>>>>>>> % env | grep HPCX >>>>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_usempi >>>>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>>>> % hpcx_unload >>>>>>>>> >>>>>>>>> 3. Load HPCX environment from modules >>>>>>>>> >>>>>>>>> * Load OpenMPI/OpenSHMEM based package: >>>>>>>>> >>>>>>>>> % module use $HPCX_HOME/modulefiles >>>>>>>>> % module load hpcx >>>>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c >>>>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>>>> % module unload hpcx >>>>>>>>> >>>>>>>>> ... >>>>>>>>> >>>>>>>>> On Tue, Apr 14, 2015 at 5:42 AM, Subhra Mazumdar < >>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> I am using 2.4-1.0.0 mellanox ofed. >>>>>>>>>> >>>>>>>>>> I downloaded mofed tarball >>>>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5.tar and >>>>>>>>>> extracted >>>>>>>>>> it. It has mxm directory. 
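One way to see which libmxm the mxm MTL will actually pull in, instead of relying on LD_PRELOAD (a sketch; the plugin path assumes the openmpinstall prefix used above and the conventional Open MPI component layout, so adjust it to your install):

  # The mxm MTL is a plugin under <prefix>/lib/openmpi; check which libmxm.so.2 it resolves to
  ldd ./openmpi-1.8.4/openmpinstall/lib/openmpi/mca_mtl_mxm.so | grep libmxm

  # If it resolves to the wrong copy, extending the library search path is usually
  # cleaner than LD_PRELOAD (path shown is the Mellanox OFED location mentioned earlier)
  export LD_LIBRARY_PATH=/opt/mellanox/mxm/lib:$LD_LIBRARY_PATH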
>>>>>>>>>> >>>>>>>>>> hpcx-v1.2.0-325-[root@JARVICE ~]# ls >>>>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5 >>>>>>>>>> archive fca hpcx-init-ompi-mellanox-v1.8.sh ibprof >>>>>>>>>> modulefiles ompi-mellanox-v1.8 sources VERSION >>>>>>>>>> bupc-master hcoll hpcx-init.sh knem >>>>>>>>>> mxm README.txt utils >>>>>>>>>> >>>>>>>>>> I tried using LD_PRELOAD for libmxm, but getting a different >>>>>>>>>> error stack now as following >>>>>>>>>> >>>>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>>>> --allow-run-as-root --mca mtl mxm -x >>>>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2" >>>>>>>>>> -n 1 ./backend localhost : -x >>>>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2 >>>>>>>>>> ./libci.so" -n 1 ./app2 >>>>>>>>>> i am backend >>>>>>>>>> [JARVICE:00564] mca: base: components_open: component pml / cm >>>>>>>>>> open function failed >>>>>>>>>> [JARVICE:564 :0] Caught signal 11 (Segmentation fault) >>>>>>>>>> [JARVICE:00565] mca: base: components_open: component pml / cm >>>>>>>>>> open function failed >>>>>>>>>> [JARVICE:565 :0] Caught signal 11 (Segmentation fault) >>>>>>>>>> ==== backtrace ==== >>>>>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>>>>> 5 0x0000000000045491 mca_base_components_close() ??:0 >>>>>>>>>> 6 0x000000000004e99a mca_base_framework_close() ??:0 >>>>>>>>>> 7 0x0000000000045431 mca_base_component_close() ??:0 >>>>>>>>>> 8 0x000000000004515c mca_base_framework_components_open() ??:0 >>>>>>>>>> 9 0x00000000000a0de9 mca_pml_base_open() pml_base_frame.c:0 >>>>>>>>>> 10 0x000000000004eb1c mca_base_framework_open() ??:0 >>>>>>>>>> 11 0x0000000000043eb3 ompi_mpi_init() ??:0 >>>>>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>>>>>> 14 0x000000000001ed1d __libc_start_main() ??:0 >>>>>>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>>>>>> =================== >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> A requested component was not found, or was unable to be opened. >>>>>>>>>> This >>>>>>>>>> means that this component is either not installed or is unable to >>>>>>>>>> be >>>>>>>>>> used on your system (e.g., sometimes this means that shared >>>>>>>>>> libraries >>>>>>>>>> that the component requires are unable to be found/loaded). Note >>>>>>>>>> that >>>>>>>>>> Open MPI stopped checking at the first component that it did not >>>>>>>>>> find. 
>>>>>>>>>> >>>>>>>>>> Host: JARVICE >>>>>>>>>> Framework: mtl >>>>>>>>>> Component: mxm >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun noticed that process rank 0 with PID 564 on node JARVICE >>>>>>>>>> exited on signal 11 (Segmentation fault). >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> [JARVICE:00562] 1 more process has sent help message >>>>>>>>>> help-mca-base.txt / find-available:not-valid >>>>>>>>>> [JARVICE:00562] Set MCA parameter "orte_base_help_aggregate" to 0 >>>>>>>>>> to see all help / error messages >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Subhra >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sun, Apr 12, 2015 at 10:48 PM, Mike Dubman < >>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>> >>>>>>>>>>> seems like mxm was not found in your ld_library_path. >>>>>>>>>>> >>>>>>>>>>> what mofed version do you use? >>>>>>>>>>> does it have /opt/mellanox/mxm in it? >>>>>>>>>>> You could just run mpirun from HPCX package which looks for mxm >>>>>>>>>>> internally and recompile ompi as mentioned in README. >>>>>>>>>>> >>>>>>>>>>> On Mon, Apr 13, 2015 at 3:24 AM, Subhra Mazumdar < >>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I used mxm mtl as follows but getting segfault. It says mxm >>>>>>>>>>>> component not found but I have compiled openmpi with mxm. Any idea >>>>>>>>>>>> what I >>>>>>>>>>>> might be missing? >>>>>>>>>>>> >>>>>>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>>>>>> --allow-run-as-root --mca pml cm --mca mtl mxm -n 1 -x >>>>>>>>>>>> LD_PRELOAD=./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./backend >>>>>>>>>>>> localhosst : -n 1 -x LD_PRELOAD="./libci.so >>>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1" ./app2 >>>>>>>>>>>> i am backend >>>>>>>>>>>> [JARVICE:08398] *** Process received signal *** >>>>>>>>>>>> [JARVICE:08398] Signal: Segmentation fault (11) >>>>>>>>>>>> [JARVICE:08398] Signal code: Address not mapped (1) >>>>>>>>>>>> [JARVICE:08398] Failing at address: 0x10 >>>>>>>>>>>> [JARVICE:08398] [ 0] >>>>>>>>>>>> /lib64/libpthread.so.0(+0xf710)[0x7ff8d0ddb710] >>>>>>>>>>>> [JARVICE:08398] [ 1] >>>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_components_close+0x21)[0x7ff8cf9ae491] >>>>>>>>>>>> [JARVICE:08398] [ 2] >>>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_close+0x6a)[0x7ff8cf9b799a] >>>>>>>>>>>> [JARVICE:08398] [ 3] >>>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_component_close+0x21)[0x7ff8cf9ae431] >>>>>>>>>>>> [JARVICE:08398] [ 4] >>>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_components_open+0x11c)[0x7ff8cf9ae15c] >>>>>>>>>>>> [JARVICE:08398] [ 5] >>>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(+0xa0de9)[0x7ff8d1089de9] >>>>>>>>>>>> [JARVICE:08398] [ 6] >>>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7ff8cf9b7b1c] >>>>>>>>>>>> [JARVICE:08398] [ 7] [JARVICE:08398] mca: base: >>>>>>>>>>>> components_open: component pml / cm open function failed >>>>>>>>>>>> >>>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(ompi_mpi_init+0x4b3)[0x7ff8d102ceb3] >>>>>>>>>>>> [JARVICE:08398] [ 8] >>>>>>>>>>>> 
./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(PMPI_Init_thread+0x100)[0x7ff8d1050cb0] >>>>>>>>>>>> [JARVICE:08398] [ 9] ./backend[0x404fdf] >>>>>>>>>>>> [JARVICE:08398] [10] >>>>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff8cfeded1d] >>>>>>>>>>>> [JARVICE:08398] [11] ./backend[0x402db9] >>>>>>>>>>>> [JARVICE:08398] *** End of error message *** >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>> A requested component was not found, or was unable to be >>>>>>>>>>>> opened. This >>>>>>>>>>>> means that this component is either not installed or is unable >>>>>>>>>>>> to be >>>>>>>>>>>> used on your system (e.g., sometimes this means that shared >>>>>>>>>>>> libraries >>>>>>>>>>>> that the component requires are unable to be found/loaded). >>>>>>>>>>>> Note that >>>>>>>>>>>> Open MPI stopped checking at the first component that it did >>>>>>>>>>>> not find. >>>>>>>>>>>> >>>>>>>>>>>> Host: JARVICE >>>>>>>>>>>> Framework: mtl >>>>>>>>>>>> Component: mxm >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>> mpirun noticed that process rank 0 with PID 8398 on node >>>>>>>>>>>> JARVICE exited on signal 11 (Segmentation fault). >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Subhra. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Apr 10, 2015 at 12:12 AM, Mike Dubman < >>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> no need IPoIB, mxm uses native IB. >>>>>>>>>>>>> >>>>>>>>>>>>> Please see HPCX (pre-compiled ompi, integrated with MXM and >>>>>>>>>>>>> FCA) README file for details how to compile/select. >>>>>>>>>>>>> >>>>>>>>>>>>> The default transport is UD for internode communication and >>>>>>>>>>>>> shared-memory for intra-node. >>>>>>>>>>>>> >>>>>>>>>>>>> http://bgate,mellanox.com/products/hpcx/ >>>>>>>>>>>>> >>>>>>>>>>>>> Also, mxm included in the Mellanox OFED. >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar < >>>>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Does ipoib need to be configured on the ib cards for mxm (I >>>>>>>>>>>>>> have a separate ethernet connection too)? Also are there special >>>>>>>>>>>>>> flags in >>>>>>>>>>>>>> mpirun to select from UD/RC/DC? What is the default? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Subhra. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman < >>>>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> mxm uses IB rdma/roce technologies. Once can select UD/RC/DC >>>>>>>>>>>>>>> transports to be used in mxm. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> By selecting mxm, all MPI p2p routines will be mapped to >>>>>>>>>>>>>>> appropriate mxm functions. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> M >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar < >>>>>>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi MIke, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Does the mxm mtl use infiniband rdma? Also from programming >>>>>>>>>>>>>>>> perspective, do I need to use anything else other than >>>>>>>>>>>>>>>> MPI_Send/MPI_Recv? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Subhra. 
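Selecting the MXM transport is done through the environment rather than the MPI source code (a minimal sketch; ./a.out is a placeholder for your own binaries, and the MXM_TLS values follow the usage shown at the top of this thread):

  # Default inter-node transport is UD; request RC explicitly instead
  mpirun --mca pml cm --mca mtl mxm -x MXM_TLS=self,shm,rc -np 2 ./a.out

  # The application itself stays plain MPI_Send/MPI_Recv either way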
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman < >>>>>>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> openib btl does not support this thread model. >>>>>>>>>>>>>>>>> You can use OMPI w/ mxm (-mca mtl mxm) and multiple thread >>>>>>>>>>>>>>>>> mode lin 1.8 x series or (-mca pml yalla) in the master >>>>>>>>>>>>>>>>> branch. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> M >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar < >>>>>>>>>>>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Can MPI_THREAD_MULTIPLE and openib btl work together in >>>>>>>>>>>>>>>>>> open mpi 1.8.4? If so are there any command line options >>>>>>>>>>>>>>>>>> needed during run >>>>>>>>>>>>>>>>>> time? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Subhra. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Kind Regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> M. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26575.php >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26580.php >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Kind Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> M. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26584.php >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26663.php >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> >>>>>>>>>>>>> Kind Regards, >>>>>>>>>>>>> >>>>>>>>>>>>> M. 
>>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> users mailing list >>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>> Subscription: >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26665.php >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> users mailing list >>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>> Subscription: >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>> Link to this post: >>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26686.php >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> >>>>>>>>>>> Kind Regards, >>>>>>>>>>> >>>>>>>>>>> M. >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> users mailing list >>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>> Link to this post: >>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26688.php >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> users mailing list >>>>>>>>>> us...@open-mpi.org >>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>> Link to this post: >>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26711.php >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> Kind Regards, >>>>>>>>> >>>>>>>>> M. >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>> Link to this post: >>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26712.php >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> Link to this post: >>>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26752.php >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Kind Regards, >>>>>>> >>>>>>> M. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> Link to this post: >>>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26754.php >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/users/2015/04/26761.php >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Kind Regards, >>>>> >>>>> M. 
>>>>> >>>>> _______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> Link to this post: >>>>> http://www.open-mpi.org/community/lists/users/2015/04/26762.php >>>>> >>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2015/04/26766.php >>>> >>> >>> >>> >>> -- >>> >>> Kind Regards, >>> >>> M. >>> >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2015/04/26768.php >>> >> >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/04/26777.php >> > > > > -- > > Kind Regards, > > M. > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/04/26779.php >