I see what you mean.  Below is the output (filtered for a single host). Our 
setup is very generic.

Dell SOS6320 hosts (Haswell)
Mellanox ConnectX-3 HCAs (mlx4 drivers - native RHEL, not MOFED).
FDR/EDR switches (stand-alone opensm)
RHEL7.4
slurm 16.05.11
pmix (pmix-1.1.5-1.el7.x86_64)
openmpi (3.0.0, 3.1.0)

Apps include the well-known LAMMPS, VASP, GROMACS, amber, raxml, espresso, and 
namd2 (i.e. the usual list of research university apps).
gadget/gizmo/arepo are really the only ones giving us trouble, but I know they 
run fine under both openmpi and impi/mpich/mvapich at other sites.  I’m trying 
to figure out why we can’t seem to run them reliably, but I’d also like to get 
up to date with our transport APIs.  It seems we’ve fallen behind and are just 
doing the things we’ve always done (openib BTL).

I’ll try running with a modified “provider_include” list and see what happens.  
The fi_info output shows the verbs, udp, and sockets providers.
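
A minimal sketch of that experiment. It assumes the include list printed in the verbose output is controlled by an MCA parameter named mtl_ofi_provider_include (inferred from the "mtl:ofi:provider_include" log line, not checked against the 3.1.0 source), and that verbs is the provider to try first. Whether the verbs provider offers everything the OFI MTL needs is not confirmed here.

```shell
# Assumption: the MTL's include list is exposed as the MCA parameter
# mtl_ofi_provider_include (name inferred from the verbose log output).
export OMPI_MCA_mtl=ofi
export OMPI_MCA_mtl_ofi_provider_include=verbs
export OMPI_MCA_mtl_base_verbose=100

# Check that libfabric reports the provider before launching a job.
# fi_info prints one "provider: <name>" line per matching entry.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p verbs | grep '^provider' | sort -u || true
fi
```

With these set, the same verbose run should show whether "verbs" still gets rejected as "not in include list".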

Thanks,

Charlie


[chasman@login4 mufasa]$ grep 'c29a-s2.ufhpc' mz0.e 
[c29a-s2.ufhpc:01463] mca: base: components_register: registering framework mtl 
components
[c29a-s2.ufhpc:01463] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_register: component ofi register 
function successful
[c29a-s2.ufhpc:01463] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01463] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_open: component ofi open function 
successful
[c29a-s2.ufhpc:01464] mca: base: components_register: registering framework mtl 
components
[c29a-s2.ufhpc:01464] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_register: component ofi register 
function successful
[c29a-s2.ufhpc:01464] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01464] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_open: component ofi open function 
successful
[c29a-s2.ufhpc:01465] mca: base: components_register: registering framework mtl 
components
[c29a-s2.ufhpc:01465] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_register: component ofi register 
function successful
[c29a-s2.ufhpc:01465] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01465] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_open: component ofi open function 
successful
[c29a-s2.ufhpc:01466] mca: base: components_register: registering framework mtl 
components
[c29a-s2.ufhpc:01466] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_register: component ofi register 
function successful
[c29a-s2.ufhpc:01466] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01466] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_open: component ofi open function 
successful
[c29a-s2.ufhpc:01463] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01463] mca:base:select:(  mtl) Querying component [ofi]
[c29a-s2.ufhpc:01463] mca:base:select:(  mtl) Query of component [ofi] set 
priority to 25
[c29a-s2.ufhpc:01463] mca:base:select:(  mtl) Selected component [ofi]
[c29a-s2.ufhpc:01463] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01464] mca:base:select:(  mtl) Querying component [ofi]
[c29a-s2.ufhpc:01464] mca:base:select:(  mtl) Query of component [ofi] set 
priority to 25
[c29a-s2.ufhpc:01464] mca:base:select:(  mtl) Selected component [ofi]
[c29a-s2.ufhpc:01464] select: initializing mtl component ofi
[c29a-s2.ufhpc:01465] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01465] mca:base:select:(  mtl) Querying component [ofi]
[c29a-s2.ufhpc:01465] mca:base:select:(  mtl) Query of component [ofi] set 
priority to 25
[c29a-s2.ufhpc:01465] mca:base:select:(  mtl) Selected component [ofi]
[c29a-s2.ufhpc:01465] select: initializing mtl component ofi
[c29a-s2.ufhpc:01466] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01466] mca:base:select:(  mtl) Querying component [ofi]
[c29a-s2.ufhpc:01466] mca:base:select:(  mtl) Query of component [ofi] set 
priority to 25
[c29a-s2.ufhpc:01466] mca:base:select:(  mtl) Selected component [ofi]
[c29a-s2.ufhpc:01466] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01464] select: init returned failure for component ofi
[c29a-s2.ufhpc:01464] select: no component selected
[c29a-s2.ufhpc:01464] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01464] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01465] select: init returned failure for component ofi
[c29a-s2.ufhpc:01465] select: no component selected
[c29a-s2.ufhpc:01465] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01465] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01463] select: init returned failure for component ofi
[c29a-s2.ufhpc:01463] select: no component selected
[c29a-s2.ufhpc:01466] select: init returned failure for component ofi
[c29a-s2.ufhpc:01466] select: no component selected
[c29a-s2.ufhpc:01466] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01466] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01463] mca: base: close: unloading component ofi
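
The per-rank output above comes down to one thing: each provider libfabric offered was rejected because it is not in the include list ("psm,psm2,gni"), so no provider was selected. A small helper can tally those rejections from the saved stderr file. The function name is hypothetical, made up for illustration. It matches only the first half of each message, so it also works on line-wrapped copies like the one pasted here.

```shell
# Hypothetical helper: count how often each libfabric provider was rejected
# by the OFI MTL's include list in a verbose log file.
rejected_providers() {
    # Pull out the quoted provider name from: "<name>" not in include list
    grep -o '"[A-Za-z0-9_]*" not in' "$1" \
        | cut -d'"' -f2 \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'   # print as: <provider> <count>
}

# Usage (file name taken from the grep command above):
#   rejected_providers mz0.e
```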


> On Jun 14, 2018, at 7:48 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hello Charles
> 
> You are heading in the right direction.
> 
> First you might want to run the libfabric fi_info command to see what 
> capabilities you picked up from the libfabric RPMs.
> 
> Next you may well not actually be using the OFI  mtl.
> 
> Could you run your app with
> 
> export OMPI_MCA_mtl_base_verbose=100
> 
> and post the output?
> 
> It would also help if you described the system you are using :  OS 
> interconnect cpu type etc. 
> 
> Howard
> 
> Charles A Taylor <chas...@ufl.edu <mailto:chas...@ufl.edu>> wrote on Thu, 
> June 14, 2018 at 06:36:
> Because of the issues we are having with OpenMPI and the openib BTL 
> (questions previously asked), I’ve been looking into what other transports 
> are available.  I was particularly interested in OFI/libfabric support but 
> cannot find any information on it more recent than a reference to the usNIC 
> BTL from 2015 (Jeff Squyres, Cisco).  Unfortunately, the openmpi-org website 
> FAQs covering OpenFabrics support don’t mention anything beyond OpenMPI 1.8.  
> Given that 3.1 is the current stable version, that seems odd.
> 
> That being the case, I thought I’d ask here. After laying down the 
> libfabric-devel RPM and building (3.1.0) with --with-libfabric=/usr, I end up 
> with an “ofi” MTL but nothing else.   I can run with OMPI_MCA_mtl=ofi and 
> OMPI_MCA_btl="self,vader,openib" but it eventually crashes in libopen-pal.so 
> (MPI_Waitall() higher up the stack).
> 
> GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0.  
> Backtrace:
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]
> 
> Questions: Am I using the OFI MTL as intended?   Should there be an “ofi” 
> BTL?   Does anyone use this?
> 
> Thanks,
> 
> Charlie Taylor
> UF Research Computing
> 
> PS - If you could use some help updating the FAQs, I’d be willing to put in 
> some time.  I’d probably learn a lot.
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.open-2Dmpi.org_mailman_listinfo_users&d=DwIFaQ&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=8sBODgXZKw_dNqkFqkTqbGD3_7nNlm_pat-D6AqiaC8&m=EGR5U297e0v1wN5gzlnqAsj7sHLpSN3I_tjwpfbJQAI&s=k64is7lySeSVrkP8ys8ZIVuVHRY6VJpxBEXU1dXczAY&e=

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
