I see what you mean. Below is the output (filtered for a single host). Our setup is very generic.
Dell SOS6320 hosts (Haswell)
Mellanox ConnectX-3 HCAs (mlx4 drivers - native RHEL, not mofed)
FDR/EDR switches (stand-alone opensm)
RHEL7.4
slurm 16.05.11
pmix (pmix-1.1.5-1.el7.x86_64)
openmpi (3.0.0, 3.1.0)

Apps include the well-known LAMMPS, VASP, GROMACS, amber, raxml, espresso, and namd2 (i.e. the usual list of research university apps).

gadget/gizmo/arepo are really the only ones giving us trouble, but I know they run fine under both openmpi and impi/mpich/mvapich at other sites. I’m trying to figure out why we can’t seem to run them reliably, but I’d also like to get up to date with our transport APIs. It seems we’ve fallen behind and are just doing the things we’ve always done (openib BTL).

I’ll try running with a modified “provider_include” list and see what happens. The fi_info output shows the verbs, udp, and sockets providers.

Thanks,

Charlie

[chasman@login4 mufasa]$ grep 'c29a-s2.ufhpc' mz0.e
[c29a-s2.ufhpc:01463] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01463] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01463] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01463] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01464] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01464] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01464] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01464] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01465] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01465] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01465] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01465] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01466] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01466] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01466] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01466] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01463] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01463] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01464] select: initializing mtl component ofi
[c29a-s2.ufhpc:01465] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01465] select: initializing mtl component ofi
[c29a-s2.ufhpc:01466] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01466] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01464] select: init returned failure for component ofi
[c29a-s2.ufhpc:01464] select: no component selected
[c29a-s2.ufhpc:01464] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01464] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01465] select: init returned failure for component ofi
[c29a-s2.ufhpc:01465] select: no component selected
[c29a-s2.ufhpc:01465] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01465] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01463] select: init returned failure for component ofi
[c29a-s2.ufhpc:01463] select: no component selected
[c29a-s2.ufhpc:01466] select: init returned failure for component ofi
[c29a-s2.ufhpc:01466] select: no component selected
[c29a-s2.ufhpc:01466] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01466] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01463] mca: base: close: unloading component ofi

> On Jun 14, 2018, at 7:48 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hello Charles
>
> You are heading in the right direction.
>
> First you might want to run the libfabric fi_info command to see what
> capabilities you picked up from the libfabric RPMs.
>
> Next you may well not actually be using the OFI mtl.
>
> Could you run your app with
>
> export OMPI_MCA_mtl_base_verbose=100
>
> and post the output?
>
> It would also help if you described the system you are using: OS,
> interconnect, CPU type, etc.
>
> Howard
>
> Charles A Taylor <chas...@ufl.edu> wrote on Thu, June 14, 2018 at 06:36:
> Because of the issues we are having with OpenMPI and the openib BTL
> (questions previously asked), I’ve been looking into what other transports
> are available. I was particularly interested in OFI/libfabric support but
> cannot find any information on it more recent than a reference to the usNIC
> BTL from 2015 (Jeff Squyres, Cisco). Unfortunately, the openmpi-org website
> FAQs covering OpenFabrics support don’t mention anything beyond OpenMPI 1.8.
> Given that 3.1 is the current stable version, that seems odd.
>
> That being the case, I thought I’d ask here. After laying down the
> libfabric-devel RPM and building (3.1.0) with --with-libfabric=/usr, I end up
> with an “ofi” MTL but nothing else. I can run with OMPI_MCA_mtl=ofi and
> OMPI_MCA_btl="self,vader,openib" but it eventually crashes in libopen-pal.so
> (mpi_waitall() higher up the stack).
>
> GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0.
> Backtrace:
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]
>
> Questions: Am I using the OFI MTL as intended? Should there be an “ofi”
> BTL? Does anyone use this?
>
> Thanks,
>
> Charlie Taylor
> UF Research Computing
>
> PS - If you could use some help updating the FAQs, I’d be willing to put in
> some time. I’d probably learn a lot.
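
For anyone reading a log like the one above: each of the four processes reports provider_include = "psm,psm2,gni", rejects "verbs" and "sockets" as not in the include list, and then fails with "no provider found". With more hosts that gets tedious to read by eye, so here is a minimal Python sketch that condenses such output per process. The function name summarize_mtl_log and the SAMPLE text are hypothetical, introduced only for illustration; the regular expressions assume the exact message wording shown in the log above, which may differ in other Open MPI versions.

```python
import re

# Patterns assume the message format seen in the mtl_base_verbose output above.
ENTRY_RE = re.compile(r'^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+(?P<msg>.*)$')
INCLUDE_RE = re.compile(r'mtl:ofi:provider_include = "([^"]*)"')
REJECT_RE = re.compile(r'mtl:ofi: "([^"]+)" not in include list')
NO_PROVIDER = "select_ofi_provider: no provider found"


def summarize_mtl_log(lines):
    """Hypothetical helper: map each PID to its include list, the providers
    it rejected, and whether it reported that no provider was found."""
    summary = {}
    for line in lines:
        entry = ENTRY_RE.match(line.strip())
        if entry is None:
            continue  # skip anything that is not a "[host:pid] message" entry
        pid, msg = entry.group("pid"), entry.group("msg")
        info = summary.setdefault(
            pid, {"include": [], "rejected": [], "no_provider": False}
        )
        include = INCLUDE_RE.search(msg)
        if include:
            info["include"] = [p for p in include.group(1).split(",") if p]
        rejected = REJECT_RE.search(msg)
        # Record each rejected provider once, in first-seen order.
        if rejected and rejected.group(1) not in info["rejected"]:
            info["rejected"].append(rejected.group(1))
        if NO_PROVIDER in msg:
            info["no_provider"] = True
    return summary


# A few entries copied from the log above, used as sample input.
SAMPLE = """\
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:410: select_ofi_provider: no provider found
"""

if __name__ == "__main__":
    for pid, info in sorted(summarize_mtl_log(SAMPLE.splitlines()).items()):
        print(pid, info)
```

To run it against a real file, something like summarize_mtl_log(open("mz0.e")) should work, since the function only needs an iterable of lines. The sketch only summarizes what the log says; it cannot tell whether a provider added to the include list actually offers what the OFI MTL needs.
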
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users