Dear Community,

I have 4 computing nodes and a front-end. The computing nodes are connected via
IB and Ethernet, while the front-end has Ethernet only. Each computing node has 4
CPUs on board, each CPU has 16 cores, so the total number of cores per node is
64. The IB network controller is a Mellanox MT26428, and the IB switch is a QLogic
12000. I installed Rocks Cluster Linux 6.1 on my cluster, and this
system ships with Open MPI out of the box. ompi_info reports the Open MPI
version as 1.6.2.

The good news is that IB and Open MPI with IB support work out of the box,
but I have hit a really strange bug. If I try to run HelloWorld on more
than 129 processes with IB support, it gives me a Segmentation fault
error. I get this error even if I try to start it on one node. Below
129 processes everything works fine, on 1 node or on 4 nodes,
except for warning messages (listed below). Without IB support
everything works fine on an arbitrary number of processes.
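
For reference, the test program is nothing special; below is a minimal sketch
of the standard MPI hello world that I compile with mpicc and run as a.out
(the actual source may differ slightly in the details):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      /* Initialize MPI and query this process's rank and the world size */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Produces the "Hello world from process X of Y" lines in the log below */
      printf("Hello world from process %d of %d\n", rank, size);

      MPI_Finalize();
      return 0;
  }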

Does anybody have an idea regarding my issue?

Thank you.


Warning messages for HelloWorld, running on two processes (one process
per node):

 mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -hostfile
machinefile -n 2 a.out
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x0000, part ID 0
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: default
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host
'compute-0-0.local', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x0000, part ID 0
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: default
--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
compute-0-0.local:20229:  open_hca: getaddr_netdev ERROR: Success. Is
ib1 configured?
compute-0-0.local:20229:  open_hca: device mthca0 not found
compute-0-0.local:20229:  open_hca: device mthca0 not found
compute-0-1.local:58701:  open_hca: getaddr_netdev ERROR: Success. Is
ib1 configured?
compute-0-1.local:58701:  open_hca: device mthca0 not found
compute-0-1.local:58701:  open_hca: device mthca0 not found
DAT: library load failure: libdaplscm.so.2: cannot open shared object
file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object
file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object
file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object
file: No such file or directory
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.


See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              compute-0-0.local
  Registerable memory:     32768 MiB
  Total memory:            262125 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 26428
[compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
Hello world from process 0 of 2
Hello world from process 1 of 2
[compute-0-0.local:20227] 1 more process has sent help message
help-mpi-btl-openib.txt / default subnet prefix
[compute-0-0.local:20227] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[compute-0-0.local:20227] 9 more processes have sent help message
help-mpi-btl-udapl.txt / dat_ia_open fail
[compute-0-0.local:20227] 3 more processes have sent help message
help-mpi-btl-openib.txt / reg mem limit low
