Hello,

Thank you for your answer.
But below 129 processes the code runs well even with these warnings. I get the
following warnings:

1. WARNING: There are more than one active ports on host 'compute-0-0.local'
2. WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
3. open_hca: getaddr_netdev ERROR: Success. Is ib1 configured?
4. open_hca: device mthca0 not found
5. library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
6. WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory.

It looks messy, but these warnings are not critical, or am I wrong? Warnings
1-3 seem to refer to the unconfigured second port; should I configure a network
interface for it? I tried to fix warning 6 by following the manual, but it did
not work. What about warning 5: what is this library libdaplscm.so.2? Do you
think it would be better to install MLNX_OFED on my nodes, since my NICs are
from Mellanox? I have put a few sketches of what I am considering for each
group of warnings after the quoted output below; please tell me if I am on the
wrong track.

Regards,
Svyatoslav

On Mon, Mar 11, 2013 at 4:04 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Did you check the FAQ entries listed in all the warning messages that you're
> getting? You should probably fix those first.
>
> Sent from my phone. No type good.
>
> On Mar 10, 2013, at 4:30 AM, "Svyatoslav Korneev"
> <svyatoslav.korn...@gmail.com> wrote:
>
>> Dear Community,
>>
>> I have 4 computing nodes and a front-end. The computing nodes are connected
>> via IB and Ethernet; the front-end has Ethernet only. Each computing node
>> has 4 CPUs on board, each CPU has 16 cores, so the total number of cores per
>> node is 64. The IB network controller is a Mellanox MT26428, and the IB
>> switch is a QLogic 12000. I installed Rocks Cluster Linux 6.1 on my cluster,
>> and this system ships Open MPI out of the box. ompi_info reports Open MPI
>> version 1.6.2.
>>
>> The good news is that IB and Open MPI with IB support work out of the box,
>> but I am facing a really strange bug. If I try to run HelloWorld on more
>> than 129 processes with IB support, it gives me a Segmentation fault error.
>> I get this error even if I try to start it on one node. Below 129 processes
>> everything works fine, on 1 node or on 4 nodes, apart from the warning
>> messages listed below. Without IB support everything works fine on an
>> arbitrary number of processes.
>>
>> Does anybody have an idea regarding my issue?
>>
>> Thank you.
>>
>>
>> Warning messages for HelloWorld, running on two processes (one process
>> per node):
>>
>> mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -hostfile
>> machinefile -n 2 a.out
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x0000, part ID 0
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: default
>> --------------------------------------------------------------------------
>> WARNING: There are more than one active ports on host
>> 'compute-0-0.local', but the
>> default subnet GID prefix was detected on more than one of these
>> ports. If these ports are connected to different physical IB
>> networks, this configuration will fail in Open MPI.
>> This version of Open MPI requires that every physically separate IB subnet
>> that is used between connected MPI processes must have different subnet ID
>> values.
>>
>> Please see this FAQ entry for more details:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>>
>> NOTE: You can turn off this warning by setting the MCA parameter
>> btl_openib_warn_default_gid_prefix to 0.
>> --------------------------------------------------------------------------
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x0000, part ID 0
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: default
>> --------------------------------------------------------------------------
>> WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
>> This may be a real error or it may be an invalid entry in the uDAPL
>> Registry which is contained in the dat.conf file. Contact your local
>> System Administrator to confirm the availability of the interfaces in
>> the dat.conf file.
>> --------------------------------------------------------------------------
>> compute-0-0.local:20229: open_hca: getaddr_netdev ERROR: Success. Is
>> ib1 configured?
>> compute-0-0.local:20229: open_hca: device mthca0 not found
>> compute-0-0.local:20229: open_hca: device mthca0 not found
>> compute-0-1.local:58701: open_hca: getaddr_netdev ERROR: Success. Is
>> ib1 configured?
>> compute-0-1.local:58701: open_hca: device mthca0 not found
>> compute-0-1.local:58701: open_hca: device mthca0 not found
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host: compute-0-0.local
>> Registerable memory: 32768 MiB
>> Total memory: 262125 MiB
>>
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> Hello world from process 0 of 2
>> Hello world from process 1 of 2
>> [compute-0-0.local:20227] 1 more process has sent help message
>> help-mpi-btl-openib.txt / default subnet prefix
>> [compute-0-0.local:20227] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>> [compute-0-0.local:20227] 9 more processes have sent help message
>> help-mpi-btl-udapl.txt / dat_ia_open fail
>> [compute-0-0.local:20227] 3 more processes have sent help message
>> help-mpi-btl-openib.txt / reg mem limit low
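
For warning 1, since only one port per HCA is actually cabled to the switch, I
was thinking of either configuring ib1 or simply restricting the openib BTL to
the active port and turning off the default GID prefix warning, as the warning
text itself suggests. Something along these lines (I am assuming here that the
HCA shows up as mlx4_0 and that port 1 is the cabled one; the device name on my
nodes may differ):

mpirun --mca btl_openib_if_include mlx4_0:1 \
       --mca btl_openib_warn_default_gid_prefix 0 \
       --mca btl ^tcp -hostfile machinefile -n 2 a.out

If the second ports are ever cabled to a separate fabric, my understanding is
that the proper fix is a different subnet prefix per fabric in the subnet
manager, as the FAQ entry describes.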
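
For warnings 2-5, as far as I can tell they all come from the udapl BTL rather
than from the openib BTL that actually carries the traffic: libdaplscm.so.2 is
one of the uDAPL provider libraries referenced from dat.conf, and the mthca0 /
ib1 probes are uDAPL walking entries for hardware I do not have. If that is
right, listing only the BTLs I need should silence them without touching the IB
path, for example:

mpirun --mca btl openib,sm,self -hostfile machinefile -n 2 a.out

Alternatively, I suppose I could prune the stale entries from dat.conf (the
exact path depends on the OFED packaging on my nodes), but since I do not use
uDAPL at all, disabling that BTL seems simpler.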
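
For warning 6, if I am reading the FAQ item correctly, the registerable memory
is limited by the MTT parameters of the ConnectX driver (mlx4_core on my
nodes). If log_num_mtt is 20 and log_mtts_per_seg is 3, which I believe are the
defaults, then with a 4 KiB page size that works out to 2^20 * 2^3 * 4 KiB =
32 GiB, matching the 32768 MiB reported above. To cover twice the ~256 GiB of
physical memory I believe I would need something like this (file name and
values are my guess; the module has to be reloaded or the node rebooted for it
to take effect):

# /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3

and then, I think, verify with:

cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

Is that the change the FAQ is referring to, or did I misread it?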