Don't include udapl - that code may well be stale
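
If rebuilding right away isn't convenient, the uDAPL BTL can also be switched off at run time; a minimal sketch, assuming the usual Open MPI 1.6 MCA syntax (the caret excludes the named component):

    mpirun --mca btl ^udapl -n 1 hello

For the rebuild itself, simply drop --with-udapl from configure_options.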

Sent from my iPhone

On Jun 23, 2013, at 3:42 AM, dani <d...@letai.org.il> wrote:

> Hi,
> 
> I've encountered strange issues when trying to run a simple MPI job on a 
> single host that has InfiniBand.
> The complete error output:
> 
>> -> mpirun -n 1 hello
>> --------------------------------------------------------------------------
>> WARNING: Failed to open "ofa-v2-mlx4_0-1" 
>> [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED]. 
>> This may be a real error or it may be an invalid entry in the uDAPL
>> Registry which is contained in the dat.conf file. Contact your local
>> System Administrator to confirm the availability of the interfaces in
>> the dat.conf file.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [[53031,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>> 
>> Module: uDAPL
>>   Host: n01
>> 
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory.  This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>> 
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered.  You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>> 
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>> 
>>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>> 
>>   Local host:              n01
>>   Registerable memory:     32768 MiB
>>   Total memory:            65503 MiB
>> 
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> Process 0 on n01 out of 1
>> [n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / 
>> dat_ia_open fail
>> [n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>> help / error messages
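
On the second warning (32768 MiB registerable vs. 65503 MiB total): the FAQ item quoted above covers the mlx4 MTT settings. A rough sketch, assuming the OFED mlx4_core module parameters log_num_mtt and log_mtts_per_seg and a 4 KiB page size, where max registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page_size:

    # e.g. in /etc/modprobe.d/mlx4_core.conf -- values are illustrative only
    options mlx4_core log_num_mtt=24 log_mtts_per_seg=1

    # reload the driver stack afterwards
    service openibd restart

Those illustrative values work out to 2^24 * 2 * 4 KiB = 128 GiB, i.e. about twice the 64 GB of physical memory reported above.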
> My setup and other relevant info follow:
> OS: CentOS 6.3 x86_64
> Installed OFED 3.5 from source (./install.pl --all)
> Installed Open MPI 1.6.4 with the following build parameters:
>> rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define '_prefix 
>> /opt/openmpi/1.6.4/gcc' --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' 
>> --define '_mandir %{_prefix}/share/man' --define '_datadir %{_prefix}/share' 
>> --define 'configure_options --with-openib=/usr 
>> --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran 
>> --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu 
>> --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities 
>> --with-udapl --with-sge --disable-vt' --define 'use_default_rpm_opt_flags 1' 
>> --define '_name openmpi-1.6.4_gcc' --define 'install_shell_scripts 1' 
>> --define 'shell_scripts_basename mpivars' --define '_usr /usr' --define 
>> 'ofed 0' 2>&1 | tee openmpi.build.sge
> (--disable-vt was used because CUDA is present on the system; VampirTrace 
> links against it automatically, which turns it into a dependency with no 
> matching RPM.)
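
To double-check which BTL components actually ended up in the rebuilt package, ompi_info can list them; a quick sketch:

    ompi_info | grep btl

The udapl entry should disappear from that list once the RPM is rebuilt without --with-udapl.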
> 
> Max locked memory is unlimited:
>> ->ulimit -a
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 515028
>> max locked memory       (kbytes, -l) unlimited
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 10240
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 1024
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
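
Since the build is SGE-aware, it may also be worth confirming the limit as seen by processes that mpirun itself launches (an interactive ulimit does not always carry over to scheduler- or daemon-launched jobs); a small sketch:

    mpirun -n 1 bash -c 'ulimit -l'

This should print "unlimited" if the limit really applies to the MPI processes.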
> IB devices are present:
>> ->ibv_devinfo
>> hca_id:    mlx4_0
>>     transport:            InfiniBand (0)
>>     fw_ver:                2.9.1000
>>     node_guid:            0002:c903:004d:b0e2
>>     sys_image_guid:            0002:c903:004d:b0e5
>>     vendor_id:            0x02c9
>>     vendor_part_id:            26428
>>     hw_ver:                0xB0
>>     board_id:            MT_0D90110009
>>     phys_port_cnt:            1
>>         port:    1
>>             state:            PORT_ACTIVE (4)
>>             max_mtu:        4096 (5)
>>             active_mtu:        4096 (5)
>>             sm_lid:            2
>>             port_lid:        53
>>             port_lmc:        0x00
>>             link_layer:        InfiniBand
> 
> The hello program source:
>> ->cat hello.c
>> #include <stdio.h>
>> #include <mpi.h>
>> 
>> int main(int argc, char *argv[]) {
>>   int numprocs, rank, namelen;
>>   char processor_name[MPI_MAX_PROCESSOR_NAME];
>> 
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   MPI_Get_processor_name(processor_name, &namelen);
>> 
>>   printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
>> 
>>   MPI_Finalize();
>>   return 0;
>> }
> Simply compiled as:
>> mpicc hello.c -o hello
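
Once rebuilt without uDAPL, a run that pins the BTL selection and raises the selection verbosity can confirm that the openib path is actually usable; a sketch, using the standard btl and btl_base_verbose MCA parameters:

    mpirun -n 2 --mca btl openib,sm,self --mca btl_base_verbose 30 ./hello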
> 
> The IB kernel modules appear to be loaded:
>> ->service openibd status
>> 
>>   HCA driver loaded
>> 
>> Configured IPoIB devices:
>> ib0
>> 
>> Currently active IPoIB devices:
>> ib0
>> 
>> The following OFED modules are loaded:
>> 
>>   rdma_ucm
>>   rdma_cm
>>   ib_addr
>>   ib_ipoib
>>   mlx4_core
>>   mlx4_ib
>>   mlx4_en
>>   ib_mthca
>>   ib_uverbs
>>   ib_umad
>>   ib_sa
>>   ib_cm
>>   ib_mad
>>   ib_core
>>   iw_cxgb3
>>   iw_cxgb4
>>   iw_nes
>>   ib_qib
> 
> Can anyone help?
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
