Dear Roland,

Thank you so much. This was very helpful.
Best,
Rio

>>>>>> "Mike" == Mike Dubman <mi...@dev.mellanox.co.il> writes:
>
> Mike> so, it seems you have old ofed w/o this parameter. Can you
> Mike> install latest Mellanox ofed? or check which community ofed
> Mike> has it?
>
> Rio is using the kernel.org drivers that are part of Ubuntu/3.13.x and
> log_num_mtt is not a parameter in those drivers. In fact log_num_mtt
> has never been a parameter in the kernel.org sources (just checked the
> git commit history). And it's not needed anymore either, since the
> following commit (which is also part of OFED 3.12 btw; Mike, seems
> Mellanox OFED is behind with this respect):
> -----------------------------------------------------------
> commit db5a7a65c05867cb6ff5cb6d556a0edfce631d2d
> Author: Roland Dreier <rol...@purestorage.com>
> Date: Mon Mar 5 10:05:28 2012 -0800
>
>     mlx4_core: Scale size of MTT table with system RAM
>
>     The current driver defaults to 1M MTT segments, where each segment holds
>     8 MTT entries. This limits the total memory registered to 8M * PAGE_SIZE
>     which is 32GB with 4K pages. Since systems that have much more memory
>     are pretty common now (at least among systems with InfiniBand hardware),
>     this limit ends up getting hit in practice quite a bit.
>
>     Handle this by having the driver allocate at least enough MTT entries to
>     cover 2 * totalram pages.
>
>     Signed-off-by: Roland Dreier <rol...@purestorage.com>
> -----------------------------------------------------------
>
> The relevant code segment (drivers/net/ethernet/mellanox/mlx4/profile.c):
>
> -----------------------------------------------------------
> /*
>  * We want to scale the number of MTTs with the size of the
>  * system memory, since it makes sense to register a lot of
>  * memory on a system with a lot of memory. As a heuristic,
>  * make sure we have enough MTTs to cover twice the system
>  * memory (with PAGE_SIZE entries).
>  *
>  * This number has to be a power of two and fit into 32 bits
>  * due to device limitations, so cap this at 2^31 as well.
>  * That limits us to 8TB of memory registration per HCA with
>  * 4KB pages, which is probably OK for the next few months.
>  */
> si_meminfo(&si);
> request->num_mtt =
>         roundup_pow_of_two(max_t(unsigned, request->num_mtt,
>                                  min(1UL << (31 - log_mtts_per_seg),
>                                      si.totalram >> (log_mtts_per_seg - 1))));
> -----------------------------------------------------------
>
> So the point here is that OpenMPI should check the mlx4 driver versions
> and not output false warnings when newer drivers are used. Didn't check
> whether this is fixed in the OpenMPI code repositories yet. It's not
> fixed in 1.8.2rc4 anyway (static uint64_t calculate_max_reg in
> ompi/mca/btl/openib/btl_openib.c). Also, the OpenMPI FAQ should be
> corrected accordingly.
>
> Rio as a note for you: You can safely ignore the warning.
>
> Cheers,
>
> Roland
>
> -------
> http://www.q-leap.com / http://qlustar.com
> --- HPC / Storage / Cloud Linux Cluster OS ---
>
> Mike> On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota
> Mike> <rioyok...@mac.com> wrote:
>
>>> Here is what "modinfo mlx4_core" gives
>>>
>>> filename:       /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
>>> version:        2.2-1
>>> license:        Dual BSD/GPL
>>> description:    Mellanox ConnectX HCA low-level driver
>>> author:         Roland Dreier
>>> srcversion:     3AE29A0A6538EBBE9227361
>>> alias:          pci:v000015B3d00001010sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Fsv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Esv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Dsv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Csv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Bsv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000100Asv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001009sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001008sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001007sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001006sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001005sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001004sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001003sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00001002sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000676Esv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006746sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006764sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000675Asv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006372sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006750sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006368sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000673Csv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006732sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006354sv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d0000634Asv*sd*bc*sc*i*
>>> alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
>>> depends:
>>> intree:         Y
>>> vermagic:       3.13.0-34-generic SMP mod_unload modversions
>>> signer:         Magrathea: Glacier signing key
>>> sig_key:        50:0B:C5:C8:7D:4B:11:5C:F3:C1:50:4F:7A:92:E2:33:C6:14:3D:58
>>> sig_hashalgo:   sha512
>>> parm:           debug_level:Enable debug tracing if > 0 (int)
>>> parm:           msi_x:attempt to use MSI-X if nonzero (int)
>>> parm:           num_vfs:enable #num_vfs functions if num_vfs > 0 num_vfs=port1,port2,port1+2 (array of byte)
>>> parm:           probe_vf:number of vfs to probe by pf driver (num_vfs > 0) probe_vf=port1,port2,port1+2 (array of byte)
>>> parm:           log_num_mgm_entry_size:log mgm size, that defines the num of qp per mcg, for example: 10 gives 248.range: 7 <= log_num_mgm_entry_size <= 12.
>>>                 To activate device managed flow steering when available, set to -1 (int)
>>> parm:           enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the FW supports this (default: True) (bool)
>>> parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
>>> parm:           log_num_vlan:Log2 max number of VLANs per ETH port (0-7) (int)
>>> parm:           use_prio:Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
>>> parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
>>> parm:           port_type_array:Array of port types: HW_DEFAULT (0) is default 1 for IB, 2 for Ethernet (array of int)
>>> parm:           enable_qos:Enable Quality of Service support in the HCA (default: off) (bool)
>>> parm:           internal_err_reset:Reset device on internal errors if non-zero (default 1, in SRIOV mode default is 0) (int)
>>>
>>> most likely you installing old ofed which does not have this
>>> parameter:
>>>
>>> try:
>>>
>>> #modinfo mlx4_core
>>>
>>> and see if it is there. I would suggest install latest OFED or
>>> Mellanox OFED.
>>>
>>>
>>> On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota <rioyok...@mac.com>
>>> wrote:
>>>
>>>> I get "ofed_info: command not found". Note that I don't install
>>>> the entire OFED, but do a component wise installation by doing
>>>> "apt-get install infiniband-diags ibutils ibverbs-utils
>>>> libmlx4-dev" for the drivers and utilities.
>>>>
>>>> Hi, what ofed version do you use? (ofed_info -s)
>>>>
>>>>
>>>> On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota <rioyok...@mac.com>
>>>> wrote:
>>>>
>>>>> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI
>>>>> gives the following warning upon execution, which did not
>>>>> appear before the upgrade.
>>>>>
>>>>> WARNING: It appears that your OpenFabrics subsystem is
>>>>> configured to only allow registering part of your physical
>>>>> memory. This can cause MPI jobs to run with erratic
>>>>> performance, hang, and/or crash.
>>>>>
>>>>> Everything that I could find on google suggests to change
>>>>> log_num_mtt, but I cannot do this for the following reasons:
>>>>> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
>>>>> 2. Adding "options mlx4_core log_num_mtt=24" to
>>>>>    /etc/modprobe.d/mlx4.conf doesn't seem to change anything
>>>>> 3. I am not sure how I can restart the driver because there is no
>>>>>    "/etc/init.d/openibd" file (I've rebooted the system but it
>>>>>    didn't do anything to create log_num_mtt)
>>>>>
>>>>> [Template information]
>>>>> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install
>>>>>    infiniband-diags ibutils ibverbs-utils libmlx4-dev"
>>>>> 2. OS is Ubuntu 14.04 LTS
>>>>> 3. Subnet manager is from the Ubuntu distribution using
>>>>>    "apt-get install opensm"
>>>>> 4. Output of ibv_devinfo is:
>>>>>    hca_id: mlx4_0
>>>>>            transport:        InfiniBand (0)
>>>>>            fw_ver:           2.10.600
>>>>>            node_guid:        0002:c903:003d:52b0
>>>>>            sys_image_guid:   0002:c903:003d:52b3
>>>>>            vendor_id:        0x02c9
>>>>>            vendor_part_id:   4099
>>>>>            hw_ver:           0x0
>>>>>            board_id:         MT_1100120019
>>>>>            phys_port_cnt:    1
>>>>>            port: 1
>>>>>                    state:        PORT_ACTIVE (4)
>>>>>                    max_mtu:      4096 (5)
>>>>>                    active_mtu:   4096 (5)
>>>>>                    sm_lid:       1
>>>>>                    port_lid:     1
>>>>>                    port_lmc:     0x00
>>>>>                    link_layer:   InfiniBand
>>>>> 5. Output of ifconfig for IB is
>>>>>    ib0  Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>>>>>         inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
>>>>>         inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>>>>>         UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>>>>>         RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>>>>>         TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>>>>>         collisions:0 txqueuelen:256
>>>>>         RX bytes:5843 (5.8 KB) TX bytes:4324 (4.3 KB)
>>>>> 6. ulimit -l is "unlimited"
>>>>>
>>>>> Thanks, Rio
>>>>> _______________________________________________
>>>>> users mailing list us...@open-mpi.org Subscription:
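
For reference, the sizing heuristic in the profile.c snippet quoted in Roland's reply can be worked through in userspace. This is only a sketch, not the driver code: the function names are made up for illustration, the arithmetic is done in 64 bits (the kernel uses max_t(unsigned, ...)), and the 1M-segment default and 4K page size are taken from the quoted commit message rather than read from a running system.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= v; userspace stand-in for the kernel's
 * roundup_pow_of_two(). Assumes v <= 2^63. */
static uint64_t roundup_pow2(uint64_t v)
{
    uint64_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Number of MTT segments after scaling with RAM, following the quoted
 * heuristic: cover twice the system memory, capped at 2^31 MTT entries.
 *   default_segs      driver default (1M segments per the commit message)
 *   totalram_pages    system RAM in pages (si.totalram in the kernel)
 *   log_mtts_per_seg  module parameter, valid range 1-7 per modinfo
 */
static uint64_t scaled_mtt_segments(uint64_t default_segs,
                                    uint64_t totalram_pages,
                                    unsigned log_mtts_per_seg)
{
    assert(log_mtts_per_seg >= 1 && log_mtts_per_seg <= 7);

    uint64_t cap    = UINT64_C(1) << (31 - log_mtts_per_seg);
    uint64_t want   = totalram_pages >> (log_mtts_per_seg - 1);
    uint64_t scaled = want < cap ? want : cap;
    uint64_t segs   = default_segs > scaled ? default_segs : scaled;

    return roundup_pow2(segs);
}

/* Upper bound on registerable memory: segments * entries/segment * page size. */
static uint64_t max_registerable_bytes(uint64_t segs,
                                       unsigned log_mtts_per_seg,
                                       uint64_t page_size)
{
    return (segs << log_mtts_per_seg) * page_size;
}
```

With the old fixed default (1M segments, 8 entries per segment, 4K pages), max_registerable_bytes() gives 32 GiB, the limit named in the commit message. For a machine with 64 GiB of RAM (2^24 pages), scaled_mtt_segments() returns 2^22 segments, which covers 128 GiB, i.e. twice the RAM. For very large memory sizes the result stops at 2^28 segments, which is the 8TB cap mentioned in the code comment.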