Hi:

Setting mpi_leave_pinned to 0 allows my application to run to completion when running with openib active. I realize that it's probably not going to help my application's performance, but since "ON" is the default, I'd like to understand what's happening. There's definitely a dependence on problem size: smaller problems run to completion while larger problems hang at different points in the code. Are there buffer sizes (or other BTL settings) I can adjust to understand my problem better?
Thanks,
Allen

On Thu, 2009-08-13 at 10:11 +0300, Lenny Verkhovsky wrote:
> Hi,
>
> 1. The Mellanox has a newer fw for those HCAs
>    http://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx
>
>    I am not sure if it will help, but newer fw usually have some bug
>    fixes.
>
> 2. try to disable leave_pinned during the run. It's on by default in
>    1.3.3
>
> Lenny.
>
> On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett
> <al...@transpireinc.com> wrote:
> > Hi:
> > I recently tried to build my MPI application against OpenMPI 1.3.3.
> > It worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs part
> > way through. It does a fair amount of comm, but eventually it stops
> > in a Send/Recv point-to-point exchange. If I turn off the openib
> > btl, it runs to completion. Also, I built 1.3.3 with memchecker
> > (which is very nice; thanks to everyone who worked on that!) and it
> > runs to completion, even with openib active.
> >
> > Our cluster consists of dual dual-core opteron boxes with Mellanox
> > MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396
> > Infiniscale-III switch. We're running RHEL 4.8 which appears to
> > include OFED 1.4. I've built everything using GCC 4.3.2. Here is
> > the output from ibv_devinfo. "ompi_info --all" is attached.
> >
> > $ ibv_devinfo
> > hca_id: mthca0
> >         fw_ver:             1.1.0
> >         node_guid:          0002:c902:0024:3284
> >         sys_image_guid:     0002:c902:0024:3287
> >         vendor_id:          0x02c9
> >         vendor_part_id:     25204
> >         hw_ver:             0xA0
> >         board_id:           MT_03B0140002
> >         phys_port_cnt:      1
> >                 port:   1
> >                         state:          active (4)
> >                         max_mtu:        2048 (4)
> >                         active_mtu:     2048 (4)
> >                         sm_lid:         1
> >                         port_lid:       1
> >                         port_lmc:       0x00
> >
> > I'd appreciate any tips for debugging this.
> > Thanks,
> > Allen
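
For anyone following the thread, the configurations discussed above can be selected from the mpirun command line roughly as sketched below. The application name `./my_app` and the process count are placeholders, not taken from the original posts; only the MCA parameter names come from the thread.

```shell
# Placeholders (assumptions, not from the thread): ./my_app and -np 8.

# Default 1.3.3 behaviour with openib active (leave_pinned on):
mpirun -np 8 ./my_app

# Workaround from the thread: turn leave_pinned off for this run.
mpirun -np 8 --mca mpi_leave_pinned 0 ./my_app

# Exclude the openib BTL entirely, so another transport is used:
mpirun -np 8 --mca btl '^openib' ./my_app

# List the openib BTL parameters this build exposes, with their
# current values, as a starting point for size-related tuning:
ompi_info --param btl openib
```

The last command is probably the quickest way to see which size limits and other openib settings exist in a given build before experimenting with them one at a time.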