Hi: Setting mpi_leave_pinned to 0 allows my application to run to
completion when running with openib active. I realize that it's probably
not going to help my application's performance, but since "ON" is the
default, I'd like to understand what's happening. There's definitely a
dependence on problem size: smaller problems run to completion while
larger problems hang at different points in the code. Are there buffer
sizes (or other BTL settings) I can adjust to understand my problem
better?
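
For reference, here is a minimal sketch of one common way to turn the parameter off and to list the openib BTL parameters that could be adjusted. The application name ./my_app and the process count of 4 are placeholders, not the real job, and the ompi_info invocation assumes the 1.3-series syntax.

```shell
# Sketch only: ./my_app and "-np 4" are placeholders for the real job.

# Disable leave_pinned for every mpirun started from this shell.
# This is the environment-variable form of "--mca mpi_leave_pinned 0".
export OMPI_MCA_mpi_leave_pinned=0

# List the openib BTL parameters (eager limits, buffer counts, etc.),
# if Open MPI's ompi_info is available on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param btl openib || true
fi

# Launch the application only when both mpirun and the binary exist.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -np 4 ./my_app
fi
```

The same setting can be given per run as "mpirun --mca mpi_leave_pinned 0 ...", which avoids changing the environment for other jobs.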

Thanks,
Allen

On Thu, 2009-08-13 at 10:11 +0300, Lenny Verkhovsky wrote:
> Hi, 
> 1.
> Mellanox has newer fw for those HCAs:
> http://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx
> 
> I am not sure if it will help, but newer fw usually includes some bug
> fixes.
> 
> 2.
> Try disabling leave_pinned during the run. It's on by default in
> 1.3.3.
> 
> Lenny.
> 
> On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett
> <al...@transpireinc.com> wrote:
>         Hi:
>         I recently tried to build my MPI application against OpenMPI
>         1.3.3. It worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it
>         hangs part way through. It does a fair amount of comm, but
>         eventually it stops in a Send/Recv point-to-point exchange. If
>         I turn off the openib btl, it runs to completion. Also, I built
>         1.3.3 with memchecker (which is very nice; thanks to everyone
>         who worked on that!) and it runs to completion, even with
>         openib active.
>         
>         Our cluster consists of dual dual-core opteron boxes with
>         Mellanox MT25204 (InfiniHost III Lx) HCAs and a Mellanox
>         MT47396 Infiniscale-III switch. We're running RHEL 4.8 which
>         appears to include OFED 1.4. I've built everything using GCC
>         4.3.2. Here is the output from ibv_devinfo. "ompi_info --all"
>         is attached.
>         $ ibv_devinfo
>         hca_id: mthca0
>                fw_ver:                         1.1.0
>                node_guid:                      0002:c902:0024:3284
>                sys_image_guid:                 0002:c902:0024:3287
>                vendor_id:                      0x02c9
>                vendor_part_id:                 25204
>                hw_ver:                         0xA0
>                board_id:                       MT_03B0140002
>                phys_port_cnt:                  1
>                        port:   1
>                                state:                  active (4)
>                                max_mtu:                2048 (4)
>                                active_mtu:             2048 (4)
>                                sm_lid:                 1
>                                port_lid:               1
>                                port_lmc:               0x00
>         
>         I'd appreciate any tips for debugging this.
>         Thanks,
>         Allen

