Hi,

1. Mellanox has newer firmware for those HCAs:
   http://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx
   I am not sure if it will help, but newer firmware usually contains bug fixes.

2. Try disabling leave_pinned during the run. It is on by default in 1.3.3.

Lenny
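A sketch of one way to disable it ("my_app" and the process count are placeholders; adapt them to your actual launch line):

    $ mpirun --mca mpi_leave_pinned 0 -np 4 ./my_app

or, equivalently, through the environment before launching:

    $ export OMPI_MCA_mpi_leave_pinned=0
    $ mpirun -np 4 ./my_app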
On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett <al...@transpireinc.com> wrote:

> Hi:
> I recently tried to build my MPI application against OpenMPI 1.3.3. It
> worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs part way
> through. It does a fair amount of communication, but eventually it stops
> in a Send/Recv point-to-point exchange. If I turn off the openib btl, it
> runs to completion. Also, I built 1.3.3 with memchecker (which is very
> nice; thanks to everyone who worked on that!) and it runs to completion,
> even with openib active.
>
> Our cluster consists of dual dual-core Opteron boxes with Mellanox
> MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396 InfiniScale-III
> switch. We're running RHEL 4.8, which appears to include OFED 1.4. I've
> built everything using GCC 4.3.2. Here is the output from ibv_devinfo.
> "ompi_info --all" is attached.
>
> $ ibv_devinfo
> hca_id: mthca0
>         fw_ver:             1.1.0
>         node_guid:          0002:c902:0024:3284
>         sys_image_guid:     0002:c902:0024:3287
>         vendor_id:          0x02c9
>         vendor_part_id:     25204
>         hw_ver:             0xA0
>         board_id:           MT_03B0140002
>         phys_port_cnt:      1
>                 port:   1
>                         state:          active (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       1
>                         port_lmc:       0x00
>
> I'd appreciate any tips for debugging this.
> Thanks,
> Allen
>
> --
> Allen Barnett
> Transpire, Inc
> E-Mail: al...@transpireinc.com
> Skype: allenbarnett
> Ph: 518-887-2930
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
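(For reference, a minimal sketch of the workaround Allen describes above, i.e. running without the openib btl; "my_app" and the process count are placeholders for the real launch line:

    $ mpirun --mca btl ^openib -np 4 ./my_app

The "^" prefix tells Open MPI to exclude the listed btl components, so traffic falls back to the remaining transports such as tcp and sm.)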