Hi,

I'm having problems getting the MPIRandomAccess part of the HPCC benchmark to run with more than 32 processes per node (each node has 4 x AMD Opteron 6172, so 48 cores in total). Once I go past 32 processes I get errors like:

[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect
[compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-1-13.local:6137] *** An error occurred in MPI_Isend
[compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD
[compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list
[compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
[[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc] from compute-1-13.local to: compute-1-13 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278870912 opcode
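My (possibly wrong) reading of the openib BTL is that each entry in btl_openib_receive_queues becomes a separate queue pair for every remote peer of every local rank, so the per-node QP count grows very quickly at 48 ranks per node. A rough back-of-the-envelope, with the node and rank counts purely illustrative:

    # Back-of-the-envelope QP estimate - ASSUMES one QP per
    # receive_queues entry (4 entries here) per remote peer, per local rank.
    ppn=48            # ranks on this node
    total=192         # illustrative total, e.g. 4 nodes x 48 ranks
    qps=4             # the P,... entry plus three S,... entries
    echo $(( ppn * (total - 1) * qps ))    # -> 36672 QPs on one node

If that's anywhere near right, I can see why qp_create_one might eventually fail with "Cannot allocate memory", but I'd welcome a sanity check on the reasoning.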
I've tried changing btl_openib_receive_queues from

P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32

to

P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32

This lets the code run without the error, but it runs extremely slowly.
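For reference, I'm passing the modified string on the mpirun command line, roughly like this (the rank count, hostfile and hpcc path are placeholders rather than my exact invocation):

    # Illustrative launch; hostfile/paths are placeholders
    mpirun -np 192 -hostfile ./hosts \
        --mca btl openib,sm,self \
        --mca btl_openib_receive_queues \
        P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32 \
        ./hpcc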
I'm also seeing errors in dmesg, such as:

CPU 12:
Modules linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand powernow_k8 freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U) ib_core(U) igb 8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3980, comm: qib/12 Tainted: G 2.6.18-164.6.1.el5 #1
RIP: 0010:[<ffffffff80094409>]  [<ffffffff80094409>] tasklet_action+0x90/0xfd
RSP: 0018:ffff810c2f1bff40  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff810c2f1bff30
RDX: 0000000000000000 RSI: ffff81042f063400 RDI: ffffffff8030d180
RBP: ffff810c2f1bfec0 R08: 0000000000000001 R09: ffff8104aec2d000
R10: ffff810c2f1bff00 R11: ffff810c2f1bff00 R12: ffffffff8005dc8e
R13: ffff81042f063480 R14: ffffffff80077874 R15: ffff810c2f1bfec0
FS:  00002b20829592e0(0000) GS:ffff81042f186bc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b2080b70720 CR3: 0000000000201000 CR4: 00000000000006e0
Call Trace:
<IRQ>  [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb20>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
<EOI>  [<ffffffff800da30c>] __kmalloc+0x97/0x9f
 [<ffffffff88220d8b>] :ib_qib:qib_verbs_send+0xdb3/0x104a
 [<ffffffff80064b20>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff881f66ca>] :ib_qib:qib_make_rc_req+0xbb1/0xbbf
 [<ffffffff881f5b19>] :ib_qib:qib_make_rc_req+0x0/0xbbf
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff881f8aa1>] :ib_qib:qib_do_send+0x91a/0x950
 [<ffffffff8002e2e3>] __wake_up+0x38/0x4f
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff8004d7fb>] run_workqueue+0x94/0xe4
 [<ffffffff8004a043>] worker_thread+0x0/0x122
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a133>] worker_thread+0xf0/0x122
 [<ffffffff8008c3bd>] default_wake_function+0x0/0xe
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003297c>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287e>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Any thoughts on how to proceed? I'm running Open MPI 1.4.3 (compiled with gcc 4.1.2) with OFED 1.5.3.1.

Thanks,
Rob

--
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345