Hi,

I'm having problems getting the MPIRandomAccess part of the HPCC
benchmark to run with more than 32 processes on each node (each node has
4 x AMD 6172, so 48 cores in total). Once I go past 32 processes per
node I get errors like:

[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
 error creating qp errno says Cannot allocate memory
[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb]
 error in endpoint reply start connect
[compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (104)
[compute-1-13.local:6137] *** An error occurred in MPI_Isend
[compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD
[compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list
[compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
 error creating qp errno says Cannot allocate memory
[[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc]
 from compute-1-13.local to: compute-1-13 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 278870912 opcode 
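
For reference, I'm launching the benchmark along these lines (the
process count and hostfile name here are just illustrative, not the
exact job script):

    mpirun -np 48 --hostfile ./hosts ./hpcc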

I've tried changing btl_openib_receive_queues from
P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
to
P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32
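
For completeness, I'm setting the parameter on the mpirun command line
(launcher details omitted), i.e. something like:

    mpirun --mca btl_openib_receive_queues \
        P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32 \
        -np 48 ./hpcc

though as far as I know it could equally go in
$HOME/.openmpi/mca-params.conf.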

Doing this lets the code run without the error, but it then runs
extremely slowly. I'm also seeing errors in dmesg, such as this trace
involving the ib_qib driver:

CPU 12:
Modules linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4
ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns
ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables
ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand
powernow_k8 freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)
ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 xfrm_nalgo crypto_api
ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U)
mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh video hwmon backlight
sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport
joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U) ib_core(U) igb
8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod
dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3980, comm: qib/12 Tainted: G      2.6.18-164.6.1.el5 #1
RIP: 0010:[<ffffffff80094409>]  [<ffffffff80094409>] tasklet_action+0x90/0xfd
RSP: 0018:ffff810c2f1bff40  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff810c2f1bff30
RDX: 0000000000000000 RSI: ffff81042f063400 RDI: ffffffff8030d180
RBP: ffff810c2f1bfec0 R08: 0000000000000001 R09: ffff8104aec2d000
R10: ffff810c2f1bff00 R11: ffff810c2f1bff00 R12: ffffffff8005dc8e
R13: ffff81042f063480 R14: ffffffff80077874 R15: ffff810c2f1bfec0
FS:  00002b20829592e0(0000) GS:ffff81042f186bc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b2080b70720 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb20>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff800da30c>] __kmalloc+0x97/0x9f
 [<ffffffff88220d8b>] :ib_qib:qib_verbs_send+0xdb3/0x104a
 [<ffffffff80064b20>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff881f66ca>] :ib_qib:qib_make_rc_req+0xbb1/0xbbf
 [<ffffffff881f5b19>] :ib_qib:qib_make_rc_req+0x0/0xbbf
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff881f8aa1>] :ib_qib:qib_do_send+0x91a/0x950
 [<ffffffff8002e2e3>] __wake_up+0x38/0x4f
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff8004d7fb>] run_workqueue+0x94/0xe4
 [<ffffffff8004a043>] worker_thread+0x0/0x122
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a133>] worker_thread+0xf0/0x122
 [<ffffffff8008c3bd>] default_wake_function+0x0/0xe
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003297c>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287e>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Any thoughts on how to proceed?

I'm running Open MPI 1.4.3 (compiled with gcc 4.1.2) on OFED 1.5.3.1.

Thanks,
Rob
-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.hor...@qmul.ac.uk  -  +44 (0) 20 7882 7345
