Hi Ralph
Thank you.
I switched back to memlock unlimited, rebooted the nodes,
and after that OpenMPI is working correctly over Infiniband.
As for why the problem happened in the first place,
I can only think that the Infiniband kernel modules and
driver somehow didn't like my reducing the memlock limit
without restarting them or rebooting the nodes.
As you said, the problem may not have been
the smaller memlock limit.
Maybe Infiniband kept stale information about
the memory limits and failed because of that.
The jobs would run fine over TCP on Ethernet, with the
same memlock limit that made Infiniband fail.
I may try a less-than-unlimited memlock value later (tests are not
easy on production machines).
In any case, it is kind of hard to come up
with a sensible number
(RAM/number_of_cores? less? more? a magic value?).
Any suggestions are welcome, of course.
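Just to make the RAM/number_of_cores idea concrete, here is a tiny C sketch
(my own illustration, nothing from OMPI; whether the per-core figure is a
sensible starting point is exactly the open question) that prints that value
in kB, the unit limits.conf expects:

/* memlock_heuristic.c -- hypothetical sketch only: prints RAM/number_of_cores
 * in kB (the unit used by limits.conf), as one possible starting point for a
 * less-than-unlimited memlock value.  Not a recommendation.
 *
 * Compile: gcc -o memlock_heuristic memlock_heuristic.c
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES);       /* total physical memory pages */
    long page_size = sysconf(_SC_PAGE_SIZE);        /* bytes per page */
    long cores     = sysconf(_SC_NPROCESSORS_ONLN); /* online cores */

    if (pages < 0 || page_size < 0 || cores <= 0) {
        fprintf(stderr, "sysconf failed\n");
        return 1;
    }

    long long ram_kb = (long long)pages * page_size / 1024;
    printf("total RAM: %lld kB\n", ram_kb);
    printf("cores:     %ld\n", cores);
    printf("RAM/cores: %lld kB\n", ram_kb / cores);
    return 0;
}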
Thank you,
Gus Correa
On 08/12/2013 07:43 PM, Ralph Castain wrote:
Seems strange that it would have something to do with IB - it seems that alloc
itself is failing, and at only 512 bytes, that doesn't seem like something IB
would cause.
If you write a little program that calls alloc (no MPI), does it also fail?
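For example, a minimal sketch of that kind of test (no MPI, just plain
malloc/realloc with the same sizes the job reported) might be:

/* alloc_test.c -- minimal sketch of a no-MPI allocation test: call
 * malloc/realloc with the same small sizes the failing job reported.
 *
 * Compile: gcc -o alloc_test alloc_test.c
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(512);      /* same size as "unable to alloc 512 bytes" */
    if (p == NULL) {
        perror("malloc(512)");
        return 1;
    }

    void *q = realloc(p, 1600); /* same size as "unable to realloc 1600 bytes" */
    if (q == NULL) {
        perror("realloc(1600)");
        free(p);
        return 1;
    }

    printf("malloc(512) and realloc(1600) both succeeded\n");
    free(q);
    return 0;
}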
On Aug 12, 2013, at 3:35 PM, Gus Correa wrote:
Hi Ralph
Sorry if this is more of an IB than an OMPI problem,
but from my vantage point it shows up as OMPI jobs failing.
Yes, indeed I was setting memlock to unlimited in limits.conf
and in the pbs_mom, restarting everything, relaunching the job.
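(A small getrlimit program submitted as a batch job would be one way to
double-check the memlock limit the MPI processes actually inherit from
pbs_mom; a minimal sketch, not something I claim settles the matter:)

/* show_memlock.c -- sketch: print the soft/hard RLIMIT_MEMLOCK this process
 * inherits; run it under the batch system to see what pbs_mom hands out.
 *
 * Compile: gcc -o show_memlock show_memlock.c
 */
#include <stdio.h>
#include <sys/resource.h>

static void print_limit(rlim_t v)
{
    if (v == RLIM_INFINITY)
        printf("unlimited\n");
    else
        printf("%llu kB\n", (unsigned long long)v / 1024); /* bytes -> kB */
}

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("memlock soft limit: ");
    print_limit(rl.rlim_cur);
    printf("memlock hard limit: ");
    print_limit(rl.rlim_max);
    return 0;
}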
The error message changed, but the job still fails on Infiniband,
now complaining about the IB driver and also that it cannot
allocate memory.
Weird, because when I ssh to the node and run ibstat it
responds (see below, please).
I actually ran ibstat everywhere, and all IB host adapters seem OK.
Thank you,
Gus Correa
*** the job stderr ***
unable to alloc 512 bytes
Abort: Command not found.
unable to realloc 1600 bytes
Abort: Command not found.
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'ipathverbs':
libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot
allocate memory
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'ipathverbs':
libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot
allocate memory
[node15:29683] *** Process received signal ***
[node15:29683] Signal: Segmentation fault (11)
[node15:29683] Signal code: (128)
[node15:29683] Failing at address: (nil)
[node15:29683] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 29683 on node node15.cluster
exited on signal 11 (Segmentation fault).
--
[node15.cluster:29682] [[7785,0],0]-[[7785,1],2] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
*** ibstat on node15 ***
[root@node15 ~]# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.7.700
        Hardware version: b0
        Node GUID: 0x00259016284c
        System image GUID: 0x00259016284f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 11
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x00259016284d
                Link layer: IB
On 08/12/2013 05:29 PM, Ralph Castain wrote:
No, this has nothing to do with the registration limit.
For some reason, the system is refusing to create a thread -
i.e., it is pthread_create that is failing.
I have no idea what