Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-13 Thread Gus Correa

Hi Ralph

Thank you.
I switched back to memlock unlimited, rebooted the nodes,
and after that OpenMPI is working correctly with InfiniBand.

As for why the problem happened in the first place,
I can only think that the InfiniBand kernel modules and
driver didn't like my reducing the memlock limit
without restarting them or rebooting the nodes afterwards.
As you said, the problem may not have been
the smaller memlock limit itself.
Maybe InfiniBand kept stale information about
the memory limits and then failed.
The jobs would run fine over TCP on Ethernet, with the
same memlock setting that made InfiniBand fail.

I may try a less-than-unlimited memlock value later (tests are not
easy on production machines).
In any case, it is hard to come up
with a sensible number
(RAM/number_of_cores? less? more? a magic value?).
Any suggestions are welcome, of course.
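
For reference, a minimal sketch of how one could check the memlock limit
a process actually inherits on a node; the program and its name
(check_memlock.c) are only an illustration built on the standard
getrlimit call, not part of the original thread:

/* check_memlock.c: print the soft and hard RLIMIT_MEMLOCK values that
   this process inherited.  Compile with
   "gcc check_memlock.c -o check_memlock" and run it on a compute node,
   both interactively and under the batch system, to see the limit the
   IB stack would actually get. */
#include <stdio.h>
#include <sys/resource.h>

static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %lu kB\n", label, (unsigned long) (value / 1024));
}

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    print_limit("memlock soft limit", rl.rlim_cur);
    print_limit("memlock hard limit", rl.rlim_max);
    return 0;
}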

Thank you,
Gus Correa


On 08/12/2013 07:43 PM, Ralph Castain wrote:

Seems strange that it would have something to do with IB - it seems that alloc 
itself is failing, and at only 512 bytes, that doesn't seem like something IB 
would cause.

If you write a little program that calls alloc (no MPI), does it also fail?
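
A minimal non-MPI allocation test along those lines might look like the
sketch below (alloc_test.c is a made-up name; the sizes mirror the 512
and 1600 bytes reported in the job's stderr further down):

/* alloc_test.c: allocate and reallocate a couple of small blocks, as a
   sanity check that plain malloc/realloc work on the node without MPI. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(512);
    if (p == NULL) {
        fprintf(stderr, "unable to alloc 512 bytes\n");
        return 1;
    }
    p = realloc(p, 1600);
    if (p == NULL) {
        fprintf(stderr, "unable to realloc 1600 bytes\n");
        return 1;
    }
    printf("malloc/realloc of small blocks succeeded\n");
    free(p);
    return 0;
}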


On Aug 12, 2013, at 3:35 PM, Gus Correa  wrote:


Hi Ralph

Sorry if this is more of an IB problem than an OMPI problem,
but from where I sit it shows up as OMPI jobs failing.

Yes, indeed, I set memlock to unlimited in limits.conf
and in the pbs_mom, restarted everything, and relaunched the job.
The error message changes, but the job still fails on InfiniBand,
now complaining about the IB driver and also that it cannot
allocate memory.

Weird, because when I ssh to the node and run ibstat, it
responds (see below).
I actually ran ibstat everywhere, and all IB host adapters seem OK.

Thank you,
Gus Correa


*** the job stderr ***
unable to alloc 512 bytes
Abort: Command not found.
unable to realloc 1600 bytes
Abort: Command not found.
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to 
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to 
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed 
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed 
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'ipathverbs': 
libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot 
allocate memory
libibverbs: Warning: no userspace device-specific driver found for 
/sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to 
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to 
map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed 
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed 
to map segment from shared object: Cannot allocate memory
libibverbs: Warning: couldn't load driver 'ipathverbs': 
libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot 
allocate memory
[node15:29683] *** Process received signal ***
[node15:29683] Signal: Segmentation fault (11)
[node15:29683] Signal code:  (128)
[node15:29683] Failing at address: (nil)
[node15:29683] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 29683 on node node15.cluster 
exited on signal 11 (Segmentation fault).
--
[node15.cluster:29682] [[7785,0],0]-[[7785,1],2] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)


*** ibstat on node15 ***

[root@node15 ~]# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.7.700
    Hardware version: b0
    Node GUID: 0x00259016284c
    System image GUID: 0x00259016284f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 11
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x00259016284d
        Link layer: IB




On 08/12/2013 05:29 PM, Ralph Castain wrote:

No, this has nothing to do with the registration limit.
For some reason, the system is refusing to create a thread -
i.e., it is pthread_create that is failing.
I have no idea what 
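
A minimal thread-creation check (again just a sketch, not part of the
original exchange) to see whether pthread_create itself fails on the
node outside of MPI could look like:

/* thread_test.c: create and join a single thread.  Compile with
   "gcc thread_test.c -o thread_test -lpthread". */
#include <stdio.h>
#include <string.h>
#include <pthread.h>

static void *worker(void *arg)
{
    (void) arg;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, NULL);

    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        return 1;
    }
    pthread_join(tid, NULL);
    printf("pthread_create/join succeeded\n");
    return 0;
}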

[OMPI users] Finalize() does not return

2013-08-13 Thread Hazelrig, Chris CTR (US)
Greetings,

I am using OpenMPI 1.4.3-1.1.el6 on RedHawk Linux 6.0.1 (Glacier) / RedHat 
Enterprise Linux Workstation Release 6.1 (Santiago).  I am currently working 
through some issues that I encountered after upgrading from RedHawk 5.2 / RHEL 
5.2 and OpenMPI 1.4.3-1 (openmpi-gcc_1.4.3-1).  It seems that since the 
upgrades my software does not return from the call to the Finalize() routine.  
All threads enter the Finalize() routine and never return.  I wrote a simple 
test program to try to simplify troubleshooting and Finalize() works as 
expected, i.e., all threads return from the Finalize() call.  This suggests the 
problem is in my code.  I have searched the man pages and user forums to no 
avail.  Has anyone else encountered this problem?  What could cause such 
behavior?  I wondered if maybe there is still some prior communication that was 
left unfinished, but I believe I have verified that is not the case, plus my 
understanding of how Finalize() works is that it would error/exception out in 
such a situation rather than just sit there, but I could be wrong.

Not sure what additional information may be needed by the community to aid in 
troubleshooting, but will be happy to provide whatever else is needed.

Kind regards,
Chris Hazelrig
SimTech




Re: [OMPI users] Finalize() does not return

2013-08-13 Thread Gus Correa

Hi Chris

As you said, pending prior communication
is a candidate.
Another cause I have seen is MPI_Finalize inside a conditional,
where the condition may not be met by every rank:

if (condition) {
    MPI_Finalize();
}
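
A complete (contrived) sketch of that failure mode, in which one rank
skips MPI_Finalize and the other ranks typically sit in their own
MPI_Finalize call, or the runtime aborts the job:

/* finalize_mismatch.c: rank 0 returns without calling MPI_Finalize. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {          /* condition not met by rank 0 */
        printf("Rank %d before MPI_Finalize\n", rank);
        fflush(stdout);
        MPI_Finalize();       /* may never return while rank 0 is missing */
        printf("Rank %d after MPI_Finalize\n", rank);
    } else {
        printf("Rank 0 exiting without MPI_Finalize\n");
    }
    return 0;
}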

Regardless of the cause,
to check which ranks actually reach MPI_Finalize,
did you try an old-fashioned printf right before
MPI_Finalize?
Something like this:

printf("Rank %d before MPI_Finalize\n",myrank);
fflush(stdout);
MPI_Finalize();

or Fortran:

print *, 'Rank ', myrank, ' before MPI_Finalize'
call flush(6)
call MPI_Finalize(ierror)

My two cents,
Gus Correa


