In case it helps, Greg:
on the compute nodes I normally add this to /etc/security/limits.conf:
* - memlock -1
* - stack -1
* - nofile 32768
and
ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited
to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
It isn't really Torque that is imposing those constraints:
- the torque_mom initscript inherits from the OS whatever ulimits are
in effect at that time;
- each job inherits the ulimits from the pbs_mom.
Thus, you need to change the ulimits from whatever is set at
startup time, e.g., in /etc/sysc
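To confirm what actually reaches a job, you can print the inherited limits from inside a batch script. A minimal Python sketch (nothing Torque-specific assumed, just a working Python on the node):

```python
import resource

# The three limits discussed above; openib typically fails first on a
# low memlock (locked-memory) limit.
limits = {
    "memlock": resource.RLIMIT_MEMLOCK,
    "stack":   resource.RLIMIT_STACK,
    "nofile":  resource.RLIMIT_NOFILE,
}

def fmt(v):
    """Render a limit value the way ulimit does."""
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

for name, res in limits.items():
    soft, hard = resource.getrlimit(res)
    print(f"{name:8s} soft={fmt(soft)} hard={fmt(hard)}")
```

Run it once under qsub and once from an interactive shell; if the two disagree, the pbs_mom startup environment is the place to fix.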
+1
On Jun 11, 2014, at 6:01 PM, Ralph Castain wrote:
Yeah, I think we've seen that somewhere before too...
On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote:
Agreed. The problem is not with UDCM. I don't think something is wrong with
the system. I think his Torque is imposing major constraints on the maximum
size that can be locked into memory.
Josh
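The effect Josh describes is easy to reproduce: with a small RLIMIT_MEMLOCK, any attempt to pin memory beyond the limit fails with ENOMEM, which is the same wall the openib BTL hits when registering buffers. A minimal sketch calling mlock(2) through ctypes (illustrative only; the one-page buffer size is arbitrary):

```python
import ctypes
import ctypes.util
import resource

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("RLIMIT_MEMLOCK (soft):",
      "unlimited" if soft == resource.RLIM_INFINITY else soft)

# Try to pin a single 4 KiB page; pinning more than the soft limit
# allows would fail with ENOMEM, just like an oversized IB registration.
size = 4096
buf = ctypes.create_string_buffer(size)
rc = libc.mlock(buf, ctypes.c_size_t(size))
if rc == 0:
    print("mlock of one page succeeded")
    libc.munlock(buf, ctypes.c_size_t(size))
else:
    print("mlock failed, errno =", ctypes.get_errno())
```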
On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm wrote:
Probably won't help to use RDMACM though as you will just see the
resource failure somewhere else. UDCM is not the problem. Something is
wrong with the system. Allocating a 512 entry CQ should not fail.
-Nathan
On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>I'm guessing it's a
Okay, let me poke around some more. It is clearly tied to the coprocessors, but
I'm not yet sure just why.
One thing you might do is try the nightly 1.8.2 tarball - there have been a
number of fixes, and this may well have been caught there. Worth taking a look.
On Jun 11, 2014, at 6:44 AM, Da
I'm guessing it's a resource limitation issue coming from Torque.
Hmm... I found something interesting on the interwebs that looks awfully similar:
http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
Greg, if the suggestion from the Torque users doesn't resolve your iss
Mellanox --
What would cause a CQ to fail to be created?
On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." wrote:
Is there any other work around that I might try? Something that avoids UDCM?
-----Original Message-----
From: Fischer, Greg A.
Sent: Tuesday, June 10, 2014 2:59 PM
To: Nathan Hjelm
Cc: Open MPI Users; Fischer, Greg A.
Subject: RE: [OMPI users] openib segfaults with Torque
[binf316:fischega] $ ul
Sorry - it crashes with both torque and rsh launchers. The output from
a gdb backtrace on the core files looks identical.
Dan
On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote:
Afraid I'm a little confused now - are you saying it works fine under Torque,
but segfaults under rsh? Could you please clarify your current situation?
On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote:
It looks like it is still segfaulting with the rsh launcher:
ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
-np 4 -machinefile ./nodes ./hello
[conte-a084:51113] *** Process received signal ***
[conte-a084:51113] Signal: Segmentation fault (11)
[conte-a084:51113] Signal code: