In case it helps, Greg:
on the compute nodes I normally add this to /etc/security/limits.conf:
* - memlock -1
* - stack -1
* - nofile 32768
and
ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited
to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
It isn't really Torque that is imposing those constraints:
- the torque_mom initscript inherits from the OS whatever ulimits are
in effect at that time;
- each job inherits the ulimits from the pbs_mom.
Thus, you need to change the ulimits from whatever is set at
startup time, e.g., in /etc/sysc
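To confirm what actually reaches a job, you can print the inherited limits from inside a batch script. A minimal Python sketch (nothing Torque-specific assumed, just a working Python on the node):

```python
import resource

# The three limits discussed above; openib typically fails first on a
# low memlock (locked-memory) limit.
limits = {
    "memlock": resource.RLIMIT_MEMLOCK,
    "stack":   resource.RLIMIT_STACK,
    "nofile":  resource.RLIMIT_NOFILE,
}

def fmt(v):
    """Render a limit value the way ulimit does."""
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

for name, res in limits.items():
    soft, hard = resource.getrlimit(res)
    print(f"{name:8s} soft={fmt(soft)} hard={fmt(hard)}")
```

Run it once under qsub and once from an interactive shell; if the two disagree, the pbs_mom startup environment is the place to fix.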
+1
On Jun 11, 2014, at 6:01 PM, Ralph Castain wrote:
Yeah, I think we've seen that somewhere before too...
On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote:
Agreed. The problem is not with UDCM. I don't think something is wrong with
the system. I think his Torque is imposing major constraints on the maximum
size that can be locked into memory.
Josh
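The effect Josh describes is easy to reproduce: with a small RLIMIT_MEMLOCK, any attempt to pin memory beyond the limit fails with ENOMEM, which is the same wall the openib BTL hits when registering buffers. A minimal sketch calling mlock(2) through ctypes (illustrative only; the one-page buffer size is arbitrary):

```python
import ctypes
import ctypes.util
import resource

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("RLIMIT_MEMLOCK (soft):",
      "unlimited" if soft == resource.RLIM_INFINITY else soft)

# Try to pin a single 4 KiB page; pinning more than the soft limit
# allows would fail with ENOMEM, just like an oversized IB registration.
size = 4096
buf = ctypes.create_string_buffer(size)
rc = libc.mlock(buf, ctypes.c_size_t(size))
if rc == 0:
    print("mlock of one page succeeded")
    libc.munlock(buf, ctypes.c_size_t(size))
else:
    print("mlock failed, errno =", ctypes.get_errno())
```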
On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm wrote:
Probably won't help to use RDMACM though as you will just see the
resource failure somewhere else. UDCM is not the problem. Something is
wrong with the system. Allocating a 512 entry CQ should not fail.
-Nathan
On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>I'm guessing it's a
Okay, let me poke around some more. It is clearly tied to the coprocessors, but
I'm not yet sure just why.
One thing you might do is try the nightly 1.8.2 tarball - there have been a
number of fixes, and this may well have been caught there. Worth taking a look.
On Jun 11, 2014, at 6:44 AM, Da
I'm guessing it's a resource limitation issue coming from Torque.
Hmm... I found something interesting on the interwebs that looks awfully similar:
http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
Greg, if the suggestion from the Torque users doesn't resolve your iss
Mellanox --
What would cause a CQ to fail to be created?
On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." wrote:
Is there any other work around that I might try? Something that avoids UDCM?
-----Original Message-----
From: Fischer, Greg A.
Sent: Tuesday, June 10, 2014 2:59 PM
To: Nathan Hjelm
Cc: Open MPI Users; Fischer, Greg A.
Subject: RE: [OMPI users] openib segfaults with Torque
[binf316:fischega] $ ul
Sorry - it crashes with both torque and rsh launchers. The output from
a gdb backtrace on the core files looks identical.
Dan
On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote:
Afraid I'm a little confused now - are you saying it works fine under Torque,
but segfaults under rsh? Could you please clarify your current situation?
On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote:
It looks like it is still segfaulting with the rsh launcher:
ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
-np 4 -machinefile ./nodes ./hello
[conte-a084:51113] *** Process received signal ***
[conte-a084:51113] Signal: Segmentation fault (11)
[conte-a084:51113] Signal code: