it mentions trying to kill the process due to an exhausted
> wallclock time?
>
> -- Reuti
>
>
> > Am 28.01.2020 um 03:50 schrieb Derrick Lin :
> >
> > Hi Reuti
> >
> > No, we haven't configured qlogin, rlogin specifically, so their settings
ltin
rsh_daemon builtin
Cheers,
Derrick
On Fri, Jan 24, 2020 at 11:26 PM Reuti wrote:
> Hi,
>
> > Am 24.01.2020 um 04:26 schrieb Derrick Lin :
> >
> > Hi guys,
> >
> > We have set a h_rt limit to be 48 hours in the queue, it seems that this
Hi guys,
We have set an h_rt limit of 48 hours in the queue, but it seems that this
limit is applied to normal qsub jobs only. Now I have a few QRSH/QRLOGIN
sessions that have been live on the compute nodes for much longer than 48 hours.
I am wondering if this is a known issue?
I am running open source version
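For reference, a quick way to confirm what a queue currently enforces (the queue name below is only an example):

qconf -sq all.q | egrep '^[sh]_rt'
# per the 48-hour setting described above this should show something like:
# s_rt                  INFINITY
# h_rt                  48:00:00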
c//limits)? It's possible that the user who ran the
> execd init script had limits applied, which would carry over to the execd
> process.
>
> On Wed, Jul 03, 2019 at 12:36:00PM +1000, Derrick Lin wrote:
> > Hi guys,
> >
> > We have custom settings for user open file
Hi guys,
We have custom settings for user open files in /etc/security/limits.conf on
all compute nodes. When checking whether the configuration is effective with
"ulimit -a" by SSHing to each node, it reflects the correct settings,
but when we ran the same command through SGE (both qsub and qrsh), we found
t
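A side-by-side comparison usually shows where the difference comes from; the node name below is only an example:

# limits seen by an interactive SSH login
ssh node01 'ulimit -n -u'

# limits seen by a job started under SGE on the same node
qrsh -l hostname=node01 'ulimit -n -u'

# limits of the running sge_execd itself, which its child jobs inherit
ssh node01 'cat /proc/$(pgrep -o sge_execd)/limits'

If the execd was started from a shell that still had the old limits (as the quoted reply above suggests), restarting sge_execd after fixing limits.conf is usually enough.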
n general, you should be able to access it from there?
> >
> > (Note that you can also tell qacct where the accounting file lives - it
> > assumes a default location, but the file does not have be in that
> location.)
> >
> > Tina
> >
> > On 20/02/2019 07:09, Reut
Am 20.02.2019 um 05:31 schrieb Derrick Lin :
> >
> > Hi guys,
> >
> > On our SGE cluster, the accounting file stored on the qmaster node and
> is not accessible outside. qmaster node is not accessible by any user
> either.
> >
> > Now we have users request to
Hi guys,
On our SGE cluster, the accounting file is stored on the qmaster node and is
not accessible from outside. The qmaster node is not accessible by any user either.
Now we have users requesting to obtain accounting info via qacct. I am
wondering what the common way is to achieve this without giving access
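One common approach is to copy (or export read-only) the accounting file to a
location users can reach and point qacct at it; the path below is just an example:

qacct -f /shared/sge/accounting -o            # usage summary per user
qacct -f /shared/sge/accounting -j <job_id>   # details for a single job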
um 23:35 schrieb Derrick Lin:
>
> > Hi Reuti,
> >
> > I have to say I am still not familiar with the "-i" in qsub after
> reading the man page, what does it do?
>
> It will be fed as stdin to the jobscript. Hence:
>
> $ qsub -i myfile foo.sh
>
>
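So a jobscript along these lines (a made-up example) would see the contents of
myfile on its standard input:

#!/bin/bash
#$ -S /bin/bash
# foo.sh - consume whatever qsub -i redirected to the job's stdin
while read -r line; do
    echo "got: $line"
done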
on SGE 8.1.9 not SGE 2011.11p1.
Maybe it is worthwhile to mention that the new SGE cluster is CentOS7 based
and the old one is CentOS6. Not sure if this also matters
Cheers,
Derrick
On Thu, Jan 10, 2019 at 9:39 AM Derrick Lin wrote:
> Hi Reuti and Iyad,
>
> Here is my prolog script
luffy,
fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1,
fd_pipe_to_child = 5
### more lines omitted
On Thu, Jan 10, 2019 at 9:35 AM Derrick Lin wrote:
> Hi Reuti,
>
> I have to say I am still not familiar with the "-i" in qsub after reading
> the ma
urther messages are in "error" and
"trace"
01/10/2019 09:20:22 [6782:315345]: using stdout as stderr
01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782
01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep 10m
"
a
$JOB_ID' $SGE_TMP_ROOT"
xfs_quota_rc=0
# put the job's TMPDIR into an XFS project named after the job ID
/usr/sbin/xfs_quota -x -c "project -s -p $TMP $JOB_ID" $SGE_TMP_ROOT
((xfs_quota_rc+=$?))
# set the hard block quota for that project
/usr/sbin/xfs_quota -x -c "limit -p bhard=$quota $JOB_ID" $SGE_TMP_ROOT
((xfs_quota_rc+=$?))
if [ $xfs_quota_rc -eq 0 ]; then
    exit
ccessfully scheduled.
Your interactive job 18 has been successfully scheduled.
But the symptom at the backend compute node is the same: the prolog log file
is generated but is empty.
Cheers,
Derrick
On Wed, Jan 9, 2019 at 11:14 AM Derrick Lin wrote:
> Hi guys,
>
> I just brought up a new SGE c
Hi guys,
I just brought up a new SGE cluster, but somehow the qrsh session does not
work:
tester@login-gpu:~$ qrsh
^Cerror: error while waiting for builtin IJS connection: "got select
timeout"
After I hit Enter, the session just gets stuck there forever instead of bringing
me to a compute node. I have
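One generic thing worth checking in this situation is which interactive-job
mechanism is actually configured:

qconf -sconf | egrep '(qlogin|rlogin|rsh)_(command|daemon)'
# with the builtin IJS all six entries normally read 'builtin'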
3:49 AM Reuti wrote:
>
> > Am 06.12.2018 um 23:52 schrieb Derrick Lin :
> >
> > Hi all,
> >
> > We are switching to a cluster of CentOS7 with SGE 8.1.9 installed.
> >
> > We have a prolog script that does XFS disk space allocation according to
>
.all.q
200+0 records in
200+0 records out
107374182400 bytes (107 GB) copied, 73.2242 s, 1.5 GB/s
So basically only the prolog script has a problem.
Cheers,
Derrick
On Fri, Dec 7, 2018 at 9:52 AM Derrick Lin wrote:
> Hi all,
>
> We are switching to a cluster of CentOS7 with SGE 8.1.9 insta
Hi all,
We are switching to a cluster of CentOS7 with SGE 8.1.9 installed.
We have a prolog script that does XFS disk space allocation according to
TMPDIR.
However, the prolog script does not receive TMPDIR, which should be created
by the scheduler.
Other variables such as JOB_ID and PE_HOSTFILE ar
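A throwaway debug prolog like the one below (log path arbitrary) makes it easy
to see exactly which variables the prolog receives:

#!/bin/bash
# debug prolog: record the environment the prolog really gets,
# so a missing TMPDIR shows up immediately
{
    date
    echo "JOB_ID=$JOB_ID TMPDIR=$TMPDIR TMP=$TMP PE_HOSTFILE=$PE_HOSTFILE"
    env | sort
} >> "/tmp/prolog_env.$JOB_ID" 2>&1
exit 0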
ve.
We don't see any storage related contention.
I am more interested in knowing where this process
bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 comes from?
Cheers,
On Wed, Aug 8, 2018 at 6:53 PM, Reuti wrote:
>
> > Am 08.08.2018 um 08:15 schrieb Derrick L
Hi guys,
I have a user who reported his jobs stuck, running for much longer than usual.
So I went to the exec host and checked the processes; all processes owned by that
user look like:
`- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
In qstat, it still shows job is in running state.
Thanks guys, I will take a look at each option.
On Mon, Aug 6, 2018 at 9:52 PM, William Hay wrote:
> On Wed, Aug 01, 2018 at 11:06:19AM +1000, Derrick Lin wrote:
> >HI Reuti,
> >The prolog script is set to run by root indeed. The xfs quota requires
> >root pri
/default/spool/omega-1-27/active_jobs/1187086.1/addgrpid: No
such file or directory
Maybe some of my scheduler conf is not correct?
Regards,
Derrick
On Mon, Jul 30, 2018 at 7:35 PM, Reuti wrote:
>
> > Am 30.07.2018 um 02:31 schrieb Derrick Lin :
> >
> > Hi Reuti,
> >
On Sat, Jul 28, 2018 at 11:53 AM, Reuti wrote:
>
> > Am 28.07.2018 um 03:00 schrieb Derrick Lin :
> >
> > Thanks Reuti,
> >
> > I know little about the group ID created by SGE, and am also pretty much
> confused about the Linux group ID.
>
> Yes, SGE assigns a conv
07.2018 um 03:14 schrieb Derrick Lin:
>
> > We are using $JOB_ID as xfs_projid at the moment, but this approach
> introduces a problem for array jobs whose tasks have the same $JOB_ID (with
> different $TASK_ID).
> >
> > Also it is possible that tasks from two different array j
$TASK_ID on
the same host cannot be maintained.
That's why I am trying to make the xfs_projid independent of SGE.
On Thu, Jul 26, 2018 at 9:27 PM, Reuti wrote:
> Hi,
>
> > Am 26.07.2018 um 06:01 schrieb Derrick Lin :
> >
> > Hi all,
> >
Hi all,
I am working on a prolog script which sets up an xfs quota on disk space on a
per-job basis.
For setting up an xfs quota on a subdirectory, I need to provide a project ID.
Here is how I generate the project ID:
XFS_PROJID_CF="/tmp/xfs_projid_counter"
echo $JOB_ID >> $XFS_PROJID_CF
xfs_projid=$(w
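A sketch of one way to hand out project IDs independently of SGE so that array
tasks cannot collide; untested, and the counter/lock paths are arbitrary:

XFS_PROJID_CF="/tmp/xfs_projid_counter"

# read-increment-write under an exclusive flock, so concurrent prologs
# on the same host cannot hand out the same project ID
xfs_projid=$(
    (
        flock -x 9
        count=$(cat "$XFS_PROJID_CF" 2>/dev/null || echo 0)
        count=$((count + 1))
        echo "$count" > "$XFS_PROJID_CF"
        echo "$count"
    ) 9>"$XFS_PROJID_CF.lock"
)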
Hi all,
I have one user's qrlogin/qrsh session started on 4th April, but it is
still active today (10th April)
---------------------------------------------------------------------
short.q@node-5-3.local      BIP   0/16/56   12.87   linux-x64   d
   1017237 10.00965 QRLOGIN
euti wrote:
> Hi,
>
> Am 02.11.2017 um 11:39 schrieb Derrick Lin:
>
> > Hi Reuti,
> >
> > One of the users indicates -S was used in his job:
> >
> > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S
> /bin/bash -t 1-11 -v mem_requested=12
ew of them have much higher maxvmem value?
Regards,
Derrick
On Thu, Nov 2, 2017 at 6:17 PM, Reuti wrote:
> Hi,
>
> > Am 02.11.2017 um 04:54 schrieb Derrick Lin :
> >
> > Dear all,
> >
> > Recently, I have users reported some of their jobs failed silently. I
>
Dear all,
Recently, I have users reported some of their jobs failed silently. I
picked one up and check, found:
11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit
"h_vmem" of queue "short.q@delta-5-3.local" (8942456832.0 >
limit:8589934592.0) - sending SIGKILL
[root
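The kill also shows up afterwards in the accounting record, e.g.:

qacct -j 610608 | egrep 'failed|exit_status|maxvmem|ru_wallclock'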
Try
qconf -de host_list
Cheers,
On Thu, Sep 7, 2017 at 3:22 AM, Michael Stauffer wrote:
> On Wed, Sep 6, 2017 at 12:42 PM, Reuti wrote:
>
>>
>> > Am 06.09.2017 um 17:33 schrieb Michael Stauffer :
>> >
>> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang
>> wrote:
>> > It seems SGE master did not
mable value).
I will experiment with it a bit more.
Cheers,
D
On Tue, Oct 4, 2016 at 5:49 PM, Reuti wrote:
>
> Am 04.10.2016 um 03:41 schrieb Derrick Lin:
>
> > Hi all again,
> >
> > I have had a simple implementation working. Now I need to look at a
> situation when
situation, should the quota be created in a script
that is specified in start_proc_args instead of the prolog?
Thanks
On Tue, Sep 13, 2016 at 5:51 PM, William Hay wrote:
> On Tue, Sep 13, 2016 at 03:15:19PM +1000, Derrick Lin wrote:
> >Thanks guys,
> >I am implementing
Hi all,
I have tried the SGE conf backup solution based on inst_sge, but found that
it seems to back up the original components only.
None of my custom queues, PEs and usersets are backed up...
Is it configurable so that it can back up **everything**?
Cheers,
Derrick
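If inst_sge's backup misses the custom objects, a brute-force dump of
everything with qconf covers them; a rough sketch (add or drop object types as
needed):

#!/bin/bash
BK=sge_backup_$(date +%Y%m%d)
mkdir -p "$BK"/{queues,pe,usersets,hostgroups,rqs}

for q in $(qconf -sql);    do qconf -sq    "$q" > "$BK/queues/$q";     done
for p in $(qconf -spl);    do qconf -sp    "$p" > "$BK/pe/$p";         done
for u in $(qconf -sul);    do qconf -su    "$u" > "$BK/usersets/$u";   done
for h in $(qconf -shgrpl); do qconf -shgrp "$h" > "$BK/hostgroups/$h"; done
for r in $(qconf -srqsl);  do qconf -srqs  "$r" > "$BK/rqs/$r";        done
qconf -ssconf > "$BK/sched_conf"
qconf -sconf  > "$BK/global_conf"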
Thanks guys,
I am implementing the solution as outlined by William, except we are using
XFS here, so we are trying to do it by using XFS's project/directory quota.
Will do more testing and see how it goes..
Cheers,
Derrick
On Fri, Sep 9, 2016 at 11:05 PM, William Hay wrote:
> On Fri, Sep 09, 2
ested=100G), can the prolog script simply use that value?
Cheers,
D
On Thu, Sep 8, 2016 at 11:00 PM, William Hay wrote:
> On Thu, Sep 08, 2016 at 10:10:51AM +1000, Derrick Lin wrote:
> >Hi all,
> >Each of our execution nodes has a scratch space mounted as
> /scratch
08, 2016 at 10:10:51AM +1000, Derrick Lin wrote:
> >Hi all,
> >Each of our execution nodes has a scratch space mounted as
> /scratch_local.
> >I notice there is tmpdir variable can be changed in a queue's conf.
> >According to doc, SGE will create a per j
Hi all,
Each of our execution nodes has a scratch space mounted as /scratch_local.
I notice there is a tmpdir variable that can be changed in a queue's conf.
According to the docs, SGE will create a per-job dir under tmpdir and set the
path in the TMPDIR and TMP variables.
I have set up a complex tmp_requested which a job c
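A rough, untested way for the prolog to pick up the job's own request for that
complex (falling back to a default when nothing was requested):

# inside the prolog; tmp_requested is the custom complex mentioned above
tmp_req=$(qstat -j "$JOB_ID" | sed -n 's/.*tmp_requested=\([^,]*\).*/\1/p' | head -1)
tmp_req=${tmp_req:-10G}
echo "job $JOB_ID requested $tmp_req of local scratch"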
, 2015 at 7:19 PM, Reuti wrote:
> Hi,
>
> > Am 08.09.2015 um 09:23 schrieb Derrick Lin :
> >
> > Hi guys,
> >
> > Thanks for the helps. I ran the SGE tools on the qmaster, and found the
> issue:
> >
> > [root@alpha01 lx26-amd64]# ./gethostname
>
all I have
made changes in the qmaster recently.
Where should I be looking to fix this issue?
Regards,
Derrick
On Mon, Sep 7, 2015 at 3:04 PM, Reuti wrote:
>
> Am 07.09.2015 um 00:36 schrieb Derrick Lin:
>
> > Hi Simon,
> >
> > It looks normal:
> >
, Simon Matthews
wrote:
> What does the rDNS show for the IP address of alpha01.local?
>
> Simon
>
> On Thu, Sep 3, 2015 at 6:44 PM, Derrick Lin wrote:
> > Dear all,
> >
> > I have been having issue on executing all SGE commands on the qmaster,
> > typica
Dear all,
I have been having issues executing all SGE commands on the qmaster;
typically, it gives this error:
[root@alpha01 ~]# qconf -sc
error: commlib error: access denied (client IP resolved to host name
"alpha01.local". This is not identical to clients host name
"omega-0-12.local")
DNS is
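One way to narrow this down is to check the names from SGE's point of view
(the arch directory may differ on other installations):

cd $SGE_ROOT/utilbin/lx26-amd64
./gethostname                      # the name SGE believes this host has
./gethostbyname alpha01.local      # forward lookup as SGE sees it
./gethostbyaddr <qmaster IP>       # reverse lookup as SGE sees it

If the machine legitimately answers to both names, they can be listed on one
line in $SGE_ROOT/default/common/host_aliases, for example:

omega-0-12.local alpha01.local

with the daemons restarted afterwards.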
avoid this -
> the qlogin_daemon command is 'sshd -i -f path_to_config'.
>
> The 'standard' sshd.conf on the nodes does not allow login for users, but
> the one the qlogin_daemon points to does.
>
> Tina
>
>
> On 30/09/14 02:59, Derrick Lin wrote:
>
Hi guys,
I am trying to configure SSH as the underlying protocol for qrsh and qlogin.
However, this requires allowing users to SSH into compute nodes. In that
case, users can simply go to compute nodes with SSH, bypassing SGE (qrsh,
qlogin etc).
I am wondering what the best way is to configure SSH to servi
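A sketch of the scheme described above, with made-up paths and group names:
SGE starts its own sshd with a permissive config while the system sshd on
port 22 stays locked down.

# interactive parameters in the cluster configuration (qconf -mconf):
#   rsh_command     /usr/bin/ssh
#   rsh_daemon      /usr/sbin/sshd -i -f /etc/ssh/sshd_config.sge
#   rlogin_daemon   /usr/sbin/sshd -i -f /etc/ssh/sshd_config.sge
#   qlogin_daemon   /usr/sbin/sshd -i -f /etc/ssh/sshd_config.sge

# /etc/ssh/sshd_config (the daemon listening on port 22) keeps users out:
#   AllowGroups sysadmin

# /etc/ssh/sshd_config.sge (only ever started by sge_execd, never on port 22):
#   AllowGroups sysadmin users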
> Hi,
>
> Am 26.08.2014 um 07:17 schrieb Derrick Lin:
>
> > I currently have one RQS that defines default slots quota for every user.
> >
> > Now I want to add new quota for specific userset (Department), I can
> either add a new limit rule inside the existing RQS
Hi all,
I currently have one RQS that defines a default slot quota for every user.
Now I want to add a new quota for a specific userset (department); I can either
add a new limit rule inside the existing RQS or create a new, separate RQS.
I am wondering what the difference is between them?
Regards,
D
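Within one RQS the limit rules are evaluated top to bottom and only the first
matching rule applies, whereas separate RQSs are all evaluated and the most
restrictive result wins. A purely illustrative single-RQS version (the userset
name is invented), where the department rule has to come before the catch-all:

{
   name         slot_limits
   description  department limit plus default per-user limit
   enabled      TRUE
   limit        users @dept_bioinf to slots=256
   limit        users {*} to slots=64
}

Put the department limit in a separate RQS instead if department members should
also stay under the per-user default.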
Hi guys,
My cluster has several ACL-type usersets for controlling queue access
permissions. Recently I added the functional policy to the cluster, so I
changed the type from ACL to DEPT.
Now I found that the queue access permission is no longer working as
before. The access_list manual doesn't
t 7:35 PM, Reuti wrote:
> Am 01.08.2014 um 01:39 schrieb Derrick Lin:
>
> > Do you have
> >
> > params    MONITOR=1 ??
>
> No:
>
> $ qconf -ssconf
> ...
> params    none
>
>
> > This is what gave
Do you have
params    MONITOR=1 ??
This is what gave me the same error.
I am running GE 6.2u5 as well
D
On Thu, Jul 31, 2014 at 8:53 PM, Reuti wrote:
> Am 31.07.2014 um 03:06 schrieb Derrick Lin:
>
> > Hi Reuti,
> >
> > That's interest
rs b** queues all.q
Is it illegal to set h_vmem in a per-user quota in the first place?
Cheers,
D
On Wed, Jul 30, 2014 at 4:37 PM, Reuti wrote:
> Hi,
>
> Am 30.07.2014 um 03:29 schrieb Derrick Lin:
>
> > **No** initial value per queue instance, I force the users to specify
>
hosts.
My original issue was that, when I set params MONITOR=1, jobs failed to start.
Now that I have MONITOR=1 removed, all jobs start and run fine. Any ideas?
D
On Tue, Jul 29, 2014 at 7:43 PM, Reuti wrote:
> Hi,
>
> Am 29.07.2014 um 06:07 schrieb Derrick Lin:
>
> > This is qhost of o
rote:
> Hi,
>
> Am 04.07.2014 um 06:04 schrieb Derrick Lin:
>
> > Interestingly, I have a small test cluster that basically has the same
> SGE setup and does *not* have such a problem. h_vmem in the complex is exactly
> the same. The test queue instance looks almost the same (except
0
Jobs start and run fine. Can anyone explain why these settings are related
to job resource requests?
Cheers,
Derrick
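For reference, a consumable h_vmem setup typically has three pieces; the
values below are only placeholders:

# 1. complex definition (qconf -mc): requestable and consumable
#    h_vmem   h_vmem   MEMORY   <=   YES   YES   0   0
# 2. capacity per execution host (qconf -me <node>):
#    complex_values   h_vmem=512G
# 3. optionally a limit on the queue (qconf -mq intel.q):
#    h_vmem   8G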
On Fri, Jul 4, 2014 at 2:04 PM, Derrick Lin wrote:
> Interestingly, I have a small test cluster that basically have the same
> SGE setup does *not* hav
ed in exechost level.
Derrick
On Fri, Jul 4, 2014 at 1:58 PM, Derrick Lin wrote:
> Hi all,
>
> We start using h_vmem to control jobs by their memory usage. However jobs
> couldn't start when there is -l h_vmem. The reason is
>
> (-l h_vmem=1G) cannot run in queue "
Hi all,
We started using h_vmem to control jobs by their memory usage. However, jobs
couldn't start when -l h_vmem is requested. The reason is:
(-l h_vmem=1G) cannot run in queue "intel.q@delta-5-1.local" because job
requests unknown resource (h_vmem)
However, h_vmem is definitely on the queue instance:
Hi Arnau,
Indeed! It affects only the nodes that have this file! Problem solved.
Thanks
D
On Wed, Jul 2, 2014 at 6:57 PM, Arnau Bria wrote:
> On Wed, 2 Jul 2014 11:52:05 +1000
> Derrick Lin wrote:
>
> > Hi all,
> Hi,
> [...]
> > mem_requested=1G in SGE_ROOT/defaul
Hi all,
I have one application that relies on a custom Complex attr called
"mem_requested", and configured for all compute nodes:
$ qhost -F mem_requested
HOSTNAME     ARCH    NCPU  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
--------------------------------------------------------------------
> For example, if the user requests 400GB of RAM, the JSV will compute
> 400/8 = 50 cores, and then rewrite the job as a request for 50 cores
> as well. This will decrease the user's available slots to 142.
>
> Ian
>
>> On Mon, Jun 30, 2014 at 1:48 PM, Derrick Li
-
> 400GB is in use (and 50 cores are also "in use" even though 49 are
> idle), and other jobs either run somewhere else, or queue up.
>
> Ian
>
> On Mon, Jun 30, 2014 at 12:01 PM, Michael Stauffer wrote:
>>> Message: 4
>>> Date: Mon, 30 Jun 2014 11:5
Hi guys,
A typical node on our cluster has 64 cores and 512GB memory, so it's about
8GB/core. Occasionally, we have some jobs that utilize only 1 core but
400-500GB of memory, which annoys lots of users. So I am seeking a way to
force jobs to run strictly below the 8GB/core ratio, or it should b
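Along the lines suggested above, a JSV can enforce the ratio at submission
time. The sketch below is untested; it uses the shell helpers shipped in
$SGE_ROOT/util/resources/jsv/jsv_include.sh and assumes the memory request is
given in whole gigabytes via -l h_vmem and that a parallel environment was
requested (the 8 GB/core figure is hard-coded):

#!/bin/bash
. "$SGE_ROOT/util/resources/jsv/jsv_include.sh"

jsv_on_start()
{
   :
}

jsv_on_verify()
{
   mem=$(jsv_sub_get_param l_hard h_vmem)   # e.g. "400G"
   slots=$(jsv_get_param pe_max)            # requested slots, empty if no PE

   if [ -n "$mem" ] && [ -n "$slots" ]; then
      mem_gb=${mem%G}                       # assumes the value ends in G
      need=$(( (mem_gb + 7) / 8 ))          # ceil(mem / 8 GB per core)
      if [ "$need" -gt "$slots" ]; then
         # raise the slot request so memory and cores stay in the 8G/core ratio
         jsv_set_param pe_min "$need"
         jsv_set_param pe_max "$need"
         jsv_correct "slots raised to $need to keep 8G/core"
         return
      fi
   fi
   jsv_accept "job accepted"
}

jsv_main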