Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-02-03 Thread Derrick Lin
it mentions to try to kill the process due to an exhausted > wallclock time? > > -- Reuti > > > > Am 28.01.2020 um 03:50 schrieb Derrick Lin : > > > > Hi Reuti > > > > No, we haven't configured qlogin, rlogin specifically, so their settings

Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-27 Thread Derrick Lin
ltin rsh_daemon builtin Cheers, Derrick On Fri, Jan 24, 2020 at 11:26 PM Reuti wrote: > Hi, > > > Am 24.01.2020 um 04:26 schrieb Derrick Lin : > > > > Hi guys, > > > > We have set a h_rt limit to be 48 hours in the queue, it seems that this &

[gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-23 Thread Derrick Lin
Hi guys, We have set an h_rt limit of 48 hours in the queue, but it seems that this limit is applied to normal qsub jobs only. Now I have a few QRSH/QRLOGIN sessions that have been live on the compute nodes for much longer than 48 hours. I am wondering if this is a known issue? I am running open source version
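A quick way to test whether h_rt is enforced for interactive jobs (a sketch; the queue name short.q is an assumption):

    # Confirm the queue-level limit is what you expect
    qconf -sq short.q | grep h_rt          # expect: h_rt  48:00:00
    # Request a deliberately short wallclock; the session should be killed
    qrsh -l h_rt=0:1:0 sleep 120           # expect SIGKILL after ~60s if enforcement works

If the explicit request is honoured but the queue default is not, the problem is specific to how the limit is attached to interactive jobs.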

Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-15 Thread Derrick Lin
c//limits)? It's possible that the user who ran the > execd init script had limits applied, which would carry over to the execd > process. > > On Wed, Jul 03, 2019 at 12:36:00PM +1000, Derrick Lin wrote: > > Hi guys, > > > > We have custom settings for user open file

[gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-02 Thread Derrick Lin
Hi guys, We have custom settings for user open files in /etc/security/limits.conf on all compute nodes. When checking whether the configuration is effective with "ulimit -a" over SSH to each node, it reflects the correct settings, but when we ran the same command through SGE (both qsub and qrsh), we found t
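The difference usually comes down to how the process was started: an SSH login passes through PAM, so limits.conf applies, while an SGE job inherits the limits of the sge_execd that spawned it. A sketch for comparing the three (node01 is an assumption):

    # Limits seen by an SSH login (PAM applies limits.conf)
    ssh node01 'ulimit -n'
    # Limits seen by an SGE job (inherited from sge_execd, which bypasses PAM)
    qrsh -l hostname=node01 bash -c 'ulimit -n'
    # Limits of the running execd itself
    cat /proc/$(pgrep -x sge_execd)/limits | grep 'open files'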

Re: [gridengine users] Accessing qacct accounting file from login/compute nodes

2019-02-20 Thread Derrick Lin
n general, you should be able to access it from there? > > > > (Note that you can also tell qacct where the accounting file lives - it > > assumes a default location, but the file does not have be in that > location.) > > > > Tina > > > > On 20/02/2019 07:09, Reut

Re: [gridengine users] Accessing qacct accounting file from login/compute nodes

2019-02-20 Thread Derrick Lin
Am 20.02.2019 um 05:31 schrieb Derrick Lin : > > > > Hi guys, > > > > On our SGE cluster, the accounting file stored on the qmaster node and > is not accessible outside. qmaster node is not accessible by any user > either. > > > > Now we have users request to

[gridengine users] Accessing qacct accounting file from login/compute nodes

2019-02-19 Thread Derrick Lin
Hi guys, On our SGE cluster, the accounting file is stored on the qmaster node and is not accessible from outside. The qmaster node is not accessible by any user either. Now we have users requesting to obtain accounting info via qacct. I am wondering what is the common way to achieve this without giving access
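As noted in the replies, qacct can be pointed at any copy of the accounting file, so one common approach is to export (or periodically copy) the file read-only to the login nodes. A sketch, with the shared path as an assumption:

    # Summarize usage for one user against a read-only copy of the file
    qacct -o someuser -f /shared/sge/default/common/accounting
    # Or look up a single finished job (1234 is a placeholder job ID)
    qacct -j 1234 -f /shared/sge/default/common/accounting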

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-10 Thread Derrick Lin
um 23:35 schrieb Derrick Lin: > > > Hi Reuti, > > > > I have to say I am still not familiar with the "-i" in qsub after > reading the man page, what does it do? > > It will be fed as stdin to the jobscript. Hence: > > $ qsub -i myfile foo.sh > >

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Derrick Lin
on SGE 8.1.9, not SGE 2011.11p1. Maybe it is worthwhile to mention that the new SGE cluster is CentOS7 based and the old one is CentOS6. Not sure if this also matters. Cheers, Derrick On Thu, Jan 10, 2019 at 9:39 AM Derrick Lin wrote: > Hi Reuti and Iyad, > > Here is my prolog script

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Derrick Lin
luffy, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 5 ### more lines omitted On Thu, Jan 10, 2019 at 9:35 AM Derrick Lin wrote: > Hi Reuti, > > I have to say I am still not familiar with the "-i" in qsub after reading > the ma

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Derrick Lin
urther messages are in "error" and "trace" 01/10/2019 09:20:22 [6782:315345]: using stdout as stderr 01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782 01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep 10m "

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Derrick Lin
a $JOB_ID' $SGE_TMP_ROOT" xfs_quota_rc=0 /usr/sbin/xfs_quota -x -c "project -s -p $TMP $JOB_ID" $SGE_TMP_ROOT ((xfs_quota_rc+=$?)) /usr/sbin/xfs_quota -x -c "limit -p bhard=$quota $JOB_ID" $SGE_TMP_ROOT ((xfs_quota_rc+=$?)) if [ $xfs_quota_rc -eq 0 ]; then exit

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-08 Thread Derrick Lin
ccessfully scheduled. Your interactive job 18 has been successfully scheduled. But the symptom at the backend compute node is the same, prolog log file generated but is empty. Cheers, Derrick On Wed, Jan 9, 2019 at 11:14 AM Derrick Lin wrote: > Hi guys, > > I just brought up a new SGE c

[gridengine users] qrsh session failed to execute prolog script?

2019-01-08 Thread Derrick Lin
Hi guys, I just brought up a new SGE cluster, but somehow the qrsh session does not work: tester@login-gpu:~$ qrsh ^Cerror: error while waiting for builtin IJS connection: "got select timeout" After I hit enter, the session just hangs forever instead of bringing me to a compute node. I have
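With the builtin interactive job support, a TCP connection has to be established between the exec host and the waiting qrsh client, so "got select timeout" often points at name resolution or a firewall between the two. Some first checks (a sketch):

    # Confirm the interactive job settings really are "builtin"
    qconf -sconf | egrep 'qlogin|rlogin|rsh'
    # Look for a firewall dropping the callback connection
    iptables -L -n        # on both the login node and the compute node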

Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-07 Thread Derrick Lin
3:49 AM Reuti wrote: > > > Am 06.12.2018 um 23:52 schrieb Derrick Lin : > > > > Hi all, > > > > We are switching to a cluster of CentOS7 with SGE 8.1.9 installed. > > > > We have a prolog script that does XFS disk space allocation according to >

Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-06 Thread Derrick Lin
.all.q 200+0 records in 200+0 records out 107374182400 bytes (107 GB) copied, 73.2242 s, 1.5 GB/s So basically only prolog script has problem. Cheers, Derrick On Fri, Dec 7, 2018 at 9:52 AM Derrick Lin wrote: > Hi all, > > We are switching to a cluster of CentOS7 with SGE 8.1.9 insta

[gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-06 Thread Derrick Lin
Hi all, We are switching to a cluster of CentOS7 with SGE 8.1.9 installed. We have a prolog script that does XFS disk space allocation according to TMPDIR. However, the prolog script does not receive TMPDIR, which should be created by the scheduler. Other variables such as JOB_ID, PE_HOSTFILE ar
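If TMPDIR really is absent from the prolog environment, it can usually be reconstructed, since SGE names the per-job directory after the job, task and queue. A fallback sketch, assuming the usual <tmpdir>/<jobid>.<taskid>.<queue> layout:

    # In the prolog script; QUEUE, JOB_ID and SGE_TASK_ID are set by SGE
    if [ -z "$TMPDIR" ]; then
        base=$(qconf -sq "$QUEUE" | awk '/^tmpdir/ {print $2}')
        TMPDIR="$base/$JOB_ID.${SGE_TASK_ID:-undefined}.$QUEUE"
    fi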

Re: [gridengine users] User job fails silently

2018-08-08 Thread Derrick Lin
ve. We don't see any storage-related contention. I am more interested in knowing where this process bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 comes from? Cheers, On Wed, Aug 8, 2018 at 6:53 PM, Reuti wrote: > > > Am 08.08.2018 um 08:15 schrieb Derrick L

[gridengine users] User job fails silently

2018-08-07 Thread Derrick Lin
Hi guys, A user reported his jobs stuck running for much longer than usual. So I went to the exec host and checked the processes; all processes owned by that user look like: `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 In qstat, it still shows the job is in running state.
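The path under spool/<host>/job_scripts/<jobid> is SGE's spooled copy of the script the user submitted, so that bash process is the job script itself being executed by the shepherd. Two quick checks on the exec host (a sketch):

    # See what the user actually submitted
    cat /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
    # See which syscall the shell is blocked in (run as root)
    strace -p $(pgrep -f 'job_scripts/1187671') 2>&1 | head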

Re: [gridengine users] Start jobs on exec host in sequential order

2018-08-07 Thread Derrick Lin
Thanks guys, I will take a look at each option. On Mon, Aug 6, 2018 at 9:52 PM, William Hay wrote: > On Wed, Aug 01, 2018 at 11:06:19AM +1000, Derrick Lin wrote: > >HI Reuti, > >The prolog script is set to run by root indeed. The xfs quota requires > >root pri

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-31 Thread Derrick Lin
/default/spool/omega-1-27/active_jobs/1187086.1/addgrpid: No such file or directory Maybe some of my scheduler conf is not correct? Regards, Derrick On Mon, Jul 30, 2018 at 7:35 PM, Reuti wrote: > > > Am 30.07.2018 um 02:31 schrieb Derrick Lin : > > > > Hi Reuti, > > &g

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-29 Thread Derrick Lin
On Sat, Jul 28, 2018 at 11:53 AM, Reuti wrote: > > > Am 28.07.2018 um 03:00 schrieb Derrick Lin : > > > > Thanks Reuti, > > > > I know little about group ID created by SGE, and also pretty much > confused with the Linux group ID. > > Yes, SGE assigns a conv

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-27 Thread Derrick Lin
07.2018 um 03:14 schrieb Derrick Lin: > > > We are using $JOB_ID as xfs_projid at the moment, but this approach > introduces problem to array jobs whose tasks have the same $JOB_ID (with > different $TASK_ID). > > > > Also it is possible that tasks from two different array j

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-26 Thread Derrick Lin
$TASK_ID on the same host cannot be maintained. That's why I am trying to generate the xfs_projid independently of SGE. On Thu, Jul 26, 2018 at 9:27 PM, Reuti wrote: > Hi, > > > Am 26.07.2018 um 06:01 schrieb Derrick Lin : > > > > Hi all, > > > &

[gridengine users] Start jobs on exec host in sequential order

2018-07-25 Thread Derrick Lin
Hi all, I am working on a prolog script which sets up an xfs quota on disk space on a per-job basis. For setting up an xfs quota on a subdirectory, I need to provide a project ID. Here is how I generate the project ID: XFS_PROJID_CF="/tmp/xfs_projid_counter" echo $JOB_ID >> $XFS_PROJID_CF xfs_projid=$(w
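Since several jobs can start on a host at the same moment, the counter needs to be updated atomically or two jobs may get the same project ID. A race-safe sketch using flock (the counter file location is an assumption):

    XFS_PROJID_CF="/tmp/xfs_projid_counter"
    exec 9>"$XFS_PROJID_CF.lock"
    flock -x 9                                     # serialize concurrent prologs
    xfs_projid=$(( $(cat "$XFS_PROJID_CF" 2>/dev/null || echo 0) + 1 ))
    echo "$xfs_projid" > "$XFS_PROJID_CF"
    flock -u 9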

[gridengine users] qrlogin session not affected by queue's h_rt

2018-04-09 Thread Derrick Lin
Hi all, I have one user's qrlogin/qrsh session that started on 4th April, but it is still active today (10th April): short.q@node-5-3.local BIP 0/16/56 12.87 linux-x64 d 1017237 10.00965 QRLOGIN

Re: [gridengine users] Job killed instantly due to h_vmem exceeds hard limit

2017-11-05 Thread Derrick Lin
euti wrote: > Hi, > > Am 02.11.2017 um 11:39 schrieb Derrick Lin: > > > Hi Reuti, > > > > One of the users indicates -S was used in his job: > > > > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S > /bin/bash -t 1-11 -v mem_requested=12

Re: [gridengine users] Job killed instantly due to h_vmem exceeds hard limit

2017-11-02 Thread Derrick Lin
ew of them have much higher maxvmem value? Regards, Derrick On Thu, Nov 2, 2017 at 6:17 PM, Reuti wrote: > Hi, > > > Am 02.11.2017 um 04:54 schrieb Derrick Lin : > > > > Dear all, > > > > Recently, I have users reported some of their jobs failed silently. I >

[gridengine users] Job killed instantly due to h_vmem exceeds hard limit

2017-11-01 Thread Derrick Lin
Dear all, Recently, I have users reporting some of their jobs failed silently. I picked one up to check and found: 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "short.q@delta-5-3.local" (8942456832.0 > limit:8589934592.0) - sending SIGKILL [root
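The log line is the execd enforcing the hard limit: the job's vmem (about 8.3 GB) crossed the queue's 8 GB h_vmem, so it was killed immediately. After the fact, the peak usage of a finished job can be read back from accounting (a sketch):

    qacct -j 610608 | egrep 'maxvmem|exit_status|failed'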

Re: [gridengine users] can't delete an exec host

2017-10-06 Thread Derrick Lin
Try qconf -de host_list Cheers, On Thu, Sep 7, 2017 at 3:22 AM, Michael Stauffer wrote: > On Wed, Sep 6, 2017 at 12:42 PM, Reuti wrote: > >> >> > Am 06.09.2017 um 17:33 schrieb Michael Stauffer : >> > >> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang >> wrote: >> > It seems SGE master did not

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-04 Thread Derrick Lin
mable value). I will experiment with it a bit more. Cheers, D On Tue, Oct 4, 2016 at 5:49 PM, Reuti wrote: > > Am 04.10.2016 um 03:41 schrieb Derrick Lin: > > > Hi all again, > > > > I have had a simple implementation working. Now I need to look at a > situation when

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-03 Thread Derrick Lin
situation, the quota should be created in a script that is specified in start_proc_args instead of prolog? Thanks On Tue, Sep 13, 2016 at 5:51 PM, William Hay wrote: > On Tue, Sep 13, 2016 at 03:15:19PM +1000, Derrick Lin wrote: > >Thanks guys, > >I am implementing

[gridengine users] inst_sge does not backup custom queue and userset

2016-09-12 Thread Derrick Lin
Hi all, I have tried the SGE conf backup solution based on inst_sge, but found that it seems to back up the original components only. None of my custom queues, PEs and usersets are backed up... Is it configurable so that it can back up **everything**? Cheers, Derrick
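If inst_sge's backup misses custom objects, the configuration can also be dumped explicitly with qconf, one file per object, and restored later with the matching -Aq/-Ap/-Au options. A sketch:

    mkdir -p backup/{queues,pes,usersets}
    for q in $(qconf -sql); do qconf -sq "$q" > "backup/queues/$q";   done
    for p in $(qconf -spl); do qconf -sp "$p" > "backup/pes/$p";      done
    for u in $(qconf -sul); do qconf -su "$u" > "backup/usersets/$u"; done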

Re: [gridengine users] Control tmpdir usage on SGE

2016-09-12 Thread Derrick Lin
Thanks guys, I am implementing the solution as outlined by William, except we are using XFS here, so we are trying to do it by using XFS's project/directory quota. Will do more testing and see how it goes.. Cheers, Derrick On Fri, Sep 9, 2016 at 11:05 PM, William Hay wrote: > On Fri, Sep 09, 2

Re: [gridengine users] Control tmpdir usage on SGE

2016-09-08 Thread Derrick Lin
ested=100G), can the prolog script simply use that value? Cheers, D On Thu, Sep 8, 2016 at 11:00 PM, William Hay wrote: > On Thu, Sep 08, 2016 at 10:10:51AM +1000, Derrick Lin wrote: > >Hi all, > >Each of our execution nodes has a scratch space mounted as > /scratch

Re: [gridengine users] Control tmpdir usage on SGE

2016-09-08 Thread Derrick Lin
08, 2016 at 10:10:51AM +1000, Derrick Lin wrote: > >Hi all, > >Each of our execution nodes has a scratch space mounted as > /scratch_local. > >I notice there is tmpdir variable can be changed in a queue's conf. > >According to doc, SGE will create a per j

[gridengine users] Control tmpdir usage on SGE

2016-09-07 Thread Derrick Lin
Hi all, Each of our execution nodes has a scratch space mounted as /scratch_local. I notice there is a tmpdir variable that can be changed in a queue's conf. According to the docs, SGE will create a per-job dir under tmpdir and set the path in the vars TMPDIR and TMP. I have set up a complex tmp_requested which a job c
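For reference, the complex behind such a request is typically defined as a consumable so that concurrent jobs on a node cannot oversubscribe the scratch space. A sketch of the qconf -mc line and the per-host capacity (values are assumptions):

    #name          shortcut  type    relop  requestable  consumable  default  urgency
    tmp_requested  tmpr      MEMORY  <=     YES          YES         0        0

    # Then per host (qconf -me node01):
    complex_values tmp_requested=800G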

Re: [gridengine users] SGE error: commlib error due to not identical clients host name

2015-09-09 Thread Derrick Lin
, 2015 at 7:19 PM, Reuti wrote: > Hi, > > > Am 08.09.2015 um 09:23 schrieb Derrick Lin : > > > > Hi guys, > > > > Thanks for the helps. I ran the SGE tools on the qmaster, and found the > issue: > > > > [root@alpha01 lx26-amd64]# ./gethostname >

Re: [gridengine users] SGE error: commlib error due to not identical clients host name

2015-09-08 Thread Derrick Lin
all I have made changes in the qmaster recently. Where I should be looking at to fix this issue? Regards, Derrick On Mon, Sep 7, 2015 at 3:04 PM, Reuti wrote: > > Am 07.09.2015 um 00:36 schrieb Derrick Lin: > > > Hi Simon, > > > > It looks normal: > > > &g

Re: [gridengine users] SGE error: commlib error due to not identical clients host name

2015-09-06 Thread Derrick Lin
, Simon Matthews wrote: > What does the rDNS show for the IP address of alpha01.local? > > Simon > > On Thu, Sep 3, 2015 at 6:44 PM, Derrick Lin wrote: > > Dear all, > > > > I have been having issue on executing all SGE commands on the qmaster, > > typica

[gridengine users] SGE error: commlib error due to not identical clients host name

2015-09-03 Thread Derrick Lin
Dear all, I have been having issues executing all SGE commands on the qmaster; typically, it gives this error: [root@alpha01 ~]# qconf -sc error: commlib error: access denied (client IP resolved to host name "alpha01.local". This is not identical to clients host name "omega-0-12.local") DNS is
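This error means SGE's own resolver maps the client's IP to a different name than the client reports for itself. SGE ships small utilities for checking exactly what it sees, and a host_aliases file can tie the conflicting names together (paths assume the lx26-amd64 arch dir used later in the thread; the alias line is an assumption):

    $SGE_ROOT/utilbin/lx26-amd64/gethostname -aname
    $SGE_ROOT/utilbin/lx26-amd64/gethostbyname -aname alpha01
    # Map both names to one canonical host if DNS cannot be fixed:
    echo "alpha01.local omega-0-12.local" >> $SGE_ROOT/default/common/host_aliases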

Re: [gridengine users] Using ssh with qrsh and qlogin but disable users direct ssh

2014-10-13 Thread Derrick Lin
avoid this - > the qlogin_daemon command is 'ssh -i -f path_to_config'. > > The 'standard' sshd.conf on the nodes does not allow login for users, but > the one the qlogin_daemon points to does. > > Tina > > > On 30/09/14 02:59, Derrick Lin wrote: >

[gridengine users] Using ssh with qrsh and qlogin but disable users direct ssh

2014-10-13 Thread Derrick Lin
Hi guys, I am trying to configure SSH as the underlying protocol for qrsh, qlogin. However, this requires allowing users to SSH into compute nodes. In that case, users can simply go to the compute nodes with SSH, bypassing SGE (qrsh, qlogin etc). I am wondering what the best way is to configure SSH to servi
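Tina's reply above describes the usual trick: run a second sshd configuration that only SGE invokes, while the system sshd denies ordinary users. A sketch (file names and group names are assumptions):

    # In qconf -mconf, point the interactive daemons at a dedicated sshd config:
    #   qlogin_daemon  /usr/sbin/sshd -i -f /etc/ssh/sshd_config_sge
    #   rlogin_daemon  /usr/sbin/sshd -i -f /etc/ssh/sshd_config_sge
    # /etc/ssh/sshd_config      (system sshd):  AllowGroups admins
    # /etc/ssh/sshd_config_sge  (SGE only):     AllowGroups users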

Re: [gridengine users] Defining multiple RQS

2014-08-26 Thread Derrick Lin
> Hi, > > Am 26.08.2014 um 07:17 schrieb Derrick Lin: > > > I currently have one RQS that defines default slots quota for every user. > > > > Now I want to add new quota for specific userset (Department), I can > either add a new limit rule inside the existing RQS

[gridengine users] Defining multiple RQS

2014-08-25 Thread Derrick Lin
Hi all, I currently have one RQS that defines a default slots quota for every user. Now I want to add a new quota for a specific userset (department); I can either add a new limit rule inside the existing RQS or create a separate new RQS. I am wondering what the difference is between them? Regards, D
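The difference matters: within a single rule set only the first matching rule is applied, whereas every rule set is evaluated independently and a job must fit within all of them. So to override the default for one department inside one rule set, the more specific rule has to come first (a sketch; names and values are assumptions):

    {
       name    slots_per_user
       enabled TRUE
       limit   users {@dept_a} to slots=200
       limit   users {*} to slots=100
    }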

[gridengine users] Access List: ACL or DEPT or both

2014-08-06 Thread Derrick Lin
Hi guys, My cluster has several ACL-type usersets for controlling queue access permissions. Recently I added the functional policy to the cluster, so I changed the type from ACL to DEPT. Now I find that the queue access permissions no longer work as before. The access_list manual doesn't
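A userset's type field can carry both values at once, so the same list can keep serving queue access control (ACL) while also acting as a department for the functional policy. A sketch of the qconf -mu form (names and shares are assumptions):

    name    bioinformatics
    type    ACL DEPT
    fshare  100
    oticket 0
    entries user1,user2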

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-08-03 Thread Derrick Lin
t 7:35 PM, Reuti wrote: > Am 01.08.2014 um 01:39 schrieb Derrick Lin: > > > Do you have > > > > params MONITOR=1 ?? > > No: > > $ qconf -ssconf > ... > params none > > > > This is what gave

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-31 Thread Derrick Lin
Do you have params MONITOR=1 ?? This is what gave me the same error. I am running GE 6.2u5 as well D On Thu, Jul 31, 2014 at 8:53 PM, Reuti wrote: > Am 31.07.2014 um 03:06 schrieb Derrick Lin: > > > Hi Reuti, > > > > That's interest

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-30 Thread Derrick Lin
rs b** queues all.q Is it illegal to set h_vmem in a per-user quota in the first place? Cheers, D On Wed, Jul 30, 2014 at 4:37 PM, Reuti wrote: > Hi, > > Am 30.07.2014 um 03:29 schrieb Derrick Lin: > > > **No** initial value per queue instance, I force the users to specify >

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-29 Thread Derrick Lin
hosts. My original issue was, when I set params MONITOR=1 jobs failed to start. Now I have MONITOR=1 removed, all jobs start and run fine. Any idea? D On Tue, Jul 29, 2014 at 7:43 PM, Reuti wrote: > Hi, > > Am 29.07.2014 um 06:07 schrieb Derrick Lin: > > > This is qhost of o

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-28 Thread Derrick Lin
rote: > Hi, > > Am 04.07.2014 um 06:04 schrieb Derrick Lin: > > > Interestingly, I have a small test cluster that basically have the same > SGE setup does *not* have such problem. h_vmem in complex is exactly the > same. The test queue instance looks almost the same (except

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-07 Thread Derrick Lin
0 Jobs start and run fine. Can anyone explain why these settings are related to job resource requests? Cheers, Derrick On Fri, Jul 4, 2014 at 2:04 PM, Derrick Lin wrote: > Interestingly, I have a small test cluster that basically have the same > SGE setup does *not* hav

Re: [gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-03 Thread Derrick Lin
ed in exechost level. Derrick On Fri, Jul 4, 2014 at 1:58 PM, Derrick Lin wrote: > Hi all, > > We start using h_vmem to control jobs by their memory usage. However jobs > couldn't start when there is -l h_vmem. The reason is > > (-l h_vmem=1G) cannot run in queue "

[gridengine users] Job couldn't start because job requests unknown resource (h_vmem)

2014-07-03 Thread Derrick Lin
Hi all, We started using h_vmem to control jobs by their memory usage. However, jobs couldn't start when -l h_vmem is present. The reason is: (-l h_vmem=1G) cannot run in queue "intel.q@delta-5-1.local" because job requests unknown resource (h_vmem) However, h_vmem is definitely on the queue instance:

Re: [gridengine users] sge_request does not add default resource request to qsub jobs

2014-07-03 Thread Derrick Lin
Hi Arnau, Indeed! It affects only the nodes that have this file! Problem solved. Thanks D On Wed, Jul 2, 2014 at 6:57 PM, Arnau Bria wrote: > On Wed, 2 Jul 2014 11:52:05 +1000 > Derrick Lin wrote: > > > Hi all, > Hi, > [...] > > mem_requested=1G in SGE_ROOT/defaul

[gridengine users] sge_request does not add default resource request to qsub jobs

2014-07-01 Thread Derrick Lin
Hi all, I have one application that relies on a custom complex attr called "mem_requested", configured for all compute nodes: $ qhost -F mem_requested HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS --
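As the follow-up shows, the culprit was an sge_request file that existed only on some nodes; default request files are read from several places and must be consistent on every submit host. For reference (precedence runs from most to least specific):

    # ./.sge_request                          (current working directory)
    # $HOME/.sge_request                      (per user)
    # $SGE_ROOT/$SGE_CELL/common/sge_request  (cluster-wide)
    echo "-l mem_requested=1G" >> $SGE_ROOT/$SGE_CELL/common/sge_request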

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Derrick Lin
gt; > For example, if the user requests 400GB of RAM, the JSV will perform > the 400/8 = 50 cores, and then rewrites it as a request for 50 cores > as well. This will decrease the user's available slots to 142. > > Ian > >> On Mon, Jun 30, 2014 at 1:48 PM, Derrick Li

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Derrick Lin
- > 400GB is in use (and 50 cores are also "in use" even though 49 are > idle), and other jobs either run somewhere else, or queue up. > > Ian > > On Mon, Jun 30, 2014 at 12:01 PM, Michael Stauffer wrote: >>> Message: 4 >>> Date: Mon, 30 Jun 2014 11:5

[gridengine users] Enforce users to use specific amount of memory/slot

2014-06-29 Thread Derrick Lin
Hi guys, A typical node on our cluster has 64 cores and 512GB memory, so it's about 8GB/core. Occasionally, we have jobs that utilize only 1 core but 400-500GB of memory, which annoys lots of users. So I am seeking a way that can force jobs to run strictly below the 8GB/core ratio or it should b
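Ian's replies above sketch the usual answer: a server-side JSV that scales the slot count to the memory request at 8 GB per slot. A minimal bash JSV along those lines (assumes memory arrives as -l h_vmem in whole gigabytes, e.g. "400G", and a PE named smp exists):

    #!/bin/bash
    . $SGE_ROOT/util/resources/jsv/jsv_include.sh

    jsv_on_verify() {
       mem=$(jsv_sub_get_param l_hard h_vmem)     # e.g. "400G"
       if [ -n "$mem" ]; then
          gb=${mem%G}
          slots=$(( (gb + 7) / 8 ))               # ceil(gb / 8)
          if [ "$slots" -gt 1 ]; then
             jsv_set_param pe_name smp
             jsv_set_param pe_min "$slots"
             jsv_set_param pe_max "$slots"
             jsv_correct "scaled to $slots slots for ${mem} at 8G/slot"
             return
          fi
       fi
       jsv_accept ""
    }

    jsv_main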