[SGE-discuss] managing scratch space as a consumable/limited resource

2017-02-01 Thread mark.bergman
We're running S[o]GE 8.1.6 and I'm looking for suggestions re. managing
per-node local scratch space as a consumable resource.

Currently that temporary space is in a local disk on each node, with
directories named:

/scratch/${USER}

but this would be changed to something like:

/scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}

My aim is to have SGE manage scratch space as a resource similar to an
h_vmem resource request:

   1.   ensure the node has enough scratch space before running the job
        (the free space reported by "df -h /scratch" must exceed
        $USER_SCRATCH_REQUEST)

   2.   internally decrement the 'available' scratch space according to the
requested amount, even if bytes aren't written to disk yet

   3.   if the job exceeds the requested scratch space, kill the job

   4.   clean the per-job scratch space when the job is finished
rm -rf /scratch/${USER}/${JOBID}
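
Creating the per-job directory with the layout above could be handled by a
queue prolog. A minimal sketch, assuming SGE's JOB_ID and SGE_TASK_ID
environment variables (SGE_TASK_ID is the literal string "undefined" for
non-array jobs); SCRATCH_BASE is parameterized here purely so the sketch can
be exercised outside a node, where it would simply be /scratch:

```shell
#!/bin/sh
# Hypothetical prolog (queue "prolog" setting) that creates the
# per-job scratch directory before the job script starts.
SCRATCH_BASE="${SCRATCH_BASE:-/scratch}"

make_scratch() {
    # SGE sets SGE_TASK_ID to "undefined" for non-array jobs
    task="${SGE_TASK_ID:-}"
    [ "$task" = "undefined" ] && task=""
    dir="${SCRATCH_BASE}/${USER}/${JOB_ID}${task}"
    # create the directory, private to the job owner
    mkdir -p "$dir" && chmod 700 "$dir"
}

# make_scratch   # uncomment when installed as the queue prolog
```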


I understand that #1 will require a custom load sensor (df -h /scratch). 
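
A sketch of such a load sensor, assuming a complex named "scratch_free"
(the name is an assumption, not an existing complex). The protocol:
sge_execd writes a line to the sensor each load interval; the sensor
replies with a begin/end-delimited report and exits on "quit":

```shell
#!/bin/sh
# Sketch of a load sensor reporting free scratch space per host.
HOST=$(hostname)

report_scratch() {
    # POSIX df -P with 1M blocks; free space is column 4 of row 2
    free_mb=$(df -Pm /scratch 2>/dev/null | awk 'NR==2 {print $4}')
    echo "begin"
    echo "${HOST}:scratch_free:${free_mb:-0}M"
    echo "end"
}

# Main loop as sge_execd drives it (uncomment when installing):
# while read -r line; do
#     [ "$line" = "quit" ] && exit 0
#     report_scratch
# done
```

If scratch_free is also defined as a consumable, the scheduler should use
the more restrictive of the measured load value and the internal
consumable bookkeeping.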

Feature #2 will require defining scratch space as a consumable complex,
with the amount of scratch space defined per-node -- that's not
hard. I'm a bit concerned about the overhead of SGE running 
du -sh /scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}
for each job (up to 40) on a node, crawling deep directory trees on a
single hard drive every $schedule_interval.
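
For reference, the consumable side of #2 might be defined roughly as follows
(the complex name scratch_free, its shortcut, and the 900G capacity are all
assumptions, not values from this setup):

```
# Complex definition (qconf -mc) -- one line in the complex list:
#name          shortcut  type    relop  requestable  consumable  default  urgency
scratch_free   sf        MEMORY  <=     YES          YES         0        0

# Per-node capacity (qconf -me <nodename>):
complex_values        scratch_free=900G

# A job would then request its scratch allocation at submission, e.g.:
#   qsub -l scratch_free=50G job.sh
```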

I believe that the presence of the complex will automatically cause
SGE to kill the job (#3) if the per-user limit is exceeded, but I'm
not sure how the load sensor will communicate per-job scratch directory
space consumption, rather than space used (or available) on the entire
scratch disk.

I'd appreciate suggestions on how to ensure that the sge_execd cleans
up the per-job scratch directory at the conclusion of a job. It would
be great if there were a flag to suppress this behavior, in case scratch
files needed to be examined for debugging.
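
Absent such a flag, a queue epilog is one place to do the cleanup. A minimal
sketch with an assumed KEEP_SCRATCH opt-out variable (e.g. passed via
qsub -v KEEP_SCRATCH=1) for the debugging case; SCRATCH_BASE is
parameterized only so the sketch can be exercised outside a node:

```shell
#!/bin/sh
# Hypothetical epilog (queue "epilog" setting) removing the per-job
# scratch directory. KEEP_SCRATCH is an assumed opt-out flag.
SCRATCH_BASE="${SCRATCH_BASE:-/scratch}"

cleanup_scratch() {
    # SGE sets SGE_TASK_ID to "undefined" for non-array jobs
    task="${SGE_TASK_ID:-}"
    [ "$task" = "undefined" ] && task=""
    dir="${SCRATCH_BASE}/${USER}/${JOB_ID}${task}"
    if [ "${KEEP_SCRATCH:-0}" = "1" ]; then
        echo "epilog: keeping ${dir} for debugging" >&2
    elif [ -n "${JOB_ID:-}" ] && [ -d "$dir" ]; then
        # only remove a directory that belongs to a real job
        rm -rf "$dir"
    fi
}

# cleanup_scratch   # uncomment when installed as the queue epilog
```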

Thanks,

Mark
___
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss


[SGE-discuss] feature request: exempt sshd from SIGXCPU and SIGUSR1

2017-11-08 Thread mark.bergman
Scenario:
Interactive use of our cluster relies on qlogin. To limit long
idle login sessions and runaway processes, resource thresholds
for interactive jobs are set for s_rt, s_vmem and s_cpu to large
values (8hrs, 10GB, 15min), with the corresponding hard limits
being set even higher. The system-wide bash_profile traps SIGXCPU
and SIGUSR1 and sends the user a warning that they are approaching
a limit.
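
The traps described above might look roughly like this in the system-wide
bash_profile (the warning wording is illustrative, not the site's actual
text):

```shell
# Warn the user when SGE delivers a soft-limit signal, instead of
# letting the default disposition terminate the session.
trap 'echo "WARNING: approaching the CPU time limit (SIGXCPU); save your work." >&2' XCPU
trap 'echo "WARNING: a soft resource limit is near (SIGUSR1)." >&2' USR1
```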

Problem:
SIGXCPU is sent to the sshd process initiated by qlogin. This is not
trapped, causing the login session to close without warning.

Requested enhancement:
Exempt the sshd initiated by qlogin from being sent any of the "soft"
resource quota signals.


-- 
Mark Bergman                    voice: 215-746-4061
mark.berg...@uphs.upenn.edu     fax:   215-614-0266
https://www.cbica.upenn.edu/
IT Technical Director, Center for Biomedical Image Computing and Analytics
Department of Radiology, University of Pennsylvania