[SGE-discuss] sgeexecd > sge master? (Was: Re: SGE Installation on Centos 7)

2017-04-29 Thread bergman
I've got a CentOS6 cluster happily running SoGE 8.1.6.

I'm adding CentOS7 nodes, and I'm considering
using the 8.1.9 RPMs available from Fedora COPR
(https://copr.fedorainfracloud.org/coprs/loveshack/SGE/package/gridengine/).

Are there any known issues or things to avoid when using an SGE execd
that's more recent than the SGE qmaster?

Thanks,

Mark
___
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss


[SGE-discuss] case-insensitive user names?

2018-04-12 Thread bergman
We're using SoGE 8.1.6 in an environment where users may log in to the
cluster from a Linux workstation (typically using a lower-case login
name) or a Windows desktop, where their login name (as supplied by the
enterprise Active Directory) is usually mixed-case.

On the cluster, we've created two passwd entries per-user with an
identical UID, so there's no distinction in file ownership or any
permissions or access rights at the Linux shell level. Most users don't
notice (or care) about the case that's shown when they log in.
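
For concreteness, the duplicate passwd entries look something like this
(the UID, GECOS field and home directory below are made up):

    smithj:x:5001:5001:John Smith:/home/smithj:/bin/bash
    SmithJ:x:5001:5001:John Smith:/home/smithj:/bin/bash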

However, SoGE seems to use the login name literally, not the UID.

This causes two problems:

   1.  job management: user "smithj" cannot manage (qdel, qalter) jobs
       that they submitted as "SmithJ".

   2.  scheduler weighting: with fair-share scheduling, John Smith will
       get a disproportionate share of resources if he submits jobs as
       both "smithj" and "SmithJ", compared to Jane Doe, who only submits
       jobs from her Linux machine as "doej".

Is there a way to configure SoGE to treat login IDs with a
case-insensitive match, or to use UIDs?

We use a JSV pretty extensively, but I didn't see a way to alter login
names via a JSV -- any suggestions?

Thanks,

Mark




[SGE-discuss] managing scratch space as a consumable/limited resource

2017-02-01 Thread mark . bergman
We're running S[o]GE 8.1.6 and I'm looking for suggestions re. managing
per-node local scratch space as a consumable resource.

Currently that temporary space is in a local disk on each node, with
directories named:

/scratch/${USER}

but this would be changed to something like:

/scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}

My aim is to have SGE manage scratch space as a resource similar to an
h_vmem resource request:

   1.  ensure the node has enough scratch space before running the job:
       the free space reported by "df -h /scratch" must be greater than
       $USER_SCRATCH_REQUEST

   2.  internally decrement the 'available' scratch space by the requested
       amount, even if the bytes haven't been written to disk yet

   3.  if the job exceeds the requested scratch space, kill the job

   4.  clean up the per-job scratch space when the job is finished:
           rm -rf /scratch/${USER}/${JOBID}


I understand that #1 will require a custom load sensor (df -h /scratch).
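
To be concrete, I'm picturing a minimal load sensor along these lines
(the complex name "scratch_free" and the GB units are assumptions to be
adjusted):

    #!/bin/sh
    # minimal load-sensor sketch: report free space on /scratch as an
    # assumed host-level complex named "scratch_free"
    HOST=`hostname`
    while read line; do
        # sge_execd writes "quit" on stdin when the sensor should exit
        [ "$line" = "quit" ] && exit 0
        # available space on the scratch filesystem, reported in GB
        FREE=`df -P -BG /scratch | awk 'NR==2 {print $4}'`
        echo "begin"
        echo "${HOST}:scratch_free:${FREE}"
        echo "end"
    done

The script would then be referenced by the load_sensor parameter in the
host (or global) configuration via qconf -mconf.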

Feature #2 will require defining scratch space as a consumable complex,
with the amount of scratch space defined per-node -- that's not
hard. I'm a bit concerned about the overhead of SGE running 
du -sh /scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}
for each job (up to 40) on a node, crawling deep directory trees on a
single hard drive every $schedule_interval.
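
For reference, the consumable definition itself (feature #2) would look
roughly like this -- the name, shortcut and 900G capacity are made up:

    # complex definition, added with "qconf -mc":
    #name          shortcut   type    relop   requestable  consumable  default  urgency
    scratch_free   sf         MEMORY  <=      YES          YES         0        0

    # per-node capacity, set in each execution host with "qconf -me <node>":
    complex_values        scratch_free=900G

Jobs would then request it the same way they request h_vmem, e.g.
"qsub -l scratch_free=50G ...", and the scheduler would decrement the
host's available amount by the requested value for the life of the job.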

I believe that the presence of the complex will automatically cause
SGE to kill the job (#3) if the per-user limit is exceeded, but I'm
not sure how the load sensor will communicate per-job scratch directory
space consumption, rather than space used (or available) on the entire
scratch disk.

I'd appreciate suggestions on how to ensure that the sge_execd cleans
up the per-job scratch directory at the conclusion of a job. It would
be great if there were a flag to suppress this behavior, in case the
scratch files need to be examined for debugging.
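
To make the question concrete, the sort of thing I'm imagining is a queue
epilog along these lines (the dot-separated path layout, the availability
of $JOB_ID/$SGE_TASK_ID/$USER in the epilog environment, and the KEEP-file
convention for skipping cleanup are all assumptions):

    #!/bin/sh
    # hypothetical queue epilog: remove the job's scratch directory when
    # the job finishes
    if [ -n "$SGE_TASK_ID" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
        SCRATCH_DIR="/scratch/${USER}/${JOB_ID}.${SGE_TASK_ID}"
    else
        SCRATCH_DIR="/scratch/${USER}/${JOB_ID}"
    fi

    # assumed opt-out: a job that creates a file named KEEP in its scratch
    # directory is left in place for debugging
    [ -e "${SCRATCH_DIR}/KEEP" ] && exit 0

    rm -rf "${SCRATCH_DIR}"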

Thanks,

Mark


[SGE-discuss] feature request: exempt sshd from SIGXCPU and SIGUSR1

2017-11-08 Thread mark . bergman
Scenario:
Interactive use of our cluster relies on qlogin. To limit long
idle login sessions and runaway processes, resource thresholds
for interactive jobs are set for s_rt, s_vmem and s_cpu to large
values (8hrs, 10GB, 15min), with the corresponding hard limits
being set even higher. The system-wide bash_profile traps SIGXCPU
and SIGUSR1 and sends the user a warning that they are approaching
a limit.
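
For reference, the warning hook is essentially the following (a simplified
sketch; the exact wording and the mapping of each signal to a limit are
illustrative):

    # simplified sketch of the traps set in the system-wide bash_profile
    trap 'echo "WARNING: this session is approaching a soft CPU/memory limit" >&2' XCPU
    trap 'echo "WARNING: this session is approaching its soft run-time limit" >&2' USR1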

Problem:
SIGXCPU is sent to the sshd process started by qlogin; sshd does not trap
it, so the login session closes without warning.

Requested enhancement:
Exempt the sshd initiated by qlogin from being sent any of the "soft"
resource quota signals.


-- 
Mark Bergman                      voice: 215-746-4061
mark.berg...@uphs.upenn.edu       fax:   215-614-0266
https://www.cbica.upenn.edu/
IT Technical Director, Center for Biomedical Image Computing and Analytics
Department of Radiology, University of Pennsylvania