[SGE-discuss] sgeexecd > sge master? (Was: Re: SGE Installation on CentOS 7)
I've got a CentOS 6 cluster happily running SoGE 8.1.6. I'm adding CentOS 7 nodes, and I'm considering using the 8.1.9 RPMs available from Fedora COPR (https://copr.fedorainfracloud.org/coprs/loveshack/SGE/package/gridengine/).

Are there any known issues or things to avoid when running an SGE execd that's more recent than the SGE qmaster?

Thanks,

Mark
[SGE-discuss] case-insensitive user names?
We're using SoGE 8.1.6 in an environment where users may log in to the cluster from a Linux workstation (typically using a lower-case login name) or from a Windows desktop, where their login name (as supplied by the enterprise Active Directory) is usually mixed-case. On the cluster, we've created two passwd entries per user with an identical UID, so there's no distinction in file ownership, permissions, or access rights at the Linux shell level (see the P.S. below). Most users don't notice (or care about) the case that's shown when they log in.

However, SoGE seems to use the login name literally, not the UID. This causes two problems:

job management
    User "smithj" cannot manage (qdel, qalter) jobs that they submitted as "SmithJ".

scheduler weighting
    Under fair-share scheduling, John Smith will get a disproportionate share of resources if he submits jobs as both "smithj" and "SmithJ", vs. Jane Doe, who only submits jobs from her Linux machine as "doej".

Is there a way to configure SoGE to treat login IDs with a case-insensitive match, or to use UIDs? We use a JSV pretty extensively, but I didn't see a way to alter login names via a JSV -- any suggestions?

Thanks,

Mark
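P.S. For reference, the duplicated passwd entries look something like this (UID, GECOS field and paths invented for illustration):

    smithj:x:10234:10234:John Smith:/home/smithj:/bin/bash
    SmithJ:x:10234:10234:John Smith:/home/smithj:/bin/bash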
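The closest I've come to a JSV-side workaround is rejecting mixed-case submissions outright, since USER appears to be a read-only parameter in a JSV. A rough server-side sketch (messages invented), using the shell JSV library shipped with SoGE:

    #!/bin/sh
    # Sketch only: bounce jobs submitted under a mixed-case login name.
    . "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

    jsv_on_start()
    {
       return
    }

    jsv_on_verify()
    {
       user=$(jsv_get_param USER)
       lower=$(printf '%s' "$user" | tr '[:upper:]' '[:lower:]')
       if [ "$user" != "$lower" ]; then
          # USER can't be modified from a JSV, so reject with a hint.
          jsv_reject "please submit as '$lower' -- login names are case-sensitive to SGE"
          return
       fi
       jsv_accept "OK"
    }

    jsv_main

...but that just pushes the problem back onto the user.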
[SGE-discuss] managing scratch space as a consumable/limited resource
We're running S[o]GE 8.1.6 and I'm looking for suggestions on managing per-node local scratch space as a consumable resource. Currently that temporary space is on a local disk in each node, with directories named:

    /scratch/${USER}

but this would be changed to something like:

    /scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}

My aim is to have SGE manage scratch space much like an h_vmem resource request:

1. ensure the node has enough scratch space before running the job ("df -h /scratch" must report more free space than $USER_SCRATCH_REQUEST)

2. internally decrement the 'available' scratch space by the requested amount, even if no bytes have been written to disk yet

3. kill the job if it exceeds the requested scratch space

4. clean up the per-job scratch space when the job is finished (rm -rf /scratch/${USER}/${JOBID})

I understand that #1 will require a custom load sensor ("df -h /scratch"; see the P.S. below). Feature #2 will require defining scratch space as a consumable complex, with the amount of scratch space defined per node -- that's not hard (also sketched below). I'm a bit concerned about the overhead of SGE running

    du -sh /scratch/${USER}/${JOBID}${JOB_ARRAY_INDEX}

for each job (up to 40) on a node, crawling deep directory trees on a single hard drive every $schedule_interval. I believe that the presence of the complex will automatically cause SGE to kill the job (#3) if the per-user limit is exceeded, but I'm not sure how the load sensor would communicate per-job scratch-directory consumption, rather than the space used (or available) on the entire scratch disk.

I'd also appreciate suggestions on how to ensure that sge_execd cleans up the per-job scratch directory at the conclusion of a job (a possible epilog is sketched below). It would be great if there were a flag to suppress this behavior, in case scratch files need to be examined for debugging.

Thanks,

Mark
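P.S. Sketches of what I have in mind. For #1, a minimal load sensor (the "scratch_free" complex name is my invention), following the begin/end protocol sge_execd expects from a load sensor:

    #!/bin/sh
    # Load-sensor sketch: report free space on /scratch as "scratch_free".
    # sge_execd writes a line to stdin each load interval; we answer with
    # a begin/end block, and exit when told to quit.
    HOST=$(hostname)
    while read -r line; do
       [ "$line" = "quit" ] && exit 0
       # Available space on the scratch filesystem, in kilobytes.
       free_kb=$(df -Pk /scratch | awk 'NR==2 {print $4}')
       echo "begin"
       echo "${HOST}:scratch_free:${free_kb}K"
       echo "end"
    done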
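For #2, the consumable would be added via "qconf -mc" and given a per-host capacity matching each node's scratch disk, e.g.:

    # qconf -mc -- one added line:
    #name          shortcut  type    relop  requestable  consumable  default  urgency
    scratch_free   sf        MEMORY  <=     YES          YES         0        0

    # qconf -me <nodename> -- capacity of that node's scratch disk:
    complex_values   scratch_free=900G

Jobs would then request space with something like "qsub -l scratch_free=50G", and the scheduler should decrement the per-host amount at dispatch, whether or not any bytes have hit the disk yet.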
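For #4, a queue epilog along these lines; the ".keep" sentinel is just an idea for the debugging flag, not an existing SGE feature:

    #!/bin/sh
    # Epilog sketch: remove the per-job scratch directory when the job ends.
    # JOB_ID and SGE_TASK_ID are set in the epilog environment by sge_execd;
    # SGE_TASK_ID is "undefined" for non-array jobs.
    task="$SGE_TASK_ID"
    [ "$task" = "undefined" ] && task=""
    dir="/scratch/${USER}/${JOB_ID}${task}"
    # Leaving a ".keep" file in the directory preserves it for debugging.
    if [ -d "$dir" ] && [ ! -e "$dir/.keep" ]; then
       rm -rf "$dir"
    fi
    exit 0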
[SGE-discuss] feature request: exempt sshd from SIGXCPU and SIGUSR1
Scenario: Interactive use of our cluster relies on qlogin. To limit long idle login sessions and runaway processes, soft resource thresholds for interactive jobs are set for s_rt, s_vmem and s_cpu to large values (8 hrs, 10 GB and 15 min respectively), with the corresponding hard limits set even higher. The system-wide bash_profile traps SIGXCPU and SIGUSR1 and sends the user a warning that they are approaching a limit (see the P.S. below).

Problem: SIGXCPU is also sent to the sshd process initiated by qlogin. sshd does not trap the signal, so the login session closes without warning.

Requested enhancement: exempt the sshd initiated by qlogin from being sent any of the "soft" resource-quota signals.

--
Mark Bergman                                   voice: 215-746-4061
mark.berg...@uphs.upenn.edu                    fax:   215-614-0266
https://www.cbica.upenn.edu/
IT Technical Director, Center for Biomedical Image Computing and Analytics
Department of Radiology, University of Pennsylvania
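P.S. For reference, the traps in our bash_profile look roughly like this (warning text paraphrased):

    # From the system-wide bash_profile: warn interactive users when a soft
    # limit signal arrives instead of letting the shell die silently.
    # s_cpu/s_vmem overruns arrive as SIGXCPU; s_rt arrives as SIGUSR1.
    trap 'echo "WARNING: approaching the CPU or memory soft limit on this session" >&2' XCPU
    trap 'echo "WARNING: approaching the run-time soft limit on this session" >&2' USR1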