On 04/06/2017 11:02 AM, Orion Poplawski wrote: > I've been starting to poke some more into the sge code and have some > questions and observations about the use of sge_peopen() as running external > processes, especially when configured to run with an admin user. > > First, there is: > > /* > * TODO: CLEANUP > * > * This function is DEPRECATED and should be removed with the next > * major release. > * > * This function can't be used in multi threaded environments because it > * might cause a deadlock in the executing qmaster thread. > * Use sge_peopen_r() instead. > */ > pid_t sge_peopen(const char *shell, int login_shell, const char *command, > > Does this seem reasonable to do? I've got a patch to do that if so. > > Second, this function is used by: > > * sge_execd to start the load sensor > * various functions in sge/source/libs/gdi/sge_security.c to run security > helper scripts: sge_set_cred()->get_token_cmd,get_cred, > cache_sec_cred()->get_cred, delete_credentials()->delete_cred, > store_sec_cred/2()->put_cred > * jsv_start() -> JSV_command > * sge_afs_extend_token(command) -> command > * sge_get_pids(pscommand) -> pscommand > * sge_checkprog(pscommon) -> pscommand > > Notably it is not used to launch jobs. > > The current behavior of sge_peopen_r() is to switch back to the root (or > user that started the sge_execd/qmaster command) before spawning the command. > Notably this results in load sensors being run as root, which strikes me as a > very bad idea. > I've been working on an additional patch to change peopen's behavior to > only switch root if it was requested to change the user, which currently none > of the callers do. This now has the load sensor running as sgeadmin. > > > I also changed sge_qmaster on my install to startup as the sgeadmin user > by adding: > > User=sgeadmin > > to the sge_qmaster.service unit file. So far I haven't noticed any issues. > > What I'm least sure of, and what I'm just starting to explore in more > detail is the security credential handling code. This is what got me started > in the first place as I want to start using kerberos with our system. I'll > post more on that score in a followup.
While testing out the credential handling by sge_qmaster, I found this: 04/06/2017 11:28:37|worker|vulcan7|E|could not store credentials for job 15 - command "/usr/share/gridengine/utilbin/lx-amd64/put_cred" failed with return code 10 This because sge_qmaster is ignoring SIGCHLD and setting SA_NOCLDWAIT, and thus waitpid() is returning with errno 10 - ECHILD because the child has already exited and we said we didn't care. This appears to date back quite a ways: commit fd6c976608cbde90d95cfb6a04eaee793a60ce68 Author: adoerr <adoerr> Date: Wed Nov 3 10:53:39 2004 +0000 *** empty log message *** diff --git a/Changelog b/Changelog index 482d358..57c8ee0 100644 --- a/Changelog +++ b/Changelog @@ -1,3 +1,9 @@ +AD-2004-11-03-0: Bugfix: '-m a' qsub option did leave a zombie process + Review: EB + Changed: qmaster + Issue: 1277 + Bugtraq: 5104789 + but this completely breaks sge_peopen()/sge_peclose functionality. At for now with my testing I'm going to revert this. Perhaps some mailing code will need to add the necessary waitpid() call. -- Orion Poplawski Technical Manager 720-772-5637 NWRA, Boulder/CoRA Office FAX: 303-415-9702 3380 Mitchell Lane or...@nwra.com Boulder, CO 80301 http://www.nwra.com _______________________________________________ SGE-discuss mailing list SGE-discuss@liv.ac.uk https://arc.liv.ac.uk/mailman/listinfo/sge-discuss