On 04/06/2017 11:02 AM, Orion Poplawski wrote:
> I've been starting to poke some more into the sge code and have some
> questions and observations about the use of sge_peopen() for running external
> processes, especially when configured to run with an admin user.
>
> First, there is:
>
> /*
> * TODO: CLEANUP
> *
> * This function is DEPRECATED and should be removed with the next
> * major release.
> *
> * This function can't be used in multi threaded environments because it
> * might cause a deadlock in the executing qmaster thread.
> * Use sge_peopen_r() instead.
> */
> pid_t sge_peopen(const char *shell, int login_shell, const char *command,
>
> Does removing it in favor of sge_peopen_r(), as the comment suggests, seem
> reasonable to do? I've got a patch to do that if so.
>
> Second, this function is used by:
>
> * sge_execd to start the load sensor
> * various functions in sge/source/libs/gdi/sge_security.c to run security
> helper scripts: sge_set_cred()->get_token_cmd,get_cred,
> cache_sec_cred()->get_cred, delete_credentials()->delete_cred,
> store_sec_cred/2()->put_cred
> * jsv_start() -> JSV_command
> * sge_afs_extend_token(command) -> command
> * sge_get_pids(pscommand) -> pscommand
> * sge_checkprog(pscommand) -> pscommand
>
> Notably it is not used to launch jobs.
>
> The current behavior of sge_peopen_r() is to switch back to root (or the
> user that started the sge_execd/qmaster command) before spawning the command.
> Notably this results in load sensors being run as root, which strikes me as a
> very bad idea.
> I've been working on an additional patch to change peopen's behavior to
> switch to root only if it was asked to change the user, which currently none
> of the callers do. With the patch, the load sensor runs as sgeadmin.
>
>
> I also changed sge_qmaster on my install to start up as the sgeadmin user
> by adding:
>
> User=sgeadmin
>
> to the sge_qmaster.service unit file. So far I haven't noticed any issues.
>
> What I'm least sure of, and what I'm just starting to explore in more
> detail, is the security credential handling code. This is what got me started
> in the first place, as I want to start using Kerberos with our system. I'll
> post more on that score in a followup.

While testing out the credential handling by sge_qmaster, I found this:

04/06/2017 11:28:37|worker|vulcan7|E|could not store credentials for job 15 - command "/usr/share/gridengine/utilbin/lx-amd64/put_cred" failed with return code 10

This is because sge_qmaster ignores SIGCHLD and sets SA_NOCLDWAIT, so
waitpid() fails with errno 10 (ECHILD): the child has already exited and been
reaped automatically, since we said we didn't care about its status.

The SIGCHLD handling in sge_qmaster appears to date back quite a ways:

commit fd6c976608cbde90d95cfb6a04eaee793a60ce68
Author: adoerr <adoerr>
Date: Wed Nov 3 10:53:39 2004 +0000
*** empty log message ***
diff --git a/Changelog b/Changelog
index 482d358..57c8ee0 100644
--- a/Changelog
+++ b/Changelog
@@ -1,3 +1,9 @@
+AD-2004-11-03-0: Bugfix: '-m a' qsub option did leave a zombie process
+ Review: EB
+ Changed: qmaster
+ Issue: 1277
+ Bugtraq: 5104789
+

However, this change completely breaks sge_peopen()/sge_peclose()
functionality. At least for now, for my testing, I'm going to revert it.
Perhaps the mailing code will then need to add the necessary waitpid() call.
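
As a rough idea of what an explicit waitpid() in a caller could look like, here is a minimal sketch. It is not the actual qmaster mail code; `run_and_reap` is a hypothetical helper, and it assumes SIGCHLD is left at its default disposition so the exit status stays available.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of SGE): run argv[0] with the given
 * arguments and reap the child explicitly, so no zombie is left behind.
 * Returns the child's exit status, or -1 if fork/waitpid failed or the
 * child was terminated by a signal.
 */
static int run_and_reap(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    int status = 0;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);   /* retry if interrupted by a signal */
    } while (r < 0 && errno == EINTR);

    if (r < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This blocks the caller until the child exits; code that must not block would need a different approach, such as reaping later with WNOHANG.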
--
Orion Poplawski
Technical Manager 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com