On 04/06/2017 11:02 AM, Orion Poplawski wrote:
>     I've been starting to poke some more into the sge code and have some
> questions and observations about the use of sge_peopen() as running external
> processes, especially when configured to run with an admin user.
> 
> First, there is:
> 
> /*
>  * TODO: CLEANUP
>  *
>  * This function is DEPRECATED and should be removed with the next
>  * major release.
>  *
>  * This function can't be used in multi threaded environments because it
>  * might cause a deadlock in the executing qmaster thread.
>  * Use sge_peopen_r() instead.
>  */
> pid_t sge_peopen(const char *shell, int login_shell, const char *command,
> 
> Does this seem reasonable to do?  I've got a patch to do that if so.
> 
> Second, this function is used by:
> 
> * sge_execd to start the load sensor
> * various functions in sge/source/libs/gdi/sge_security.c to run security
> helper scripts: sge_set_cred()->get_token_cmd,get_cred,
> cache_sec_cred()->get_cred, delete_credentials()->delete_cred,
> store_sec_cred/2()->put_cred
> * jsv_start() -> JSV_command
> * sge_afs_extend_token(command) -> command
> * sge_get_pids(pscommand) -> pscommand
> * sge_checkprog(pscommon) -> pscommand
> 
> Notably it is not used to launch jobs.
> 
>     The current behavior of sge_peopen_r() is to switch back to the root (or
> user that started the sge_execd/qmaster command) before spawning the command.
> Notably this results in load sensors being run as root, which strikes me as a
> very bad idea.
>     I've been working on an additional patch to change peopen's behavior to
> only switch root if it was requested to change the user, which currently none
> of the callers do.  This now has the load sensor running as sgeadmin.
> 
> 
>     I also changed sge_qmaster on my install to startup as the sgeadmin user
> by adding:
> 
> User=sgeadmin
> 
> to the sge_qmaster.service unit file.  So far I haven't noticed any issues.
> 
>     What I'm least sure of, and what I'm just starting to explore in more
> detail is the security credential handling code.  This is what got me started
> in the first place as I want to start using kerberos with our system.  I'll
> post more on that score in a followup.

While testing out the credential handling by sge_qmaster, I found this:

04/06/2017 11:28:37|worker|vulcan7|E|could not store credentials for job 15 -
command "/usr/share/gridengine/utilbin/lx-amd64/put_cred" failed with return
code 10

This because sge_qmaster is ignoring SIGCHLD and setting SA_NOCLDWAIT, and
thus waitpid() is returning with errno 10 - ECHILD because the child has
already exited and we said we didn't care.

This appears to date back quite a ways:

commit fd6c976608cbde90d95cfb6a04eaee793a60ce68
Author: adoerr <adoerr>
Date:   Wed Nov 3 10:53:39 2004 +0000

    *** empty log message ***

diff --git a/Changelog b/Changelog
index 482d358..57c8ee0 100644
--- a/Changelog
+++ b/Changelog
@@ -1,3 +1,9 @@
+AD-2004-11-03-0: Bugfix:    '-m a' qsub option did leave a zombie process
+                 Review:    EB
+                 Changed:   qmaster
+                 Issue:     1277
+                 Bugtraq:   5104789
+

but this completely breaks sge_peopen()/sge_peclose functionality.  At for now
with my testing I'm going to revert this.  Perhaps some mailing code will need
to add the necessary waitpid() call.


-- 
Orion Poplawski
Technical Manager                          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

Reply via email to