Hello everyone, and thank you for taking the time to read this. For quite some time I have had a problem using Python's shell execution facilities in combination with a cluster computing environment such as Sun Grid Engine (SGE). In particular, I wish to repeatedly execute a number of commands in sub-shells or pipes from within a single function, and each execution depends on the result of the previous one, so simply writing out a brute-force script file of commands in advance and executing it is not an option for me.
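To make the dependency concrete, the pattern in my real code is roughly the following sketch; "trainModel" and "scoreModel" are only placeholders for the actual programs in my cross-validation procedure:

----------------------------------------------
import os

parameter = "initialValue"
for fold in range( 0, 10 ):
    # Each invocation depends on the output of the previous one.
    output = os.popen( "trainModel " + parameter ).read()
    parameter = output.strip()
    os.system( "scoreModel " + parameter )
----------------------------------------------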
To isolate and exemplify my problem, I have created three files:

 (1) a Python script which captures the spirit of the code I wish to execute;
 (2) an SGE job script which actually calls python to execute the code in (1);
 (3) a simple shell script which submits (2) enough times to fill all the processors on my computing cluster and leave an additional number of jobs in the queue.

Here is the spirit of the experiment/problem, as a generator script:

generateTest.py:
----------------------------------------------
import os

# Constants
numParallelJobs = 100
testCommand = "continue" #"os.popen( \"clear\" )"
loopSize = "1000"

# First, write the test script itself.
pythonScript = open( "testScript.py", "w" )
pythonScript.write( """
import os
for i in range( 0, """ + loopSize + """ ):
    for j in range( 0, """ + loopSize + """ ):
        for k in range( 0, """ + loopSize + """ ):
            for l in range( 0, """ + loopSize + """ ):
                """ + testCommand + """
""" )
pythonScript.close()

# Second, write the SGE script file which executes the Python script.
sgeScript = open( "testScript.sge", "w" )
sgeScript.write( """
#$ -cwd
#$ -N pythonTest
#$ -e /export/home/jbbrown/errorLog
#$ -o /export/home/jbbrown/outputLog
python testScript.py
""" )
sgeScript.close()

# Finally, write a script which submits the SGE script the specified number of times.
launchScript = open( "testScript.sh", "w" )
for i in range( 0, numParallelJobs ):
    launchScript.write( "qsub testScript.sge" + os.linesep )
launchScript.close()
----------------------------------------------

Now, assume that I have about 50 processors available across 8 compute nodes, all sharing one NFS-mounted disk. If I run the code as above, where the generated script simply executes "continue" statements and does nothing else, the cluster head node reports no serious NFS daemon load. However, if I change testCommand to the os.popen() call shown in the comment above, or to an os.system() call, the NFS daemon load on my system skyrockets within seconds of the jobs being distributed to the compute nodes, even though I am doing nothing but executing the clear-screen command, which should not send any output to the stdout log location anyway. Even if I change the SGE script file to redirect standard output and standard error explicitly to /dev/null, I still have the same problem.

I believe the source of this problem is that os.popen() or os.system() calls spawn sub-shells which then read my shell resource files (.zshrc, .cshrc, .bashrc, etc.), and those files live on the NFS-mounted home directory. But I do not see an alternative to os.popen(), os.popen2/3/4(), or os.system(). The os.exec*() family cannot solve my problem, because it transfers execution to the called program and stops executing the script which called os.exec*().

Without rewriting a considerable amount of code (which performs cross-validation by repeatedly executing programs in a sub-shell) in a shell scripting language filled with a large number of conditional statements, does anyone know of a way to execute external programs in the middle of a script without referencing the shell resource files located on an NFS-mounted directory? I have read through the help(os) documentation repeatedly, but just cannot find a solution. Even a small lead or thought would be greatly appreciated.

With thanks from humid Kyoto,
J.B. Brown
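P.S. To clarify the kind of alternative I am hoping for, here is a rough, untested sketch of what I have in mind: run the external program directly via fork/exec, so that no shell (and hence no .zshrc/.bashrc) is involved, but still return control to the calling script afterwards. The program name below is only a placeholder, and I am not sure whether this direction actually avoids the NFS load problem:

----------------------------------------------
import os

pid = os.fork()
if pid == 0:
    # Child: replace this process with the external program.
    # No shell is started, so no resource files should be read.
    os.execvp( "someProgram", [ "someProgram", "arg1" ] )
else:
    # Parent: wait for the child to finish, then continue the script.
    os.waitpid( pid, 0 )
----------------------------------------------

If something along these lines (or a library call that wraps it) is the sensible way to do this, a pointer would be very welcome.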