Folks,

I am trying to figure out how to advise users on starting worker daemons in 
their allocations using srun. That is, I want to be able to run “srun foo”, 
where foo starts some child process and then exits, and the child process(es) 
persist and wait for work.

Use cases for this include Apache Spark and FUSE mounts. More generally, a 
number of newer computing frameworks, particularly in the data science space, 
follow this model.

We are on Slurm 17.02.10 with the proctrack/cgroup plugin.
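
For reference, the setting as reported by scontrol (relevant line only):

$ scontrol show config | grep ProctrackType
ProctrackType           = proctrack/cgroup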

I’m using a Python script foo.py to test this (see Appendix 1 at the end of 
this e-mail). After forking, the parent exits immediately; the child writes the 
numbers 0 through 9 to /tmp/foo at one-second intervals, then the word “done”, 
and then exits.

Desired behavior in a one-node allocation:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0
1
2
3
4
5
6
7
8
9
done

Actual behavior:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0

As far as I can tell, what is going on is that when foo.py exits, Slurm 
concludes that the job step is over and kills the child; see the debug log in 
Appendix 2 at the end of this e-mail.

I have considered the following:

(1) Various command-line options, none of which has any effect on this: 
--kill-on-bad-exit=0, --no-kill, --mpi=none, --overcommit, --oversubscribe, 
--wait=0.
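
For concreteness, a representative attempt (the flags were tried individually 
and in combinations; the output was identical to the plain srun case above):

$ srun --kill-on-bad-exit=0 --no-kill --wait=0 ./foo.py && sleep 12 && cat /tmp/foo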

(2) srun --task-prolog=./foo.py true

Instead of killing foo.py’s child, this invocation waits for it to exit. Also, 
this seems to require a single executable rather than a command line.

One can work around the wait by putting the entire srun command in the 
background, but then subsequent sruns block until the child completes anyway 
(with the warning “Job step creation temporarily disabled, retrying”). Adding 
--overcommit to the first srun, the second, or both has no effect.
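
Roughly, the backgrounded variant and a subsequent srun look like this (the 
second command stands in for whatever the job script does next):

$ srun --task-prolog=./foo.py true &
$ srun hostname
srun: Job step creation temporarily disabled, retrying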

Recall that for real-world tasks, the child will run indefinitely waiting for 
work, so we can’t wait for it to finish.

(3) srun sh -c './foo.py && sleep 15': same behavior as item (2).

(4) Teach Slurm how to deal with the worker daemons somehow.

This doesn’t generalize. We want users to be able to bring whatever compute 
framework they want, without waiting for Slurm support, so they can innovate 
faster.

(5) Put the worker daemons in their own job. For example, one could start the 
Spark worker daemons in one job, with the Spark coordinator daemon and user 
work submission in a second one-node job.
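
A minimal sketch of that split (script names and node counts are hypothetical):

$ sbatch -N 8 start_spark_workers.sh   # job 1: worker daemons only
$ sbatch -N 1 run_spark_driver.sh      # job 2: coordinator daemon + user work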

This doesn’t solve the general use case. For example, with Spark, I have a 
large test suite in which starting and stopping a Spark cluster is only one of 
many tests. For FUSE, which depends on a worker daemon to implement filesystem 
operations, the mount exists to serve the rest of the job script.

(6) Change the software to not daemonize. For example, one can start Spark by 
invoking the .jar files directly, bypassing the daemonizing start script, or in 
newer versions by setting SPARK_NO_DAEMONIZE=1.
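
For example, something like the following (a sketch only; the coordinator URL 
is a placeholder, $SPARK_HOME is assumed to be set, and the script is 
start-worker.sh or start-slave.sh depending on the Spark version):

$ srun bash -c 'SPARK_NO_DAEMONIZE=1 "$SPARK_HOME"/sbin/start-worker.sh spark://coordinator-host:7077' &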

This again doesn’t generalize. I need to be able to support imperfect 
scientific software as it arrives, without hacking or framework-specific 
workarounds.

(7) Don’t launch with srun. For example, pdsh can interpret Slurm environment 
variables and use SSH to launch tasks on my allocated nodes.
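
What that looks like in practice (assuming foo.py is on a shared filesystem; 
SLURM_JOB_NODELIST is already in pdsh’s host-range syntax):

$ pdsh -R ssh -w "$SLURM_JOB_NODELIST" ./foo.py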

This works, and is what I’m doing currently, but it doesn’t scale. One or two 
dozen SSH processes on the first node of my allocation are fine, but 1000 or 
10,000 aren’t. Also, it’s a kludge since srun is specifically provided and 
optimized to launch jobs in a Slurm cluster.

My question: is there any way I can convince Slurm to let a job step’s children 
keep running beyond the end of the step, and kill them at the end of the job if 
needed? Or, less preferably, to overlap job steps?

Much appreciated,
Reid


Appendix 1: foo.py

#!/usr/bin/env python3

# Try to find a way to run daemons under srun.

import os
import socket
import sys
import time

print("starting %s %d" % (socket.gethostname(), os.getpid()))

# one fork is enough to get killed by Slurm
if (os.fork() > 0): sys.exit(0)

fp = open("/tmp/foo", "w")

fp.truncate()
for i in range(10):
   fp.write("%d\n" % i)
   fp.flush()
   time.sleep(1)

fp.write("done\n")

Appendix 2: debug log showing that job step cleanup kills the worker daemon

slurmstepd: debug level = 6
slurmstepd: debug:  IO handler started pid=62147
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: starting 1 tasks
slurmstepd: task 0 (62153) started 2018-08-27T11:03:33
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: adding task 0 pid 62153 on node 0 to jobacct
slurmstepd: debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 206670 
stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' 
already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/cpuacct/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670' already exists
slurmstepd: debug:  jobacct_gather_cgroup_memory_attach_task: jobid 206670 
stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' 
already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/memory/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/memory/slurm/uid_1001/job_206670' already exists
slurmstepd: debug2: jag_common_poll_data: 62153 mem size 0 290852 time 
0.000000(0+0)
slurmstepd: debug2: _get_sys_interface_freq_line: filename = 
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
slurmstepd: debug2:  cpu 1 freq= 2101000
slurmstepd: debug:  jag_common_poll_data: Task average frequency = 2101000 pid 
62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: energycounted = 0
slurmstepd: debug2: getjoules_task energy = 0
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' 
entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' 
entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: debug:  Sending launch resp rc=0
slurmstepd: debug:  mpi type = (null)
slurmstepd: debug:  [job 206670] attempting to run slurm task_prolog 
[/opt/slurm/task_prolog]
slurmstepd: debug:  Handling REQUEST_STEP_UID
slurmstepd: debug:  Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: debug:  _handle_signal_container for step=206670.62 uid=0 signal=995
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 
18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 
18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 
18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 
18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 
16384
slurmstepd: debug2: _set_limit: RLIMIT_RSS    : max:inf cur:inf req:257698037760
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 
8192
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in 
value: 65536
slurmstepd: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 
18446744073709551615
slurmstepd: debug2: Set task rss(245760 MB)
starting fg001.localdomain 62153
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug2: removing task 0 pid 62153 from jobacct
slurmstepd: task 0 (62153) exited with exit code 0.
slurmstepd: debug:  [job 206670] attempting to run slurm task_epilog 
[/opt/slurm/task_epilog]
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0): Device 
or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete 
/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0 Device or 
resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62): Device or 
resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete 
/sys/fs/cgroup/cpuacct Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete 
/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete 
/sys/fs/cgroup/cpuacct/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0): Device 
or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete 
/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0 Device or 
resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62): Device or 
resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete 
/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62 Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete 
/sys/fs/cgroup/memory/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/memory/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete 
/sys/fs/cgroup/memory/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62): Device or 
resource busy
slurmstepd: debug:  _slurm_cgroup_destroy: problem deleting step cgroup path 
/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62: Device or resource 
busy
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: xcgroup_delete: 
rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001): Device or resource busy
slurmstepd: debug:  step_terminate_monitor_stop signalling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 62147
slurmstepd: debug:  Waiting for IO
slurmstepd: debug:  Closing debug channel
