Folks, I am trying to figure out how to advise users on starting worker daemons in their allocations using srun. That is, I want to be able to run “srun foo”, where foo starts some child process and then exits, and the child process(es) persist and wait for work.
Use cases for this include Apache Spark and FUSE mounts. In general, there seem to be a number of newer computing frameworks that follow this model, particularly in the data science space.

We are on Slurm 17.02.10 with the proctrack/cgroup plugin. I'm using a Python script, foo.py, to test this (included at the end of this e-mail). After forking, the parent exits immediately; the child writes the numbers 0 through 9 to /tmp/foo at one-second intervals, then the word "done", and then exits.

Desired behavior in a one-node allocation:

  $ srun ./foo.py && sleep 12 && cat /tmp/foo
  starting cn001.localdomain 79615
  0
  1
  2
  3
  4
  5
  6
  7
  8
  9
  done

Actual behavior:

  $ srun ./foo.py && sleep 12 && cat /tmp/foo
  starting cn001.localdomain 79615
  0

As far as I can tell, when foo.py exits, Slurm concludes that the job step is over and kills the child; see the debug log at the end of this e-mail.

I have considered the following:

(1) Various command-line options, none of which has any effect on this: --kill-on-bad-exit=0, --no-kill, --mpi=none, --overcommit, --oversubscribe, --wait=0.

(2) srun --task-prolog=./foo.py true

    Instead of killing foo.py's child, this invocation waits for it to exit. It also seems to require a single executable rather than a command line. One can avoid the wait by putting the entire command in the background, but then subsequent sruns wait until the child completes anyway (with the warning "Job step creation temporarily disabled, retrying"). --overcommit on the first, second, or both sruns has no effect. Recall that for real-world tasks the child runs indefinitely waiting for work, so we can't wait for it to finish.

(3) srun sh -c './foo.py && sleep 15'

    Same behavior as item 2.

(4) Teach Slurm how to deal with the worker daemons somehow. This doesn't generalize: we want users to be able to bring whatever compute framework they want, without waiting for Slurm support, so they can innovate faster.

(5) Put the worker daemons in their own job. For example, one could start the Spark worker daemons in one job, with the Spark coordinator daemon and user work submission in a second one-node job. This doesn't cover the general use case. In the case of Spark, I have a large test suite in which starting and stopping a Spark cluster is only one of many tests. For FUSE, which depends on a worker daemon to implement filesystem operations, the mount exists to serve the needs of the rest of the job script.

(6) Change the software so it doesn't daemonize. For example, one can start Spark by invoking the .jar files directly, bypassing the daemonizing start script, or in newer versions by setting SPARK_NO_DAEMONIZE=1. Again, this doesn't generalize: I need to be able to support imperfect scientific software as it arrives, without hacking or framework-specific workarounds.

(7) Don't launch with srun. For example, pdsh can interpret Slurm environment variables and uses SSH to launch tasks on my allocated nodes. This works, and it's what I'm doing currently (a rough sketch of the invocation appears after my question below), but it doesn't scale: one or two dozen SSH processes on the first node of my allocation are fine, but 1,000 or 10,000 aren't. It's also a kludge, since srun is specifically provided and optimized for launching tasks in a Slurm cluster.

My question: Is there any way I can convince Slurm to let a job step's children keep running beyond the end of the step, killing them at the end of the job if needed? Or, less preferably, to overlap job steps?
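For concreteness, here is roughly what the pdsh workaround from item (7) looks like in my batch scripts today. Treat it as a sketch rather than a recipe: the exact flags depend on how pdsh was built on your system (mine accepts hostlist ranges, so SLURM_JOB_NODELIST can be passed straight through).

  # Launch the daemonizing script on every allocated node over SSH instead
  # of via srun. Run from inside the allocation, e.g. in the sbatch script.
  # (Sketch only; pdsh module and flag details vary by installation.)
  pdsh -R ssh -w "$SLURM_JOB_NODELIST" ./foo.py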
Much appreciated,
Reid

Appendix 1: foo.py

#!/usr/bin/env python3

# Try to find a way to run daemons under srun.

import os
import socket
import sys
import time

print("starting %s %d" % (socket.gethostname(), os.getpid()))

# one fork is enough to get killed by Slurm
if os.fork() > 0:
    sys.exit(0)

fp = open("/tmp/foo", "w")
fp.truncate()
for i in range(10):
    fp.write("%d\n" % i)
    fp.flush()
    time.sleep(1)
fp.write("done\n")

Appendix 2: error log showing job step cleanup removes the worker daemon

slurmstepd: debug level = 6
slurmstepd: debug: IO handler started pid=62147
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: starting 1 tasks
slurmstepd: task 0 (62153) started 2018-08-27T11:03:33
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: adding task 0 pid 62153 on node 0 to jobacct
slurmstepd: debug: jobacct_gather_cgroup_cpuacct_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001' already exists
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670' already exists
slurmstepd: debug: jobacct_gather_cgroup_memory_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001' already exists
slurmstepd: debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001/job_206670' already exists
slurmstepd: debug2: jag_common_poll_data: 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
slurmstepd: debug2: cpu 1 freq= 2101000
slurmstepd: debug: jag_common_poll_data: Task average frequency = 2101000 pid 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: energycounted = 0
slurmstepd: debug2: getjoules_task energy = 0
slurmstepd: debug: Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: debug: Sending launch resp rc=0
slurmstepd: debug: mpi type = (null)
slurmstepd: debug: [job 206670] attempting to run slurm task_prolog [/opt/slurm/task_prolog]
slurmstepd: debug: Handling REQUEST_STEP_UID
slurmstepd: debug: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: debug: _handle_signal_container for step=206670.62 uid=0 signal=995
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 16384
slurmstepd: debug2: _set_limit: RLIMIT_RSS : max:inf cur:inf req:257698037760
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 8192
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 65536
slurmstepd: debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
slurmstepd: debug2: Set task rss(245760 MB)
starting fg001.localdomain 62153
slurmstepd: debug: Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug2: removing task 0 pid 62153 from jobacct
slurmstepd: task 0 (62153) exited with exit code 0.
slurmstepd: debug: [job 206670] attempting to run slurm task_epilog [/opt/slurm/task_epilog]
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62: Device or resource busy
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001): Device or resource busy
slurmstepd: debug: step_terminate_monitor_stop signalling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 62147
slurmstepd: debug: Waiting for IO
slurmstepd: debug: Closing debug channel