Hi, I am wondering about the exact execution order of prolog scripts and plugins in Slurm. My goal is to access the freshly created cgroups (made by the task/cgroup plugin) in our prolog/epilog scripts, which run with PrologFlags=Alloc to ensure the traditional batch system behaviour.
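For context, the relevant part of our slurm.conf looks roughly like this (the script paths and plugin combination are placeholders, not our exact setup):

```
# hypothetical slurm.conf excerpt
PrologFlags=Alloc          # run the prolog at allocation time, batch-style
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh
TaskPlugin=task/cgroup,task/affinity
```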
We want specific information: the prepared cpuset for the job in the prolog, and the statistics/counter differences in the epilog. I am aware of the accounting and profiling options in Slurm and its plugins, but there are reasons I want to handle the cgroup information myself, perhaps even to experiment with things that might go into a Slurm plugin at some point.

As far as I can tell, the job cgroups are created after the prolog scripts have run and destroyed before the epilog scripts run (correct? — it looks like that). The design seems to focus on individual job steps, with things run closely coupled to the possibly multiple components (steps, tasks) of batch jobs, of which I only have a hazy concept.

Is there a standard way to have the cgroup hierarchy for the job created early, before the per-node prolog script that runs as root (the slurmd user), with the final cleanup happening later, after the epilog has run? If configuration cannot do it, I thought about modifying task/cgroup, but I suspect that the whole scope of the plugin lies between the prolog and the epilog. Can someone confirm that? I would welcome pointers to documentation that explains in detail when which parts of a plugin are run in relation to the slot the prolog scripts get.

With https://bugs.schedmd.com/show_bug.cgi?id=9429, there seems to be a way to keep the cgroup around longer: just sabotage the cleanup phase and do it later in the epilog (as I do now on a Ubuntu 20.04 cluster with the distro-provided slurmd that suffers from this bug). But will, e.g., moving the code from task_p_pre_setuid() to task_p_slurmd_reserve_resources() give me early access in the prolog? I might just try and break something, but I have not yet found documentation on these details of the plugin API, and for once thought that asking around first might also be good.

I want cpuset information and at least things like per-node memory high-water marks. The desired granularity is the job level, and it would be nice to get rid of inefficient timeseries that only approximate that.
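To illustrate what I mean by cpuset information in the prolog, here is a minimal sketch that parses the kernel's cpuset list format (as found in cpuset.cpus / cpuset.cpus.effective). The cgroup path in job_cpuset() is hypothetical — the actual hierarchy depends on the Slurm version and cgroup v1/v2 setup — and of course it assumes the job cgroup already exists when the prolog runs, which is exactly what does not happen currently:

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Parse a kernel cpuset list like '0-3,8,10-11' into CPU ids."""
    cpus = []
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def job_cpuset(job_id: str) -> list[int]:
    # Hypothetical cgroup v2 path; the real layout differs between
    # Slurm versions and cgroup v1/v2 configurations.
    path = f"/sys/fs/cgroup/slurm/job_{job_id}/cpuset.cpus.effective"
    with open(path) as f:
        return parse_cpu_list(f.read())
```

So, e.g., parse_cpu_list("0-3,8,10-11") yields [0, 1, 2, 3, 8, 10, 11]; with the path question settled, the prolog could record exactly which CPUs the job got.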
The cpuset is needed before user programs start, as I hook a listener to the taskstats interface to cheaply and accurately account for user processes (kernel tasks) with command names. My profiling sits somewhere between the HDF5 timeseries and the rough values you get out of sacct, with an orthogonal bit about kernel tasks (to tell the user how many python processes wasted how much memory each).

Alrighty then,

Thomas

PS: I guess a lot is possible by writing a custom plugin that ties in with what my prolog/epilog scripts do, but I would prefer a light touch first. Hacking the scripts during development is far more convenient.

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg