Hi all:

Running Slurm 20.11.8.  I missed a chance at a recent outage to change our 
JobAcctGatherType from 'linux' to 'cgroup'.  Our ProctrackType has been 
'cgroup' for a long time.  In short, I'm thinking it would be harmless for me
to do this now, with jobs running, and below I discuss the caveats I know of.
Have any of you made this change with jobs running, or do you see a reason why
in my case I should not?
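
For concreteness, the change I'm contemplating is this one line in slurm.conf
(ProctrackType shown only for context; it stays as it is):

    ProctrackType=proctrack/cgroup             # unchanged
    # JobAcctGatherType=jobacct_gather/linux   # current
    JobAcctGatherType=jobacct_gather/cgroup    # proposed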

More info:

I see the warnings in the doc about not changing JobAcctGatherType while jobs 
are running.  Some of you have asked SchedMD about this before:

- In 
slurm-dev.schedmd.narkive.com/EbK7qgSg/adding-jobacctgather-plugin-causing-rpc-errors#post1
 from 2013, Moe says "don't change this while jobs are running; I'll doc that." 
 (Hence it being doc'd now.)

- https://bugs.schedmd.com/show_bug.cgi?id=861 in 2014 mentioned that doing so 
would break 'sstat' for the already-running jobs.

- In https://bugs.schedmd.com/show_bug.cgi?id=2781 in 2016, SchedMD repeated the 
doc'd warning.  In that case, the user reported job tasks completing while 
Slurm considered the jobs still running.

On a dev cluster, I started a job, then changed JobAcctGatherType from 'linux' 
to 'cgroup', then restarted slurmctld, then the slurmds.  That job continued to 
run and was terminated by its timelimit.  This was replicable.
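
In case it's useful, the sequence was roughly this (the restart commands and
hostnames are illustrative, not prescriptive):

    # after editing slurm.conf on the controller and pushing it to the nodes:
    systemctl restart slurmctld                        # on the controller
    pdsh -w 'node[01-04]' systemctl restart slurmd     # on the compute nodes
    scontrol show config | grep JobAcctGatherType      # confirm the new value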

I also submitted a job with a known RAM-vs-time profile to several otherwise
idle nodes.  One node I left alone.  The other four I switched from 'linux' to
'cgroup' at various times during the jobs' lives.  We have a Prometheus
exporter which feeds a Grafana instance to graph the cgroup data.  Looking at
the 'memory' data across the nodes, one of the switched nodes reported falsely
high memory usage for the test job.  Running the same job again without
touching slurmd mid-job yielded correct, matching graphs across all the nodes.
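
The test job itself was nothing fancy; something along these lines would do
(the stress-ng invocation here is illustrative, not our exact job):

    #!/bin/bash
    #SBATCH --job-name=memprofile
    #SBATCH --mem=8G
    #SBATCH --time=00:30:00
    # hold ~6 GB for most of the job so the RAM-vs-time curve is easy to
    # eyeball in Grafana
    stress-ng --vm 1 --vm-bytes 6G --vm-keep --timeout 25m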

Suppose I switch my cluster (slurmctld, all slurmds) at time T0.  In principle
a user might want to size her jobs, happen to look at the affected one of the
memory-related metrics for a job which was running at T0, and get inaccurate
info.  Modulo that, we can afford to write off the historical memory-usage info
for the jobs running at T0 (we could tolerate any seeming inaccuracies in
fairshare arising from that info being inaccurate, and we don't yet have e.g. a
MaxTRESPerX with some RAM value).  With our 'cgroup' ProctrackType, and
requiring a mem spec on all jobs, I think we don't need to worry if a given
slurmd sends slurmctld wrong or incomprehensible information about a given
job's resource usage.
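
For what it's worth, the per-job memory cgroup on the node can be inspected
directly, independently of whatever jobacct_gather reports to slurmctld (paths
assume cgroup v1 and the default Slurm hierarchy; the uid and job ID are made
up):

    # on the compute node, for a hypothetical job 12345 run by uid 10001:
    cat /sys/fs/cgroup/memory/slurm/uid_10001/job_12345/memory.usage_in_bytes
    # and, if cgroup.conf sets ConstrainRAMSpace=yes, the enforced limit:
    cat /sys/fs/cgroup/memory/slurm/uid_10001/job_12345/memory.limit_in_bytes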

Does anyone know of a reason to think otherwise?  Thanks for reading this far :)

--
Grinning like an idiot,
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia

