If your workflows are primarily CPU-bound rather than memory-bound, and since you’re the only user, you could ensure all your Slurm scripts ‘nice’ their Python commands, or use the -n flag for slurmd and the PropagatePrioProcess configuration parameter. Both approaches are described in the thread at https://lists.schedmd.com/pipermail/slurm-users/2018-September/001926.html
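For the first option, here is a minimal sketch of what one of those array scripts could look like (the script name, array range and resource values are just placeholders):

    #!/bin/bash
    #SBATCH --job-name=bg-array
    #SBATCH --array=1-10000
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=2G
    # Run the per-task work at the lowest CPU priority (nice 19) so interactive
    # use of the workstation still gets the cores when it wants them
    nice -n 19 python process.py "${SLURM_ARRAY_TASK_ID}"

The slurmd -n option and PropagatePrioProcess in slurm.conf are the daemon-level alternative; the linked thread covers how those two fit together.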
--
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Sep 22, 2018, at 1:01 AM, A <andrealp...@gmail.com> wrote:
>
> Hi John! Thanks for the reply, lots to think about.
>
> In terms of suspending/resuming, my situation might be a bit different than
> other people's. As I mentioned, this is an install on a single-node
> workstation, which is my daily office machine. I run a lot of Python
> processing scripts that have low CPU need but lots of iterations. I found it
> easier to manage these in Slurm than to write MPI/parallel processing
> routines in Python directly.
>
> Given this, sometimes I might submit a Slurm array with 10K jobs that might
> take a week to run, but I still sometimes need to do work during the day that
> requires more CPU power. In those cases I suspend the background array, crank
> through whatever I need to do, and then resume in the evening when I go home.
> Sometimes I can wait for jobs to finish; sometimes I have to break in the
> middle of running jobs.
>
> On Fri, Sep 21, 2018, 10:07 PM John Hearns <hear...@googlemail.com> wrote:
> Ashton, on a compute node with 256 GB of RAM I would not configure any swap
> at all. None. I managed an SGI UV1 machine at an F1 team which had 1 TB of
> RAM - and no swap. Also, our ICE clusters were diskless - SGI very smartly
> configured swap over iSCSI - but we disabled this, the reason being that if
> one node in a job starts swapping, the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
>
> Also, as another issue, if you have lots of RAM you need to look at the VM
> tunings for dirty ratio, background ratio and centisecs. Linux will
> aggressively cache data which is written to disk - you can get a situation
> where your processes THINK data is written to disk but it is cached; then
> what happens if there is a power loss? So get those caches flushed often.
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Oh, and my other tip: in the past vm.min_free_kbytes was ridiculously small
> on default Linux systems. I call this the 'wriggle room' when a system is
> short on RAM. Think of it like those square sliding-letter puzzles -
> min_free_kbytes is the empty square which permits the letter tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not mistaken, on
> RHEL 7 and CentOS 7 systems it is a reasonable value already).
> https://bbs.archlinux.org/viewtopic.php?id=184655
>
> Oh, and it is good to keep a terminal open with 'watch cat /proc/meminfo'.
> I have spent many a happy hour staring at that when looking at NFS
> performance etc.
>
> Back to your specific case. My point is that for HPC work you should never
> go into swap (with a normally running process, i.e. no job pre-emption). I
> find that the 20 percent rule is out of date. Yes, you should probably have
> some swap on a workstation, and yes, disk space is cheap these days.
>
> However, you do talk about job pre-emption and suspending/resuming jobs. I
> have never actually seen that being used in production. At this point I
> would be grateful for some education from the choir - is this commonly used
> and am I just hopelessly out of date? Honestly, anywhere I have managed
> systems, lower-priority jobs are either allowed to finish, or in the case of
> F1 we checkpointed and killed low-priority jobs manually if there was a
> super-high-priority job to run.
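For reference, rough command-line equivalents of the suspend/resume workflow and the VM tunings mentioned above (the job ID and sysctl values are placeholders for illustration, not recommendations):

    # Suspend every task of a background array job, then resume it later
    # (1234 stands in for the array's job ID as reported by squeue)
    scontrol suspend 1234
    scontrol resume 1234

    # Flush dirty pages to disk earlier and more often (run as root; example values only)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10
    sysctl -w vm.dirty_writeback_centisecs=100

    # Give the kernel more 'wriggle room' under memory pressure (example value)
    sysctl -w vm.min_free_kbytes=262144

    # Watch memory while jobs run
    watch cat /proc/meminfo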
> > On Fri, 21 Sep 2018 at 22:34, A <andrealp...@gmail.com> wrote:
> >
> > I have a single-node Slurm config on my workstation (18 cores, 256 GB RAM,
> > 40 TB disk space). I recently extended the array to its current size and am
> > reconfiguring my LVM logical volumes.
> >
> > I'm curious about people's thoughts on swap sizes for a node. Red Hat these
> > days recommends up to 20% of RAM size for swap, but no less than 4 GB.
> >
> > But according to the Slurm FAQ:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals
> > respectively, so swap and disk space should be sufficient to accommodate
> > all jobs allocated to a node, either running or suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale with the
> > number of jobs I might be running at any one time. E.g., if I'm running 10
> > jobs that each use 20 GB of RAM and I suspend them, do I need 200 GB of
> > swap?
> >
> > any thoughts?
> >
> > -ashton
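On the sizing question above: SIGSTOP does not release a job's memory, so in the worst case everything the suspended jobs hold resident has to fit in RAM plus swap alongside whatever runs next. A quick way to see how much resident memory the jobs are actually holding (a rough sketch; matching on the process name 'python' is an assumption about what the job steps run):

    # Sum the resident set size (RSS, reported by ps in kB) of your python processes
    ps -u "$USER" -o rss=,comm= | awk '$2 ~ /python/ {s += $1} END {printf "%.1f GB\n", s/1024/1024}'

By that logic, 10 suspended jobs each holding 20 GB resident would indeed need on the order of 200 GB of swap if all of that memory had to be pushed out to make room for new work.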