We usually set up a reservation for maintenance. This prevents jobs from starting if they are not expected to end before the reservation (maintenance) starts. As Paul indicated, this causes nodes to become idle (and the pending job queue to grow) as the maintenance time approaches, but it avoids requiring users to resubmit partially completed jobs, especially since many of our users do not adequately checkpoint.
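As a rough illustration of the rule described above: a job is only allowed to start if its requested walltime fits before the reservation begins. This is a simplified model in Python, not Slurm's actual scheduling code; the function name `can_start` and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def can_start(now: datetime, walltime: timedelta, reservation_start: datetime) -> bool:
    """Return True if a job started now would end before the reservation begins.

    Simplified assumption: the job is judged by its full requested walltime,
    not by how long it actually runs.
    """
    return now + walltime <= reservation_start


# Hypothetical maintenance window start and current time (2 days before it).
maintenance = datetime(2020, 8, 20, 8, 0)
now = maintenance - timedelta(days=2)

short_job = timedelta(days=1)  # fits before maintenance
long_job = timedelta(days=3)   # would still be running when maintenance starts

print(can_start(now, short_job, maintenance))  # True
print(can_start(now, long_job, maintenance))   # False
```

The long job stays pending until after the maintenance, which is why the queue grows and nodes go idle as the reservation approaches, while short jobs keep running.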
Draining all of the nodes has the disadvantage of potentially increasing cluster idle time even more: if your maximum walltime is 3 days and you start draining at T-3d, and all jobs on the nodes have a walltime of at most 1d, then the cluster is completely idle at T-2d. That is fine if you can perform the maintenance then and finish 2d early, but problematic if you can't, as no jobs can run for those 2 days. With a reservation, short jobs continue to run until the reservation starts. But draining nodes is useful when you can perform the maintenance early if nodes become available, and particularly in cases where only a limited number of nodes are involved.

On Thu, Aug 6, 2020 at 1:54 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:

> Because we want to maximize usage we actually have opted to just cancel
> all running jobs the day of. We send out notification to all the users
> that this will happen. We haven't really seen any complaints and we've
> been doing this for years. At the start of the outage we set all
> partitions to down, then run a cancel over all the running jobs. Pending
> jobs are left in place, and users are allowed to submit work during the
> outage and when we reopen everything gets going again.
>
> So there is a third option, though you have to accept that jobs will be
> cancelled to pull it off.
>
> -Paul Edmon-
> On 8/6/2020 1:13 PM, Jason Simms wrote:
>
> Hello all,
>
> Later this month, I will have to bring down, patch, and reboot all nodes
> in our cluster for maintenance. The two options available to set nodes into
> a maintenance mode seem to be either: 1) creating a system-wide
> reservation, or 2) setting all nodes into a DRAIN state.
>
> I'm not sure it really matters either way, but is there any preference one
> way or the other? Any gotchas I should be aware of?
>
> Warmest regards,
> Jason
>
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
>
>

--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads
paye...@umd.edu
5825 University Research Park
(301) 405-6135
University of Maryland
College Park, MD 20740-3831