On 3/21/19 11:49 AM, Ryan Novosielski wrote:
On Mar 21, 2019, at 11:26 AM, Prentice Bisbal <pbis...@pppl.gov> wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:

Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLURM, which I 
understand.
As a system admin who manages clusters myself, I don't understand this. Our job 
is to provide and maintain resources for our users. Part of that maintenance is 
to provide updates for security, performance, and functionality (new features) 
reasons. HPC has always been a leading-edge kind of field, so I feel this is 
even more important for HPC admins.

Yes, there can be issues caused by updates, but those can be managed with proper 
planning: Have a plan to do the actual upgrade, have a plan to test for issues, 
and have a plan to revert to an earlier version if issues are discovered. This 
is work, but it's really not all that much work, and this is exactly the work 
we are being paid to do as cluster admins.

From my own experience, I find *not* updating in a timely manner is actually 
more problematic and more work than keeping on top of updates. For example, where 
I work now, we still haven't upgraded to CentOS 7, and as a result, many basic 
libraries are older than what many of the open-source apps my users need 
require. As a result, I don't just have to install application X, I often have 
to install up-to-date versions of basic libraries like libreadline, libcurses, 
zlib, etc. And then there are the security concerns...

Okay, rant over. I'm sorry. It just bothers me when I hear fellow system admins aren't 
"very fond" of things that I think are a core responsibility of our jobs. I take 
a lot of pride in my job.
All of those things take time, and depending on where you work (not necessarily 
speaking about my current employer/employment situation), you may be ordered to 
do something else with that time. If so, all bets are off. Planned updates 
where sufficient testing time is not allotted move the associated work from 
planned work to unplanned emergency (something broken, etc.), and in some cases 
from business hours to off hours, and generate lots of support queries, etc.

I’ve never seen a paycheck signed by “Best Practices”.

Like I said in my original e-mail, my experience has taught me that NOT doing those things actually takes more time and work in the end than doing them. "Best practices" has never signed my paycheck either, but there's a reason why they get the title "best", right? I have certainly received comments on my performance reviews about how reliable my systems have been, and that certainly leads to a bigger paycheck (in theory, at least).

Prentice




