TL;DR:  our build system is ancient, EOLed and about to get hit hard w/a
secops hammer.  we need to literally reinstall the entire cluster from
scratch and get things working.

here are the high level bullet points about what's coming up in the next
month:

** all amp-jenkins-worker-* nodes are running centos 6, and the remaining
nodes are running ubuntu 16.  these will all be upgraded to ubuntu 20.

i will be doing this in stages so as to minimize downtime.

** ALL BUILDS NEED TO BE PORTED TO UBUNTU 20.  i can ensure that the
environments on the nodes are identical, but i have yet to successfully
build any SBT jobs on any version of ubuntu, and the MVN builds won't run
on ubuntu 18 (tho they work fine on 16).  i have also had difficulty
getting the PRB job to finish successfully on ubuntu.

for this, i will definitely need help from the dev community to get things
working...  and the speed at which things are fixed will be directly
proportional to how much help i get.  :)

** amplab jenkins primary node will need two major upgrades:  OS from
centos 6 to ubuntu 20, and jenkins from 1.6 to 2.X LTS...

i'm most concerned about this, as it is literally the exact same jenkins
installation that patrick wendell set up over 10 years ago.  there are many
publish secrets that are entered into the jenkins config and i really
hope that we don't lose them.

my plan here is to upgrade the current jenkins and fix anything that
breaks.  then we'll rsync jenkins' homedir to the new primary node and hope
that works.  :)
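
a rough sketch of what the rsync step above might look like, purely for
illustration.  the homedir path (/var/lib/jenkins) and the hostname
(new-primary.example.org) are assumptions, not the real values, and the
helper name build_rsync_command is made up.  it only builds the command and
defaults to a dry run, so nothing is copied until someone has looked at the
output.  one thing worth remembering: jenkins keeps its encryption keys
under the secrets/ directory in its homedir, so the stored credentials are
only usable on the new node if that directory comes along with everything
else.

```python
import shlex


def build_rsync_command(src_home, dest_host, dest_home, dry_run=True):
    """Build (but do not run) an rsync command that copies a jenkins homedir.

    All arguments are placeholders supplied by the caller; nothing here is
    the real amplab configuration.
    """
    # -a preserves permissions, ownership and timestamps; -H keeps hard links;
    # --numeric-ids avoids remapping uid/gid by name on the destination.
    cmd = ["rsync", "-aH", "--numeric-ids"]
    if dry_run:
        # Only report what would be transferred.
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory's contents
    # (including secrets/) rather than nesting it one level deeper.
    cmd.append(src_home.rstrip("/") + "/")
    cmd.append(f"{dest_host}:{dest_home.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # Hypothetical path and hostname, for illustration only.
    cmd = build_rsync_command(
        "/var/lib/jenkins", "new-primary.example.org", "/var/lib/jenkins"
    )
    print(shlex.join(cmd))
```

jenkins would need to be stopped on the old primary before the real
(non-dry-run) copy so the homedir isn't changing underneath rsync.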

** user audits

UC berkeley's new security standards require quarterly audits of
non-affiliated accounts...  this will only impact a few people on this
list, but i'll need to work w/campus and our department on solutions for
this other than local accounts on the servers.

a LOT is going to happen, and i'm meeting w/my team today and will come up
w/a basic plan.  we will definitely experience downtime during this, but i
cannot guess as to what that will look like.

this might also be a good time to talk about the future of the build
system, auditing our builds (do we need SBT?), or even finally getting
around to dockerizing everything  so i don't need such a fragile and
non-atomic set of worker nodes specifically for spark.

thoughts?  comments?

shane

ps -- this is one of the reasons why i haven't been around much lately...
it's been really tough keeping things up to date while trying to remotely
train up one of my sysadmins to take over some of my build system duties.
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
