please read: current state and the future of the apache spark build system

shane knapp ☠ Wed, 07 Apr 2021 11:39:08 -0700

this will be a relatively big update, as there are many many moving pieces
with short, medium and long term goals.


TLDR1:  we're shutting jenkins down at the end of 2021.

TLDR2:  i know we're way behind on pretty much everything.  most of the
hardware is at or beyond EOL, and systemic build failures (like
k8s/minikube) keep popping up at random.  i've had to restrict access due
to new campus policies, and i will be sorting that out shortly, but only
for a few contributors.

long term (until EOY):
* decide what the future of spark builds and releases will look like
  - do we need jenkins?
  - if we do, who's responsible for hosting + ops?
* we will permanently shut down amplab jenkins by the end of 2021
  - uc berkeley has funded this for over 10 years, and both the funds and
staff (only me, for 7 years) are going away.  i'm staying at cal, but have
a much different job now.  :)

medium term (in 6 months):
* prepare jenkins worker ansible configs and stick in the spark repo
  - nothing fancy, but enough to config ubuntu workers
  - could be used to create docker containers for testing in
<wavey-hands>THE CLOUD</wavey-hands>
* train up brian shiratsuki (cced) to help w/ops tasks and upgrades over
the next ~6m
* work through the backlog of jira requests (python versions, library
installation, etc)

short term (weeks):
* debug and figure out why minikube/k8s broke
  - https://issues.apache.org/jira/browse/SPARK-34738
  - i really could use some help here...
* bring up additional workers
  - finish hardware/system level repairs on the bare metal
  - see above, re k8s jira
* stabilize cluster
  - recent jenkins LTS upgrade broke the web GUI
  - finish deploying monitoring/alerting
  - this hardware is OLD and literally falling over, so we have lots of
random disk and ram failures.  it's whack-a-mole, and each trip to the
colo for repairs takes a full day

i'm only able to spend a few hours a week on the build system, so expect
random downtime, reboots, restarts, and testing.  we're testing new nodes
as we deploy, and hoping to fix anything before releasing them into the
wild, but some things might be flaky.

but the biggest question is what you all need w/regard to build
infrastructure...  and who's going to be responsible for it.

thanks for reading!  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
