+1 from the Heron team. Is there any chance we could work in/learn about build caching in this process? Full builds for Heron take several hours, so it'd be nice to speed them up.
We use Bazel to build; here are some details:
https://docs.bazel.build/versions/main/remote-caching.html

On Sat, Jan 29, 2022 at 1:27 PM Chris Lambertus <c...@apache.org> wrote:

> There is no timeline and certainly no design doc. We have funding, but
> little in-house Infra experience with such an endeavor. We are looking for
> a community champion with experience in this area to help us design a
> solution.
>
> Our funding is in AWS, so yes, we could provide IAM access to specific
> services once we get a general idea of the type of solution we want to
> provide.
>
> Short term initiative:
>
> - develop a process for deploying 'on demand' build resources within
> Jenkins via EC2
> - allow for the use of GPU nodes
> - figure out how to track usage and constrain spending within the funding
> limit
> - figure out how to deal with security push credentials (nexus, nightlies,
> dockerhub, etc.)
>
> Longer-term
>
> - provide EKS/ECS integration where appropriate
>
> The simplest case here would be for builds which are already containerized
> (e.g. don't require Infra-deployed dependencies), as we could deploy a
> "bare metal" AMI. If we needed to add the large number of tools Infra
> maintains, creating and updating the AMI would be quite cumbersome. This is
> something that will need to be sorted out if we are to roll out
> general-purpose build nodes 'on-demand'.
>
> Here are some points of note from the thread so far:
>
> - Amazon EC2 Plugin for Jenkins can help
> - GPU nodes desired by some projects
> - Use of auto-scaling groups rather than containers
>
> Projects interested in contributing to setup/design:
>
> - SystemDS
> - Airflow
> - Heron
>
> > On Jan 22, 2022, at 4:29 AM, Janardhan Pulivarthi
> > <janardhan.pulivar...@gmail.com> wrote:
> >
> > Hi Chris,
> >
> > At present we would want to use AWS for GPU instances for testing and
> > for building docker (gpu) images.
> >
> > Is there any timeline or design doc?
> >
> > How does the quota work for projects?
> > Would you like to provide IAM accounts with specific services in need
> > for a project?
> >
> > Thanks and Regards,
> > Janardhan
> >
> > On Sat, Jan 1, 2022 at 12:19 AM Allen Wittenauer
> > <a...@effectivemachines.com.invalid> wrote:
> >>
> >>> On Dec 30, 2021, at 10:58 AM, Chris Lambertus <c...@apache.org> wrote:
> >>>
> >>> Hi folks,
> >>>
> >>> We have some funding to explore providing ephemeral builds via ECS or
> >>> EKS in the Amazon ecosystem, but Infra does not have expertise in this
> >>> area. We would like to integrate such a service with Jenkins.
> >>>
> >>> Does anyone have experience with using these services for CI, and
> >>> would you be interested in assisting Infra in developing a prototype?
> >>>
> >>> Additionally, we may be able to provide some build nodes with GPUs. Do
> >>> we have projects which could/would make use of GPUs for integration
> >>> testing?
> >>
> >> At $DAYJOB, I configured the Amazon EC2 plug-in
> >> (https://plugins.jenkins.io/ec2) to do this type of thing using spot
> >> instances with labels tied to the particular EC2 node type that our jobs
> >> use. I avoided using the EC2 Fleet plug-in
> >> (https://plugins.jenkins.io/ec2-fleet) mainly because it always seemed to
> >> keep at least one node running, which is not really what you want to get
> >> the most bang for your buck. In other words, startup time is less
> >> important to me than not having a node run idle all weekend.
> >>
> >> Biggest issues we’ve hit with this setup are:
> >>
> >> a) Depending upon your spot price, you may get outbid and the node gets
> >> killed out from underneath you (rarely happens but it does happen with our
> >> bid)
> >>
> >> b) You need to know ahead of time what types of nodes you want to
> >> allocate and then set a label to match. For the ASF, that might be tricky
> >> given a lot of people have no idea what the actual requirements for their
> >> jobs are.
> >>
> >> c) During a Jenkins restart on rare occasions, the plug-in will ‘lose
> >> track’ of allocated nodes. We have limits for how long our allocations will
> >> last based on # of runs and idle time so generally can spot a ‘stuck’ node
> >> after a day or so.
> >>
> >> I haven’t tried configuring it to use EKS because none of our stuff needs
> >> k8s yet.
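
For anyone picking up the design work, here is a rough sketch of the kind of limits Allen describes in (c): retire a node based on idle time and number of runs, and flag anything still alive after about a day as possibly 'stuck'. This is plain Python for illustration only, not the EC2 plug-in's API. The `Node` record, the function names, and the 30-minute and 20-run defaults are all made up for the example; only the "a day or so" window comes from the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Node:
    """Hypothetical record of an on-demand build node (not a plug-in type)."""
    name: str
    launched_at: datetime
    last_busy_at: datetime
    runs_completed: int


def should_retire(node, now, max_idle=timedelta(minutes=30), max_runs=20):
    """True once the node has idled too long or served enough builds.

    The default limits are assumptions chosen for the example.
    """
    idle_for = now - node.last_busy_at
    return idle_for >= max_idle or node.runs_completed >= max_runs


def find_stuck(nodes, now, max_age=timedelta(days=1)):
    """Names of nodes past their limits that are still alive after max_age."""
    return [
        n.name
        for n in nodes
        if now - n.launched_at >= max_age and should_retire(n, now)
    ]
```

Something like `find_stuck` could run on a schedule against whatever inventory of running instances is available, so a node the controller lost track of during a restart still gets noticed.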