Re: ephemeral builds via AWS ECS and/or EKS? GPU Nodes?

Josh Fischer Sat, 29 Jan 2022 12:59:58 -0800

+1 from the Heron team.  Is there any chance we could work in/learn about
build caching in this process?  Full builds for Heron take several hours,
so it'd be nice to speed them up.


We use Bazel to build, here are some details:
https://docs.bazel.build/versions/main/remote-caching.html
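
For illustration only, a minimal .bazelrc sketch of a remote-cache setup.
The cache URL is a placeholder rather than a real endpoint, and the "ci"
config name is an assumption; the flags themselves are the ones described
in the Bazel remote caching docs linked above.

```text
# Read from a shared HTTP cache on every build (placeholder URL).
build --remote_cache=https://bazel-cache.example.invalid

# By default, do not upload locally built results to the shared cache.
build --remote_upload_local_results=false

# Only trusted CI builds (run with --config=ci) populate the cache.
build:ci --remote_upload_local_results=true
```

With something like this, ephemeral build nodes would start with an empty
disk but could still reuse action results from earlier builds.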

On Sat, Jan 29, 2022 at 1:27 PM Chris Lambertus <c...@apache.org> wrote:

> There is no timeline and certainly no design doc. We have funding, but
> little in-house Infra experience with such an endeavor. We are looking for
> a community champion with experience in this area to help us design a
> solution.
>
> Our funding is in AWS, so yes, we could provide IAM access to specific
> services once we get a general idea of the type of solution we want to
> provide.
>
> Short term initiative:
>
> - develop a process for deploying 'on demand' build resources within
> Jenkins via EC2
> - allow for the use of GPU nodes
> - figure out how to track usage and constrain spending within the funding
> limit
> - figure out how to deal with security push credentials (nexus, nightlies,
> dockerhub, etc.)
>
> Longer-term
>
> - provide EKS/ECS integration where appropriate
>
> The simplest case here would be for builds which are already containerized
> (e.g. don't require Infra-deployed dependencies), as we could deploy a
> "bare metal" AMI. If we needed to add the large number of tools Infra
> maintains, creating and updating the AMI would be quite cumbersome. This is
> something that will need to be sorted out if we are to roll out
> general-purpose build nodes 'on-demand'.
>
> Here are some points of note from the thread so far:
>
> - Amazon EC2 Plugin for Jenkins can help
> - GPU nodes desired by some projects
> - Use of auto-scaling groups rather than containers
>
> Projects interested in contributing to setup/design:
>
> - SystemDS
> - Airflow
> - Heron
>
>
>
>
> > On Jan 22, 2022, at 4:29 AM, Janardhan Pulivarthi <
> janardhan.pulivar...@gmail.com> wrote:
> >
> > Hi Chris,
> >
> > At present we would want to use AWS for GPU instances for testing and
> > for building docker (gpu) images.
> >
> > Is there any timeline or design doc?
> >
> > How does the quota work for projects?
> > Would you like to provide iam accounts with specific services in need
> > for a project?
> >
> > Thanks and Regards,
> > Janardhan
> >
> > On Sat, Jan 1, 2022 at 12:19 AM Allen Wittenauer
> > <a...@effectivemachines.com.invalid> wrote:
> >>
> >>
> >>
> >>> On Dec 30, 2021, at 10:58 AM, Chris Lambertus <c...@apache.org> wrote:
> >>>
> >>> Hi folks,
> >>>
> >>> We have some funding to explore providing ephemeral builds via ECS or
> EKS in the Amazon ecosystem, but Infra does not have expertise in this
> area. We would like to integrate such a service with Jenkins.
> >>>
> >>> Does anyone have experience with using these services for CI, and
> would you be interested in assisting Infra in developing a prototype?
> >>>
> >>> Additionally, we may be able to provide some build nodes with GPUs. Do
> we have projects which could/would make use of GPUs for integration testing?
> >>
> >>
> >> At $DAYJOB, I configured the Amazon EC2 plug-in (
> https://plugins.jenkins.io/ec2 ) to do this type of thing using spot
> instances with labels tied to the particular EC2 node type that our jobs
> use.  I avoided using the EC2 Fleet plug-in  (
> https://plugins.jenkins.io/ec2-fleet ) mainly because it always seemed to
> keep at least one node running which is not really what you want to get the
> most bang for your buck. In other words, startup time is less important to
> me than having a node run idle all weekend.
> >>
> >> Biggest issues we’ve hit with this setup are:
> >>
> >> a) Depending upon your spot price, you may get outbid and the node gets
> killed out from underneath you (rarely happens but it does happen with our
> bid)
> >>
> >> b) You need to know ahead of time what types of nodes you want to
> allocate and then set a label to match. For the ASF, that might be tricky
> given a lot of people have no idea what the actual requirements for their
> jobs are.
> >>
> >> c) During a Jenkins restart on rare occasions, the plug-in will ‘lose
> track’ of allocated nodes. We have limits for how long our allocations will
> last  based on # of runs and idle time so generally can spot a ‘stuck’ node
> after a day or so.
> >>
> >> I haven’t tried configuring it to use EKS because none of our stuff needs
> k8s yet.
>
>

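The check described in point (c) of the quoted message, limits on run count
and idle time used to spot a 'stuck' node, can be sketched roughly as follows.
This is a minimal Python illustration under stated assumptions: the Node
record, its field names, and the threshold values are all hypothetical, and
it does not reflect how the Amazon EC2 plug-in tracks nodes internally.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Node:
    """Hypothetical record for one allocated build node."""
    name: str
    launched_at: datetime
    last_busy_at: datetime
    runs: int


def find_stuck_nodes(nodes, now,
                     max_idle=timedelta(minutes=30),
                     max_runs=20,
                     max_age=timedelta(hours=24)):
    """Return names of nodes that are still allocated past any limit.

    A node that exceeds its idle time, run count, or age limit should
    already have been terminated, so finding one suggests the controller
    lost track of it. All thresholds are illustrative defaults.
    """
    stuck = []
    for node in nodes:
        idle = now - node.last_busy_at
        age = now - node.launched_at
        if idle > max_idle or node.runs >= max_runs or age > max_age:
            stuck.append(node.name)
    return stuck


if __name__ == "__main__":
    now = datetime(2022, 1, 3, 12, 0)
    nodes = [
        Node("fresh", now - timedelta(hours=1), now - timedelta(minutes=5), 3),
        Node("idle", now - timedelta(hours=5), now - timedelta(hours=2), 3),
    ]
    print(find_stuck_nodes(nodes, now))
```

A periodic job could feed this a list of currently running instances and
flag whatever it returns for manual review instead of terminating it
automatically.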