Hi Joan,

I’m adding CC: vp-infra for visibility.

The reason Infra aggressively prunes the Docker cache on the build nodes is that 
projects generally do not clean up after their builds, and the nodes run out of 
disk space as images accumulate. Our donated general-purpose build hardware is 
limited, but this automated cache purging could be removed from 
project-sponsored nodes if the project were willing to manage its own disk 
usage.

Some notes:

- Infra supports credentialed builds using secrets in a variety of ways.
- Infra really wants to avoid per-project secret escrow.
- As far as I am aware, the ASF does not have a relationship with Docker Hub.
- I’m skeptical that our builds are pulling that many Docker images per node 
within a six-hour window (we have ~50 nodes).
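
As a rough sanity check on that last point, here is a minimal Python sketch of 
the arithmetic. The 100-pulls-per-six-hours figure comes from the Docker notice 
quoted below, and the node count is the approximate ~50 mentioned above. 
Treating the limit as counted separately for each node is an assumption, and 
the function names are only illustrative.

```python
# Figures taken from the Docker notice and from this thread.
ANON_PULLS_PER_WINDOW = 100  # anonymous limit per six-hour window
WINDOW_HOURS = 6
NODE_COUNT = 50              # "~50 nodes", approximate


def pull_budget(limit=ANON_PULLS_PER_WINDOW, window_hours=WINDOW_HOURS,
                nodes=NODE_COUNT):
    """Return the per-node hourly budget and the fleet-wide ceilings.

    ASSUMPTION: the limit is counted separately for each node (for example,
    because each node has its own egress IP). This is not confirmed here.
    """
    windows_per_day = 24 // window_hours
    return {
        "per_node_per_hour": limit / window_hours,
        "fleet_per_window": limit * nodes,
        "fleet_per_day": limit * nodes * windows_per_day,
    }


def builds_per_node_before_limit(pulls_per_build, limit=ANON_PULLS_PER_WINDOW):
    """Number of builds one node can run in a window before hitting the limit."""
    return limit // pulls_per_build


if __name__ == "__main__":
    print(pull_budget())
    print(builds_per_node_before_limit(pulls_per_build=3))
```

Under these assumptions, a build that pulls 3 images could run about 33 times 
on one node within a single window before reaching the anonymous limit. Real 
usage metrics would replace these guesses.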

I took a quick Google tour of open source Docker registries, and there are a 
few that we could potentially deploy, but without any kind of metrics available 
to us, I’m inclined to let the situation evolve. Presuming Docker Hub enforces 
these limits, I would hope that usage metrics would become available, and we 
could then decide how best to approach a solution.

Joan, by CC:ing vp-infra I have brought this topic to the attention of both 
Infra and the Board. Thank you for your assessment and concern. I hope this 
does not become a major issue before we can address it.

-C



> On Oct 28, 2020, at 9:33 PM, Dave Fisher <wave4d...@comcast.net> wrote:
> 
> Hi Joan,
> 
> I was vaguely concerned when I got that email.
> 
> Thank you for sounding the alarm and presenting a reasonable choice of 
> actions.
> 
> The idea of an Apache Infra Docker repo is great! (Maybe dist.apache.org 
> works?) A significant amount of time in Jenkins is spent building images 
> which don’t change often. (Have we avoided proper understanding of the cache 
> pattern?)
> 
> Best Regards and hope you are well,
> Dave
> 
> 
>> On Oct 28, 2020, at 9:02 PM, Joan Touzet <woh...@apache.org> wrote:
>> 
>> Got your attention?
>> 
>> Here's what arrived in my inbox around 4 hours ago:
>> 
>>> You are receiving this email because of a policy change to Docker products 
>>> and services you use. On Monday, November 2, 2020 at 9am Pacific Standard 
>>> Time, Docker will begin enforcing rate limits on container pulls for 
>>> Anonymous and Free users. Anonymous (unauthenticated) users will be limited 
>>> to 100 container image pulls every six hours, and Free (authenticated) 
>>> users will be limited to 200 container image pulls every six hours, when 
>>> enforcement is fully implemented. 
>> 
>> Their referenced blog posts are here:
>> 
>> https://www.docker.com/blog/scaling-docker-to-serve-millions-more-developers-network-egress/
>> 
>> https://www.docker.com/blog/understanding-inner-loop-development-and-pull-rates/
>> 
>> Since I haven't seen this discussed on the builds list yet (and I'm not
>> subscribed to users@infra), I wanted to make clear the impact. I would
>> bet that just about every workflow using Jenkins, buildbot, GHA or
>> otherwise uses uncredentialed `docker pull` commands. If you're using
>> the shared Apache CI workers, every pull you're making is counting
>> towards this 100 pulls/6 hour limit. Multiply that by every ASF project
>> on those servers, and multiply that again by the total number of PRs /
>> change requests / builds per project, and.... :(
>> 
>> Apache's going to hit these new limits real fast. And we must act fast
>> to avoid problems, as those new limits kick in **MONDAY**.
>> 
>> Even for those of us lucky enough to have sponsorship for dedicated CI
>> workers, it's still a problem. Infra has scripts to wipe all
>> not-currently-in-use Docker containers off of each machine every 24
>> hours (or did, last I looked). That means you can't rely on local
>> caching. Other projects may also have added --force to their `docker
>> pull` requests in their CI workflows, to work around issues with cached,
>> corrupted downloads (a big problem for us on the shared CI
>> infrastructure), or to work around issues with the :latest tag caching
>> when it shouldn't.
>> 
>> This extends beyond projects using CI in the way Docker outlines on
>> their second blog post linked above, namely their encouragement to use
>> multi-stage builds. If local caching can't be relied on, there's no
>> advantage. If what's being pulled down is an image containing that
>> project's full build environment - this is what CouchDB does and I
>> expect others do as well, as setting up our build environment, even
>> automated, takes 30-45 minutes - frequent changes to the build
>> dependencies require frequent pulls of those images, which cannot be
>> mitigated via the Docker-recommended multi-stage builds.
>> 
>> =====
>> 
>> Proposed solutions:
>> 
>> 1. Infra provides credentialed logins through the Docker Hub apache
>> organisation to projects. Every project would have to update their
>> Jenkins/buildbot/GHA/etc workflows to consume and use these credentials
>> for every `docker pull` command. This depends on Apache actually being
>> exempted from the new limits (I'm not sure, are we?) and those creds
>> being distributed widely...which may run into Infra Policy issues.
>> 
>> 2. Infra provides their own Docker registry. Projects that need images
>> can host them there. These will be automatically exempt. Infra will have
>> to plan for sufficient storage (this will get big *fast*) and bandwidth
>> (same). They will also have to firewall it off from non-Apache projects.
>> 
>> This should be configured as a pull through caching registry, so that
>> attempts to `docker pull docker.apache.org/ubuntu:latest` will
>> automatically reach out to hub.docker.com and store that image locally.
>> Infra can populate this registry with credentials within the ASF Docker
>> Hub org that are, hopefully, exempt from these requirements.
>> 
>> 3. Like #2, but per-project, on Infra-provided VMs. Today this is not
>> practical, as the standard Infra-provided VM only has ~20GB of local
>> storage. Just a handful of Docker images will eat that space nearly
>> immediately.
>> 
>> ===
>> 
>> I think #2 above is the most logical and expedient, but it requires a
>> commitment from Infra to make happen - and to get the message out - with
>> only 4 days until DOOM.
>> 
>> What does the list think? More importantly, what does Infra think?
>> 
>> -Joan "I'm gonna sing The Doom Song now!" Touzet
> 
