On Tue, Oct 26, 2021 at 04:55:08PM +0200, Philippe Mathieu-Daudé wrote:
> Hi,
>
> I guess I got very unlucky because I happened to pull the docker
> images from the mainstream registry in a very short time frame
> where they jumped back in time... Then I kept using them to run
> my tests during 1 week trying to understand why I was having odd
> build failures. Then I realized the Ubuntu docker images were out
> of sync. I pulled again and it was working, so I searched for the
> mainstream job producing the outdated images and found a pipeline
> pushing the 'stable-6.0-staging' branch. This branch doesn't
> contain the recent gitlab-ci and Dockerfile changes...
>
> Similarly, this branch doesn't contain commit eafadbbbac0
> ("gitlab: only let pages be published from default branch") so
> outdated documentation got pushed for a short time.
>
> This patch won't fix branches pushed from the past, but at least
> it should avoid reproducing this problem in the future.
>
> Any idea how to improve the GitLab infrastructure to avoid these
> kinds of problems in the future? Is it possible to enforce
> restrictions from the project configuration, rather than the
> repository YAML file?
The problem here is the general way we are handling docker images in
the CI pipeline. In the first stage of the pipeline we build and
publish images with a tag "$reponame/$imagename:latest". We need to
publish them because the next stage of the pipeline has to be able to
pull the just-built images - stages can't directly inherit docker
images from earlier stages without going via the registry.

The use of ":latest" here is the root cause of our problems. It only
works under these assumptions:

 - you have a single pipeline running for any repository at any point
   in time

 - you only need a single maintenance branch (master)

The first assumption was never really true. In the main qemu-project
namespace we sometimes end up with pipelines running on 'master' for a
just-pushed series, and on 'staging' for a series being tested for
merge. Most of the time we get away with this, but we've seen a few
rare CI failures from it. Similarly in user forks we get away with it
most of the time, but sometimes users might quickly push to different
branches. In all cases the race is only a problem when a patch series
contains dockerfile changes, which is why it is such a rare problem.

So far we've only been relying on the CI pipeline for pushes to
master, since stable branch maintenance has been AWOL for a while.
We're soon going to start violating the second assumption though.

IOW, we can't keep relying on ":latest", as it's going to cause
increasing trouble.

The reason we picked ":latest" though is that the gitlab CI config is
a bit inflexible:

 - The image tag has to be either a fixed string, or the contents of a
   standard environment variable present in gitlab:

     https://docs.gitlab.com/ee/ci/variables/predefined_variables.html

   We can't populate our own env variables dynamically, nor
   programmatically define a tag.

 - The tag contents have to satisfy the Docker restrictions on valid
   characters in tags:

     "A tag name must be valid ASCII and may contain lowercase and
      uppercase letters, digits, underscores, periods and dashes. A
      tag name may not start with a period or a dash and may contain
      a maximum of 128 characters."

Essentially we need two tags:

 - One that is only for use during execution of the current pipeline

 - One that is published only on success of the pipeline on master, to
   serve as a cache for future pipelines, or to let developers pull

Notably the naming rules for the latter are more restrictive than the
rules for git branch names. We could assume users always have
"sensible" branch names that are less than 128 chars and use only
alphanumeric characters plus dash/underscore. This would be fine for
my personal branch naming, but I wonder if anyone uses weird branch
names that would violate the docker tag name rules? Perhaps we just
accept that risk and have the CI job that builds the container fail
and print a clear message telling the user to rename their branch to
something sensible.

Ideally we would take a sha256 sum of the dockerfile (and the parent
layers it inherits from) and use that as the image tag. That fails on
constraint number 1 though - we can't programmatically set tags. It
would maximise our cache hit rate though, compared to the single
"latest" tag we publish today.

The other alternative here is to use the current git commit hash as
the image tag, since that is a valid docker tag name and uniquely
identifies the dockerfile contents, even if multiple pipelines run for
the same commit. An even more unique option is to use the unique
pipeline ID as the tag.
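To make the race concrete, the current flow is roughly shaped like the
sketch below. This is deliberately over-simplified and hypothetical -
the real templates go through extra helper scripts, pull the
previously published image to reuse as a build cache, and cover many
images; the job names, stage names and dockerfile path here are only
illustrative:

  stages:
    - containers
    - builds

  # stage 1: build the image and publish it under the shared ":latest" tag
  amd64-fedora-container:
    stage: containers
    image: docker:stable
    services:
      - docker:dind
    script:
      - |
        docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
        docker build -t "$CI_REGISTRY_IMAGE/qemu/fedora:latest" \
            -f tests/docker/dockerfiles/fedora.docker tests/docker
        docker push "$CI_REGISTRY_IMAGE/qemu/fedora:latest"

  # stage 2: run the build inside whatever ":latest" happens to point at
  # right now. If another pipeline pushed its own image in the meantime,
  # this is not necessarily the image that stage 1 of *this* pipeline
  # just built.
  build-system-fedora:
    stage: builds
    image: $CI_REGISTRY_IMAGE/qemu/fedora:latest
    script:
      - mkdir build && cd build && ../configure && make

Any pipeline that pushes ":latest" between those two stages - another
branch in the same user fork, or 'staging' vs 'master' in the main
repo - silently swaps the image out from under the build jobs.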
Using either the commit hash or the pipeline ID will lead to an
explosion in the number of tags present in the repository's container
registry. GitLab has the ability to periodically clean up old tags,
but on all my repos this appears to be disabled by default. We could
manually delete the tag at the end of the pipeline, but that causes
trouble if multiple pipelines for the same commit hash are running
concurrently, as the tag might still be needed by something later.
Also it means the developer can't easily reproduce problems using the
*exact* image the pipeline used.

Maybe we put something at the start of the pipeline that manually
deletes any obsolete tags (eg > 7 days old) from previous pipelines.
This re-invents gitlab tag cleanup, in a way that doesn't require
every developer to toggle a setting in their repo.

So my overall inclination is:

 - Modify ".container_job_template"

    - If an image tag matching the sha256 sum of the dockerfile(s)
      exists:

       - Publish a new tag based on $CI_PIPELINE_IID, copying the
         sha256 tag

    - Else:

       - Build a new image from scratch, no caching
       - Publish a new tag based on the sha256 sum of the
         dockerfile(s)
       - Publish a new tag based on $CI_PIPELINE_IID

 - Modify ".native_build_job_template"

    - Use $CI_REGISTRY_IMAGE/qemu/$IMAGE:$CI_PIPELINE_IID

 - Add a final job

    - Publish a new tag ":latest" based on the image built earlier,
      *if* this branch is "master" and all jobs succeeded

      Possibly, we could add ":master" and ":stable-XXX" too as
      convenience tags?

 - Add an early job

    - Delete all $CI_PIPELINE_IID based tags older than 7 days
    - Delete all sha256 sum tags older than 7 days *provided* the tag
      does not match the current sha256 content

(A rough sketch of what the template changes might look like is at the
end of this mail.)

In the common case I believe this will make our pipelines faster,
because publishing the $CI_PIPELINE_IID based copy of the sha256 tag
can be done without even pulling down the image - it is a pure
metadata operation from the docker registry's POV. Our current
caching, by contrast, needs to pull down existing images to see if
their content is usable, which takes time.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
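Here is the rough, untested sketch referred to above. It assumes a few
things that go beyond what exists today: that the container job runs
in an image providing both docker and skopeo (skopeo is used for the
registry-side tag copy; plain docker would have to pull/tag/push
instead), that a truncated sha256 of the dockerfile is an acceptable
content key, that the per-image $NAME / $IMAGE variables and
dockerfile paths follow the existing templates (concrete per-image
jobs would extend these templates, as today), and that suitable
'containers' / 'builds' / 'finalize' stages are defined:

  .container_job_template:
    stage: containers
    # needs a job image that provides both docker and skopeo
    services:
      - docker:dind
    script:
      - |
        REPO="$CI_REGISTRY_IMAGE/qemu/$NAME"
        CREDS="$CI_REGISTRY_USER:$CI_REGISTRY_PASSWORD"
        # hypothetical content key: truncated checksum of the dockerfile;
        # a real version would also fold in the parent layers it inherits
        CONTENT=$(sha256sum "tests/docker/dockerfiles/$NAME.docker" | cut -c1-12)
        docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
        if skopeo inspect --creds "$CREDS" "docker://$REPO:$CONTENT" >/dev/null 2>&1
        then
            # content unchanged: copy the existing tag to a pipeline-scoped
            # tag. The blobs are already in this registry, so this is close
            # to a pure metadata operation - nothing is pulled down.
            skopeo copy --src-creds "$CREDS" --dest-creds "$CREDS" \
                "docker://$REPO:$CONTENT" "docker://$REPO:$CI_PIPELINE_IID"
        else
            # content changed (or never built): rebuild with no cache and
            # publish both the content-keyed and the pipeline-scoped tag
            docker build --no-cache \
                -t "$REPO:$CONTENT" -t "$REPO:$CI_PIPELINE_IID" \
                -f "tests/docker/dockerfiles/$NAME.docker" tests/docker
            docker push "$REPO:$CONTENT"
            docker push "$REPO:$CI_PIPELINE_IID"
        fi

  .native_build_job_template:
    stage: builds
    # consume the image tagged by *this* pipeline, not a shared ":latest"
    image: $CI_REGISTRY_IMAGE/qemu/$IMAGE:$CI_PIPELINE_IID

  # one such job per image (or a loop over all images): promote the
  # pipeline-scoped tag to ":latest", only on master and only once all
  # earlier stages have succeeded
  publish-latest-containers:
    stage: finalize
    rules:
      - if: '$CI_COMMIT_BRANCH == "master"'
    script:
      - |
        CREDS="$CI_REGISTRY_USER:$CI_REGISTRY_PASSWORD"
        skopeo copy --src-creds "$CREDS" --dest-creds "$CREDS" \
            "docker://$CI_REGISTRY_IMAGE/qemu/$NAME:$CI_PIPELINE_IID" \
            "docker://$CI_REGISTRY_IMAGE/qemu/$NAME:latest"

The early cleanup job is left out of the sketch: it would use the
GitLab container registry API to delete the $CI_PIPELINE_IID and
sha256 based tags left behind by pipelines older than about 7 days.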