Hi,

On February 26th GitLab CI started failing many jobs because they could not be scheduled. I've been unable to merge pull requests because the CI is not working.

Here is an example failed job:
https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

One issue seems to be that the gitlab-cache-pvc PVC is ReadWriteOnce and Pods scheduled on new nodes therefore cannot start until existing Pods running on another node complete, causing gitlab-runner timeouts and failed jobs.

When trying to figure out how the Digital Ocean Kubernetes cluster is configured I noticed that the digitalocean-runner-manager-gitlab-runner ConfigMap created on 2024-12-03 does not match qemu.git's scripts/ci/gitlab-kubernetes-runners/values.yaml.

Here is the diff:

--- /tmp/upstream.yaml 2025-03-01 12:47:40.495216401 +0800
+++ /tmp/deployed.yaml 2025-03-01 12:47:38.884216210 +0800
@@ -9,6 +9,7 @@
  [runners.kubernetes]
  poll_timeout = 1200
  image = "ubuntu:20.04"
+ privileged = true
  cpu_request = "0.5"
  service_cpu_request = "0.5"
  helper_cpu_request = "0.25"
@@ -18,5 +19,6 @@
  name = "docker-certs"
  mount_path = "/certs/client"
  medium = "Memory"
- [runners.kubernetes.node_selector]
- agentpool = "jobs"
+ [[runners.kubernetes.volumes.pvc]]
+ name = "gitlab-cache-pvc"
+ mount_path = "/cache"

The cache PVC appears to be a manual addition made to the running cluster but not committed to qemu.git.

I don't understand why the problems only started surfacing now. Maybe a recent .gitlab-ci.d/ change changed how the timeout behaves or maybe the gitlab-runner configuration that enables the cache PVC simply wasn't picked up by the gitlab-runner Pod until February 26th?

In the short term I made a manual edit to the ConfigMap removing gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs are at least running now, although they may take longer due to the lack of cache.

In the long term maybe we should deploy minio (https://github.com/minio/minio) or another Kubernetes S3-like service so gitlab-runner can properly use a global cache without ReadWriteOnce limitations?
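
To make the minio idea more concrete, here is a rough sketch of what the runner configuration might look like with gitlab-runner's S3 cache backend instead of the PVC mount. This is untested and only an illustration: the service address, bucket name, and credentials below are placeholders I made up, not values from the running cluster, and the fragment would still need to be merged into the runner config in values.yaml.

```toml
[[runners]]
  [runners.cache]
    Type = "s3"
    # Share cache archives between runners rather than keeping them per-runner.
    Shared = true
    [runners.cache.s3]
      # Hypothetical in-cluster minio Service address; adjust to the real one.
      ServerAddress = "minio.example.svc.cluster.local:9000"
      # Hypothetical bucket name; the bucket must already exist in minio.
      BucketName = "gitlab-runner-cache"
      # Placeholder credentials; in practice these should come from a Secret.
      AccessKey = "CHANGE_ME"
      SecretKey = "CHANGE_ME"
      # Assumes minio is reached over plain HTTP inside the cluster.
      Insecure = true
```

With something like this, job Pods would fetch and upload cache archives over the network, so no Pod would need to attach the ReadWriteOnce volume and scheduling on a new node would no longer wait on Pods running elsewhere.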

Since I don't know the details of how the Digital Ocean Kubernetes cluster was configured for gitlab-runner, I don't want to make too many changes without your input. Please let me know what you think.

Stefan