Hi,
On February 26th GitLab CI started failing many jobs because they
could not be scheduled. I've been unable to merge pull requests
because the CI is not working.

Here is an example failed job:
https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

One issue seems to be that the gitlab-cache-pvc PVC is ReadWriteOnce,
so it can only be mounted by a single node at a time. Pods scheduled
on new nodes therefore cannot start until existing Pods running on
another node complete, causing gitlab-runner timeouts and failed jobs.

When trying to figure out how the DigitalOcean Kubernetes cluster is
configured I noticed that the
digitalocean-runner-manager-gitlab-runner ConfigMap created on
2024-12-03 does not match qemu.git's
scripts/ci/gitlab-kubernetes-runners/values.yaml. Here is the diff:

--- /tmp/upstream.yaml    2025-03-01 12:47:40.495216401 +0800
+++ /tmp/deployed.yaml    2025-03-01 12:47:38.884216210 +0800
@@ -9,6 +9,7 @@
   [runners.kubernetes]
     poll_timeout = 1200
     image = "ubuntu:20.04"
+    privileged = true
     cpu_request = "0.5"
     service_cpu_request = "0.5"
     helper_cpu_request = "0.25"
@@ -18,5 +19,6 @@
     name = "docker-certs"
     mount_path = "/certs/client"
     medium = "Memory"
-  [runners.kubernetes.node_selector]
-    agentpool = "jobs"
+  [[runners.kubernetes.volumes.pvc]]
+    name = "gitlab-cache-pvc"
+    mount_path = "/cache"

The cache PVC appears to be a manual addition made to the running
cluster but not committed to qemu.git. I don't understand why the
problems only started surfacing now. Maybe a recent .gitlab-ci.d/
change altered how the timeout behaves, or maybe the gitlab-runner
configuration that enables the cache PVC simply wasn't picked up by
the gitlab-runner Pod until February 26th?

In the short term I made a manual edit to the ConfigMap removing
gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
are at least running now, although they may take longer due to the
lack of cache.

In the long term maybe we should deploy minio
(https://github.com/minio/minio) or another Kubernetes S3-like service
so gitlab-runner can properly use a global cache without ReadWriteOnce
limitations?
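
As a rough sketch of what that could look like (untested, and not what
is deployed today), the runner config would gain an S3 cache section
roughly like the one below. The service address and bucket name are
made-up placeholders, and it assumes a MinIO instance is already
running inside the cluster with credentials supplied separately:

```toml
[[runners]]
  [runners.cache]
    # Use an S3-compatible object store instead of a PVC-backed /cache
    Type = "s3"
    # Let all runners share the same cache objects
    Shared = true
    [runners.cache.s3]
      # Hypothetical in-cluster MinIO endpoint, not an existing service
      ServerAddress = "minio.example.svc.cluster.local:9000"
      # Hypothetical bucket name; the bucket must be created beforehand
      BucketName = "gitlab-runner-cache"
      # Plain HTTP inside the cluster; drop this if TLS is configured
      Insecure = true
```

With a cache like this, jobs on any node fetch and upload cache
archives over the network, so no volume has to be attached to a
particular node.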

Since I don't know the details of how the DigitalOcean Kubernetes
cluster was configured for gitlab-runner I don't want to make too many
changes without your input. Please let me know what you think.

Stefan
