GitHub user priya369 created a discussion: Unexpected SIGTERM on Tasks (Airflow 2.10.5 on GKE with KubernetesExecutor and No Resource Constraints)

### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.10.5

### What happened?

We are running Airflow on a GKE cluster with the KubernetesExecutor. Our 
deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For 
the past month, we have been observing 20–30 tasks failing daily due to 
SIGTERM, even though the cluster is under no resource pressure (CPU, memory, 
and network usage stay well under 50%).
We initially suspected an issue with our Airflow version (2.9.2), so we 
upgraded to 2.10.5; however, the issue still persists.

[dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log](https://github.com/user-attachments/files/21493879/dag_id.PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id.scheduled__2025-07-06T07_15_00%2B00_00_task_id.GcsToSnowflakegold_top_content_editorial_attempt.1.log)
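
To quantify the failures, here is a minimal stdlib-only sketch of the tally we 
run over task logs. Assumptions (ours, not confirmed facts): the logs are 
synced locally from the `gs://` remote log folder, the directory layout is 
Airflow's default (one directory per task attempt), and the "Received SIGTERM" 
message text matches what our version emits:

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical local mirror of gs://airflow-cluster/airflow/logs
LOG_ROOT = Path("./airflow-logs")

# Task logs of affected attempts contain a "Received SIGTERM" line
SIGTERM_RE = re.compile(r"Received SIGTERM")

def count_sigterm_failures(root: Path) -> Counter:
    """Count, per task directory, how many attempt logs contain a SIGTERM line."""
    hits = Counter()
    for log_file in root.rglob("*.log"):
        if SIGTERM_RE.search(log_file.read_text(errors="ignore")):
            # Parent directory names encode dag_id/run_id/task_id in the
            # default log layout
            hits[log_file.parent.name] += 1
    return hits

if __name__ == "__main__":
    for task, n in count_sigterm_failures(LOG_ROOT).most_common():
        print(f"{n:4d}  {task}")
```

This is how we arrived at the 20–30/day figure; the affected tasks are spread 
across DAGs rather than concentrated in one.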

### What you think should happen instead?

Tasks should continue running unless they are explicitly cancelled, time out, 
or hit a genuine failure condition. A spontaneous SIGTERM from Airflow itself 
(without user action or resource pressure) should not happen.

### How to reproduce

1. Airflow config (Helm values excerpt):

```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
  - name: "AIRFLOW_GPL_UNIDECODE"
    value: "yes"
  - name: "AIRFLOW__WEBSERVER__BASE_URL"
    value: "*****************************************"
  - name: "AIRFLOW__CORE__LOAD_EXAMPLES"
    value: "False"
  - name: "AIRFLOW__CORE__LAZY_LOAD_PLUGINS"
    value: "False"
  - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
    value: "/opt/airflow/plugins"
  - name: "AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE"
    value: "True"
  - name: "AIRFLOW__CORE__AIRFLOW_HOME"
    value: "/opt/airflow"
  - name: "AIRFLOW__CORE__DAGS_FOLDER"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__CORE__PARALLELISM"
    value: "150"
  - name: "AIRFLOW__CORE__DEFAULT_POOL_TASK_SLOT_COUNT"
    value: "200"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"
    value: "60"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"
    value: "20"
  - name: "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME"
    value: "64000"
  - name: "AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT"
    value: "500"
  - name: "AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT"
    value: "600"
  - name: "AIRFLOW__CORE__ENABLE_XCOM_PICKLING"
    value: "True"
  - name: "AIRFLOW__CORE__TEST_CONNECTION"
    value: "Enabled"
  - name: "AIRFLOW__CORE__MAX_TEMPLATED_FIELD_LENGTH"
    value: "1000000"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE"
    value: "900"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW"
    value: "200"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE"
    value: "250"
  - name: "AIRFLOW__API__ENABLE_EXPERIMENTAL_API"
    value: "True"
  - name: "AIRFLOW__API__AUTH_BACKEND"
    value: "airflow.api.auth.backend.default"
  - name: "AIRFLOW__WEBSERVER__EXPOSE_CONFIG"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS"
    value: "True"
  - name: "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"
    value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
  - name: "AIRFLOW__LOGGING__LOGGING_LEVEL"
    value: "DEBUG"
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "gs://airflow-cluster/airflow/logs"
  - name: "AIRFLOW__KUBERNETES__DAGS_IN_IMAGE"
    value: "False"
  - name: "AIRFLOW__KUBERNETES__NAMESPACE"
    value: "airflow-data-eng-prod"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM"
    value: "room-prod-pvc-pvc"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME"
    value: "airflow-worker"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE"
    value: "50"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_QUEUED_CHECK_INTERVAL"
    value: "30"
  - name: "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION"
    value: "False"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"
    value: "300"
  - name: "AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE"
    value: "modified_time"
  - name: "AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL"
    value: "700"
  - name: "AIRFLOW__SCHEDULER__PARSING_PROCESSES"
    value: "5"
  - name: "AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC"
    value: "60"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC"
    value: "120"
  - name: "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL"
    value: "100"
  - name: "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL"
    value: "100"
  - name: "AIRFLOW__WEBSERVER__AUTHENTICATE"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__AUTH_BACKEND"
    value: "airflow.contrib.auth.backends.github_enterprise_auth"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__HOST"
    value: "github.com"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__OAUTH_CALLBACK_ROUTE"
    value: "/oauth-authorized/github"
```
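
One observation on the config above (our reading, not a confirmed cause): we 
raise `job_heartbeat_sec` to 60 and tune `zombie_detection_interval`, but we 
never set the zombie task threshold itself. If heartbeats are occasionally 
delayed past the default threshold (300 seconds in the `[scheduler]` section, 
if we read the defaults right), zombie detection would SIGTERM 
otherwise-healthy tasks. A candidate override we could test, assuming the 
Airflow 2.10 option name is `scheduler_zombie_task_threshold`:

```yaml
  # Assumption: Airflow 2.10 option name; default is 300 seconds.
  # Raising it gives slow-heartbeating tasks more headroom before
  # zombie detection terminates them.
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD"
    value: "900"
```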

### Operating System

Linux

### Versions of Apache Airflow Providers

2.10.5

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)

Executor:
KubernetesExecutor

Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)

Cluster Size & Scaling:
- ~20 nodes (n2-highmem-8)
- Autoscaling enabled, but the cluster consistently runs at <50% CPU/memory usage
- Nodes use standard node pools, not preemptible or spot instances

Airflow Deployment Mode:
- Deployed via Helm with a custom image (Python 3.10 base)
- Webserver and Scheduler run as separate pods
Task Workloads:
- ~1200 DAGs, mostly scheduled daily/hourly
- ~100–150 concurrent tasks at peak times
- Heavy usage of:
  - DatabricksSubmitRunOperator
  - SnowflakeOperator
  - ExternalTaskSensor
  - ShortCircuitOperator

### Anything else?

We have thoroughly verified that this is not caused by:
- Resource exhaustion (CPU, memory, or disk)
- Pod evictions or node preemptions
- Task timeouts or DAG-level retries
- Manual task terminations

We have also implemented DAG-level and task execution timeouts, and created a 
separate pool for Databricks and Snowflake tasks.

We suspect the issue may stem from one of the following:
- Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
- KubernetesExecutor edge cases where the scheduler or triggerer may initiate 
SIGTERM without a resource-based justification
- GCS FUSE or sidecar instability affecting the task pod lifecycle (though the 
main containers are healthy)

We are open to instrumenting Airflow internals with logging or tracing if the 
core team can suggest areas to probe (e.g., executor heartbeat, cleanup 
routines, orphan detection).
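
On the instrumentation offer: one cheap thing we can deploy ourselves is a 
traceback dump at SIGTERM time, so the remote log shows exactly what each task 
was doing when the signal arrived. A minimal stdlib-only sketch (where we wire 
this into the task callables is up to us; this is not Airflow API, just 
`faulthandler`):

```python
import faulthandler
import signal
import sys

def install_sigterm_traceback(stream=sys.stderr):
    """Dump Python tracebacks for all threads when SIGTERM is received.

    chain=True forwards the signal to any previously installed handler,
    so Airflow's own SIGTERM handling still runs afterwards.
    """
    faulthandler.register(signal.SIGTERM, file=stream, all_threads=True,
                          chain=True)
```

Calling `install_sigterm_traceback()` at the top of an affected task's Python 
callable should make the next spontaneous SIGTERM show up in the task log with 
a stack trace attached.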

We would also be happy to help test a patch or proposed fix in our 
production-like test cluster if needed.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


GitHub link: https://github.com/apache/airflow/discussions/62978
