GitHub user priya369 created a discussion: Unexpected SIGTERM on Tasks (Airflow 2.10.5 on GKE with KubernetesExecutor and No Resource Constraints)
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.10.5

### What happened?

We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster is under no resource pressure (CPU/memory/network usage stays well under 50%). We initially suspected an issue with the Airflow version (2.9.2), so we upgraded to Airflow 2.10.5; however, the issue still persists.

[dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log](https://github.com/user-attachments/files/21493879/dag_id.PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id.scheduled__2025-07-06T07_15_00%2B00_00_task_id.GcsToSnowflakegold_top_content_editorial_attempt.1.log)

### What you think should happen instead?

Tasks should continue running unless they are explicitly cancelled, time out, or hit a genuine failure condition. A spontaneous SIGTERM from Airflow itself (without user action or resource pressure) should not happen.

### How to reproduce

1. Airflow config:

```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
  - name: "AIRFLOW_GPL_UNIDECODE"
    value: "yes"
  - name: "AIRFLOW__WEBSERVER__BASE_URL"
    value: "*****************************************"
  - name: "AIRFLOW__CORE__LOAD_EXAMPLES"
    value: "False"
  - name: "AIRFLOW__CORE__LAZY_LOAD_PLUGINS"
    value: "False"
  - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
    value: "/opt/airflow/plugins"
  - name: "AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE"
    value: "True"
  - name: "AIRFLOW__CORE__AIRFLOW_HOME"
    value: "/opt/airflow"
  - name: "AIRFLOW__CORE__DAGS_FOLDER"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__CORE__PARALLELISM"
    value: "150"
  - name: "AIRFLOW__CORE__DEFAULT_POOL_TASK_SLOT_COUNT"
    value: "200"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"
    value: "60"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"
    value: "20"
  - name: "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME"
    value: "64000"
  - name: "AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT"
    value: "500"
  - name: "AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT"
    value: "600"
  - name: "AIRFLOW__CORE__ENABLE_XCOM_PICKLING"
    value: "True"
  - name: "AIRFLOW__CORE__TEST_CONNECTION"
    value: "Enabled"
  - name: "AIRFLOW__CORE__MAX_TEMPLATED_FIELD_LENGTH"
    value: "1000000"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE"
    value: "900"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW"
    value: "200"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE"
    value: "250"
  - name: "AIRFLOW__API__ENABLE_EXPERIMENTAL_API"
    value: "True"
  - name: "AIRFLOW__API__AUTH_BACKEND"
    value: "airflow.api.auth.backend.default"
  - name: "AIRFLOW__WEBSERVER__EXPOSE_CONFIG"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS"
    value: "True"
  - name: "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"
    value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
  - name: "AIRFLOW__LOGGING__LOGGING_LEVEL"
    value: "DEBUG"
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "gs://airflow-cluster/airflow/logs"
  - name: "AIRFLOW__KUBERNETES__DAGS_IN_IMAGE"
    value: "False"
  - name: "AIRFLOW__KUBERNETES__NAMESPACE"
    value: "airflow-data-eng-prod"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM"
    value: "room-prod-pvc-pvc"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME"
    value: "airflow-worker"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE"
    value: "50"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_QUEUED_CHECK_INTERVAL"
    value: "30"
  - name: "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION"
    value: "False"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"
    value: "300"
  - name: "AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE"
    value: "modified_time"
  - name: "AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL"
    value: "700"
  - name: "AIRFLOW__SCHEDULER__PARSING_PROCESSES"
    value: "5"
  - name: "AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC"
    value: "60"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC"
    value: "120"
  - name: "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL"
    value: "100"
  - name: "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL"
    value: "100"
  - name: "AIRFLOW__WEBSERVER__AUTHENTICATE"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__AUTH_BACKEND"
    value: "airflow.contrib.auth.backends.github_enterprise_auth"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__HOST"
    value: "github.com"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__OAUTH_CALLBACK_ROUTE"
    value: "/oauth-authorized/github"
```

### Operating System

linux

### Versions of Apache Airflow Providers

2.10.5

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

- Airflow Version: originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)
- Executor: KubernetesExecutor
- Orchestration Platform: Google Kubernetes Engine (GKE), Standard mode (not Autopilot)
- Cluster Size & Scaling:
  - ~20 nodes (n2-highmem-8)
  - Autoscaling enabled, but the cluster consistently runs at <50% CPU/memory usage
  - Nodes use standard node pools, not preemptible or spot instances
- Airflow Deployment Mode:
  - Deployed via Helm with a custom image (Python 3.10 base)
  - Webserver and scheduler run as separate pods
- Task Workloads:
  - ~1200 DAGs, mostly scheduled daily/hourly
  - ~100–150 concurrent tasks at peak times
  - Heavy usage of: DatabricksSubmitRunOperator, SnowflakeOperator, ExternalTaskSensor, ShortCircuitOperator

### Anything else?

We have thoroughly verified that this is not caused by:

- Resource exhaustion (CPU, memory, or disk)
- Pod evictions or node preemptions
- Task timeouts or DAG-level retries
- Manual task terminations

We have also set DAG-level and task execution timeouts, and created separate pools for the Databricks and Snowflake tasks.

We suspect the issue may stem from one of the following:

- Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
- KubernetesExecutor edge cases where the scheduler or triggerer may initiate a SIGTERM without a resource-based justification
- GCSFuse or sidecar instability possibly impacting the task pod lifecycle (though the main containers are healthy)

We are open to instrumenting Airflow internals with logging or tracing if the core team can suggest areas to probe (e.g., executor heartbeat, cleanup routines, orphan detection). We would also be happy to test a patch or proposed fix in our production-like test cluster if needed.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!
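As context for the zombie-detection hypothesis: with `job_heartbeat_sec` set to 60 as in the config above, only a few consecutive missed heartbeats (e.g., from database connection-pool contention) are needed to cross the scheduler's zombie threshold (`scheduler_zombie_task_threshold`, 300s by default) and get an otherwise healthy task SIGTERMed. The sketch below is illustrative only, not Airflow's actual code; the constant names mirror the config keys in this report:

```python
from datetime import datetime, timedelta

# Illustrative simplification of the timing rule behind Airflow's zombie
# detection: a running task whose job heartbeat is older than the threshold
# is treated as a zombie and killed. Not actual Airflow code.
JOB_HEARTBEAT_SEC = 60        # AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC (from this report)
ZOMBIE_THRESHOLD_SEC = 300    # scheduler_zombie_task_threshold default (assumed unchanged)

def is_zombie(last_heartbeat: datetime, now: datetime,
              threshold_sec: int = ZOMBIE_THRESHOLD_SEC) -> bool:
    """Return True when the heartbeat is stale enough to trigger a SIGTERM."""
    return (now - last_heartbeat) > timedelta(seconds=threshold_sec)

# With a 60s heartbeat interval, ~5 missed beats in a row are enough:
now = datetime(2025, 7, 6, 12, 0, 0)
print(is_zombie(now - timedelta(seconds=250), now))  # False: still within threshold
print(is_zombie(now - timedelta(seconds=301), now))  # True: would be SIGTERMed
```

This is why scheduler-log entries around each failure (and heartbeat latency against the metadata database) would be worth correlating with the SIGTERM timestamps.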
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

GitHub link: https://github.com/apache/airflow/discussions/62978
