The Problem
In distributed Airflow environments running multiple schedulers, the current
health endpoint contains a significant monitoring blind spot.
Currently, the health check determines the status of the scheduler by querying
the metadata database using the most_recent_job method found in job.py:
@provide_session
def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job |
None:
"""
Return the most recent job of this type, if any, based on last heartbeat
received.
Jobs in "running" state take precedence over others to make sure alive
job is returned if it is available.
:param job_type: job type to query for to get the most recent job for
:param session: Database session
:end_date: None
"""
return session.scalar(
select(Job)
.where(Job.job_type == job_type)
.order_by(
# Put "running" jobs at the front.
case({JobState.RUNNING: 0}, value=Job.state, else_=1),
Job.latest_heartbeat.desc(),
)
.limit(1)
)
This database query explicitly sorts records by the RUNNING state and applies
.limit(1), returning only a single, absolute newest job record.
This result is then processed in airflow_health.py via the get_airflow_health
endpoint method:
def get_airflow_health() -> dict[str, Any]:
"""Get the health for Airflow metadatabase, scheduler and triggerer."""
metadatabase_status = HEALTHY
latest_scheduler_heartbeat = None
latest_triggerer_heartbeat = None
latest_dag_processor_heartbeat = None
scheduler_status = UNHEALTHY
triggerer_status: str | None = UNHEALTHY
dag_processor_status: str | None = UNHEALTHY
try:
latest_scheduler_job = SchedulerJobRunner.most_recent_job()
if latest_scheduler_job:
if latest_scheduler_job.latest_heartbeat:
latest_scheduler_heartbeat =
latest_scheduler_job.latest_heartbeat.isoformat()
if latest_scheduler_job.is_alive():
scheduler_status = HEALTHY
except Exception:
metadatabase_status = UNHEALTHY
Because the health endpoint evaluates only the single job returned by
most_recent_job(), the check can only ever validate the health of one scheduler
at a time.
In a distributed deployment with multiple active schedulers, if even one
instance is running cleanly, the endpoint will flag as healthy even if all
other parallel scheduler instances have gone down.
To get meaningful information regarding the scheduler status from the health
endpoint it is worth it to monitor every scheduler in the distributed
environment instead of just a single scheduler.
The Proposed Solution
To deal with this problem we can add a new field called schedulers (plural for
multiple schedulers) in the health endpoint that returns a 3-tier aggregated
status that covers the following:
*
HEALTHY: All registered scheduler instances are fully operational and actively
heartbeating.
*
DEGRADED: At least one scheduler instance is down or failing, but at least one
remaining instance is still working.
*
DOWN: All scheduler instances have failed or stopped working.
Per-Instance Diagnostic Breakdown
We should also add a per instance breakdown as a nested list that will show the
following:
1.
hostname
2.
status: Individual status
3.
latest_heartbeat
Example
{
"metadatabase": {
"status": "healthy"
},
"scheduler": {
"scheduler_status": "healthy",
"latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
},
"schedulers": {
"status": "DEGRADED",
"instances": [
{
"hostname": "scheduler-ha-instance-1",
"status": "HEALTHY",
"latest_heartbeat": "2026-06-24T23:15:02+00:00"
},
{
"hostname": "scheduler-ha-instance-2",
"status": "DOWN",
"latest_heartbeat": "2026-06-24T23:10:14+00:00"
},
{
"hostname": "scheduler-ha-instance-3",
"status": "HEALTHY",
"latest_heartbeat": "2026-06-24T23:14:59+00:00"
}
]
}
}
Could end up looking roughly like this, resulting in a more meaningful health
endpoint that will make it easier to diagnose issues with the scheduler. This
is a LAZY CONSENSUS proposal.