Seems reasonable to me. Having both `scheduler` and `schedulers` is a bit odd, but I see the reasoning for back compat. We can eventually deprecate `scheduler`.
Is there any way we can get some executor state returned in this new data, since we're expanding it anyway?

Also, you tagged this as a [DISCUSS] thread in the subject but also proposed a lazy consensus. Let's maybe discuss it a bit and then propose a lazy consensus in another email thread.

Cheers,
Niko

On Wed, Jun 24, 2026 at 5:06 PM Jung-Hyun Kim <[email protected]> wrote:

> The Problem
>
> In distributed Airflow environments running multiple schedulers, the
> current health endpoint contains a significant monitoring blind spot.
> Currently, the health check determines the status of the scheduler by
> querying the metadata database using the most_recent_job method found in
> job.py:
>
> @provide_session
> def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job | None:
>     """
>     Return the most recent job of this type, if any, based on last
>     heartbeat received.
>
>     Jobs in "running" state take precedence over others to make sure alive
>     job is returned if it is available.
>
>     :param job_type: job type to query for to get the most recent job for
>     :param session: Database session
>     :end_date: None
>     """
>     return session.scalar(
>         select(Job)
>         .where(Job.job_type == job_type)
>         .order_by(
>             # Put "running" jobs at the front.
>             case({JobState.RUNNING: 0}, value=Job.state, else_=1),
>             Job.latest_heartbeat.desc(),
>         )
>         .limit(1)
>     )
>
> This database query explicitly sorts records by the RUNNING state and
> applies .limit(1), returning only a single, absolute newest job record.
> This result is then processed in airflow_health.py via the
> get_airflow_health endpoint method:
>
> def get_airflow_health() -> dict[str, Any]:
>     """Get the health for Airflow metadatabase, scheduler and triggerer."""
>     metadatabase_status = HEALTHY
>     latest_scheduler_heartbeat = None
>     latest_triggerer_heartbeat = None
>     latest_dag_processor_heartbeat = None
>
>     scheduler_status = UNHEALTHY
>     triggerer_status: str | None = UNHEALTHY
>     dag_processor_status: str | None = UNHEALTHY
>
>     try:
>         latest_scheduler_job = SchedulerJobRunner.most_recent_job()
>
>         if latest_scheduler_job:
>             if latest_scheduler_job.latest_heartbeat:
>                 latest_scheduler_heartbeat = latest_scheduler_job.latest_heartbeat.isoformat()
>             if latest_scheduler_job.is_alive():
>                 scheduler_status = HEALTHY
>     except Exception:
>         metadatabase_status = UNHEALTHY
>
> Because the health endpoint evaluates only the single job returned by
> most_recent_job(), the check can only ever validate the health of one
> scheduler at a time.
> In a distributed deployment with multiple active schedulers, if even one
> instance is running cleanly, the endpoint will flag as healthy even if all
> other parallel scheduler instances have gone down.
> To get meaningful information regarding the scheduler status from the
> health endpoint it is worth it to monitor every scheduler in the
> distributed environment instead of just a single scheduler.
>
> The Proposed Solution
>
> To deal with this problem we can add a new field called schedulers (plural
> for multiple schedulers) in the health endpoint that returns a 3-tier
> aggregated status that covers the following:
>
> * HEALTHY: All registered scheduler instances are fully operational and
>   actively heartbeating.
> * DEGRADED: At least one scheduler instance is down or failing, but at least
>   one remaining instance is still working.
> * DOWN: All scheduler instances have failed or stopped working.
>
> Per-Instance Diagnostic Breakdown
>
> We should also add a per instance breakdown as a nested list that will
> show the following:
>
> 1. hostname
> 2. status: Individual status
> 3. latest_heartbeat
>
> Example
>
> {
>   "metadatabase": {
>     "status": "healthy"
>   },
>   "scheduler": {
>     "scheduler_status": "healthy",
>     "latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
>   },
>   "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-3",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:14:59+00:00"
>       }
>     ]
>   }
> }
>
> Could end up looking roughly like this, resulting in a more meaningful
> health endpoint that will make it easier to diagnose issues with the
> scheduler.
>
> This is a LAZY CONSENSUS proposal.
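
To make the discussion concrete, here is a rough sketch of how the 3-tier roll-up in the proposal might be computed from per-instance liveness. It is only an illustration under stated assumptions, not Airflow code: `SchedulerInstance`, `aggregate_scheduler_health`, and the boolean `alive` flag are hypothetical names, and `alive` is assumed to come from something like the existing `is_alive()` check on each scheduler job. Reporting DOWN when no instances are registered is also an assumption the proposal does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass
class SchedulerInstance:
    """Hypothetical per-scheduler record; this is not an Airflow class."""

    hostname: str
    latest_heartbeat: datetime
    alive: bool  # assumed to be derived from a liveness check such as is_alive()


def aggregate_scheduler_health(instances: list[SchedulerInstance]) -> dict[str, Any]:
    """Roll per-instance liveness up into HEALTHY / DEGRADED / DOWN."""
    alive_count = sum(1 for inst in instances if inst.alive)

    if instances and alive_count == len(instances):
        status = "HEALTHY"  # every registered instance is heartbeating
    elif alive_count > 0:
        status = "DEGRADED"  # some instances are down, at least one is working
    else:
        status = "DOWN"  # assumption: an empty list is also reported as DOWN

    return {
        "status": status,
        "instances": [
            {
                "hostname": inst.hostname,
                "status": "HEALTHY" if inst.alive else "DOWN",
                "latest_heartbeat": inst.latest_heartbeat.isoformat(),
            }
            for inst in instances
        ],
    }


if __name__ == "__main__":
    # Mirrors the example payload from the proposal: one of three instances is down.
    example = [
        SchedulerInstance(
            "scheduler-ha-instance-1",
            datetime(2026, 6, 24, 23, 15, 2, tzinfo=timezone.utc),
            True,
        ),
        SchedulerInstance(
            "scheduler-ha-instance-2",
            datetime(2026, 6, 24, 23, 10, 14, tzinfo=timezone.utc),
            False,
        ),
        SchedulerInstance(
            "scheduler-ha-instance-3",
            datetime(2026, 6, 24, 23, 14, 59, tzinfo=timezone.utc),
            True,
        ),
    ]
    print(aggregate_scheduler_health(example)["status"])
```

Keeping the roll-up as a pure function over a list of records would leave the existing `scheduler` field untouched, and it would give a natural place to attach extra per-instance data (such as executor state) if we decide to include it.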
