Hey,
Agreeing with Dennis here, the proposal looks good. I'd suggest extending it to
cover the other components that can run as multiple instances in an Airflow
environment: the dag processor and the triggerer.
The primary goal should be to support environments running multiple instances
of the same component, regardless of whether the setup is multi-team. That
said, in multi-team environments, it would be valuable to surface team
ownership per component. For example, if a dag processor goes down, we should
be able to immediately identify which team is affected.
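
As a rough sketch of what per-component team ownership could enable: the
helper name below, and the rule that a team counts as affected when none of
its instances is HEALTHY, are my own assumptions and not part of the proposal.
The input mirrors the per-instance entries with "team_name" from Dennis's
example.

```python
from collections import defaultdict


def affected_teams(instances: list[dict]) -> list[str]:
    """Return the teams that have no HEALTHY instance of a component.

    Hypothetical helper: `instances` is a list of per-instance entries shaped
    like the proposed "instances" list, each with an optional "team_name".
    """
    healthy_count: dict[str, int] = defaultdict(int)
    for inst in instances:
        team = inst.get("team_name")
        if team is None:
            continue  # instance is not tied to a team (not a multi-team setup)
        # Adding 0 still registers the team, so fully-down teams are kept.
        healthy_count[team] += 1 if inst["status"] == "HEALTHY" else 0
    return sorted(team for team, count in healthy_count.items() if count == 0)


if __name__ == "__main__":
    example = [
        {"hostname": "scheduler-ha-instance-1", "status": "HEALTHY", "team_name": "team_1"},
        {"hostname": "scheduler-ha-instance-2", "status": "DOWN", "team_name": "team_2"},
    ]
    print(affected_teams(example))
```

The same function would work unchanged for dag processor or triggerer
instances, as long as they expose the same fields.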
Vincent
On 2026/06/25 16:53:13 "Ferruzzi, Dennis" wrote:
> Thanks for looking into this. I'm a strong +1 with a caveat (see below)
>
> One alarming side effect of the way this is calculated and reported is that
> in a multi-team environment, Team_1's scheduler may be down entirely and the
> current dashboard will report HEALTHY as long as at least one other team is
> live. I'm not sure if we can squeeze this in as a bug-fix because that seems
> like a gap we should fix. Teghveer just confirmed last night that the same
> calculation/reporting is being used for the Triggerers as well, so I am
> amending your proposal: whatever we decide here will also be applied to
> Triggerers. (Teghveer is willing to do that part of the work in parallel
> once we have a consensus.)
>
> That said, here's my opinion on the proposal:
>
> Caveat: I like the proposal, on the condition that the value reported by the
> existing "scheduler" is unchanged. We can (and should?) deprecate and remove
> that in a future release with instructions to move to using "schedulers"
> instead, but for now we can't break existing monitoring.
>
> Additional non-blocking suggestion: let's add the "team_name" to the
> individual scheduler schema since it's available:
>
> "schedulers": {
>   "status": "DEGRADED",
>   "instances": [
>     {
>       "hostname": "scheduler-ha-instance-1",
>       "status": "HEALTHY",
>       "team_name": "team_1",
>       "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>     },
>     {
>       "hostname": "scheduler-ha-instance-2",
>       "status": "DOWN",
>       "team_name": "team_2",
>       "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>     },
>     etc...
>   ]
> }
>
> - ferrruzi
> ________________________________
> From: Jung-Hyun Kim <[email protected]>
> Sent: Wednesday, June 24, 2026 5:01 PM
> To: [email protected] <[email protected]>
> Subject: [EXT] [DISCUSS] /api/v2/monitor/health endpoint does not give
> meaningful information into Scheduler statuses
>
> The Problem
> In distributed Airflow environments running multiple schedulers, the current
> health endpoint contains a significant monitoring blind spot.
> Currently, the health check determines the status of the scheduler by
> querying the metadata database using the most_recent_job method found in
> job.py:
>
> @provide_session
> def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job | None:
>     """
>     Return the most recent job of this type, if any, based on last heartbeat received.
>
>     Jobs in "running" state take precedence over others to make sure alive
>     job is returned if it is available.
>
>     :param job_type: job type to query for to get the most recent job for
>     :param session: Database session
>     :end_date: None
>     """
>     return session.scalar(
>         select(Job)
>         .where(Job.job_type == job_type)
>         .order_by(
>             # Put "running" jobs at the front.
>             case({JobState.RUNNING: 0}, value=Job.state, else_=1),
>             Job.latest_heartbeat.desc(),
>         )
>         .limit(1)
>     )
>
>
> This database query sorts jobs in the RUNNING state first, then by latest
> heartbeat (newest first), and applies .limit(1), so it returns only a single
> job record.
> This result is then processed in airflow_health.py via the get_airflow_health
> endpoint method:
>
> def get_airflow_health() -> dict[str, Any]:
>     """Get the health for Airflow metadatabase, scheduler and triggerer."""
>     metadatabase_status = HEALTHY
>     latest_scheduler_heartbeat = None
>     latest_triggerer_heartbeat = None
>     latest_dag_processor_heartbeat = None
>
>     scheduler_status = UNHEALTHY
>     triggerer_status: str | None = UNHEALTHY
>     dag_processor_status: str | None = UNHEALTHY
>
>     try:
>         latest_scheduler_job = SchedulerJobRunner.most_recent_job()
>
>         if latest_scheduler_job:
>             if latest_scheduler_job.latest_heartbeat:
>                 latest_scheduler_heartbeat = latest_scheduler_job.latest_heartbeat.isoformat()
>             if latest_scheduler_job.is_alive():
>                 scheduler_status = HEALTHY
>     except Exception:
>         metadatabase_status = UNHEALTHY
>
>
> Because the health endpoint evaluates only the single job returned by
> most_recent_job(), the check can only ever validate the health of one
> scheduler at a time.
> In a distributed deployment with multiple active schedulers, as long as one
> instance is running cleanly, the endpoint reports healthy even if every
> other scheduler instance has gone down.
> To get meaningful scheduler status from the health endpoint, it should
> monitor every scheduler in the distributed environment instead of just a
> single one.
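
To make the blind spot concrete, here is a small standalone simulation of the
same ordering (running first, then newest heartbeat, keep one row). FakeJob
and pick_most_recent are hypothetical stand-ins, and the "running" / "failed"
state strings are illustrative; this does not touch the real Job model or the
database.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FakeJob:
    """Hypothetical stand-in for a scheduler job row."""

    hostname: str
    state: str
    latest_heartbeat: datetime


def pick_most_recent(jobs: list[FakeJob]) -> FakeJob | None:
    """Mimic ORDER BY (running first), latest_heartbeat DESC LIMIT 1."""
    ordered = sorted(
        jobs,
        key=lambda j: (0 if j.state == "running" else 1, -j.latest_heartbeat.timestamp()),
    )
    return ordered[0] if ordered else None


def ts(hour: int, minute: int, second: int) -> datetime:
    return datetime(2026, 6, 24, hour, minute, second, tzinfo=timezone.utc)


jobs = [
    FakeJob("scheduler-ha-instance-1", "running", ts(23, 15, 2)),
    FakeJob("scheduler-ha-instance-2", "failed", ts(23, 10, 14)),
    FakeJob("scheduler-ha-instance-3", "failed", ts(23, 9, 40)),
]

# Only one row survives, so the two failed instances never reach the check.
chosen = pick_most_recent(jobs)
print(chosen.hostname if chosen else None)
```

With two of three instances failed, the single selected row is still the
running one, which is why the current check would report healthy here.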
> The Proposed Solution
> To address this, we can add a new field called schedulers (plural, for
> multiple schedulers) to the health endpoint that returns a three-tier
> aggregated status:
>
> * HEALTHY: All registered scheduler instances are fully operational and
>   actively heartbeating.
> * DEGRADED: At least one scheduler instance is down or failing, but at least
>   one remaining instance is still working.
> * DOWN: All scheduler instances have failed or stopped working.
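
The three tiers above could be computed along these lines. This is a minimal
sketch: the function name is made up, and treating an empty instance list as
DOWN is my assumption, since the proposal does not say what to report when no
scheduler is registered.

```python
from typing import Iterable

HEALTHY = "HEALTHY"
DEGRADED = "DEGRADED"
DOWN = "DOWN"


def aggregate_status(instance_statuses: Iterable[str]) -> str:
    """Collapse per-instance statuses into one of the three proposed tiers."""
    statuses = list(instance_statuses)
    healthy = sum(1 for status in statuses if status == HEALTHY)
    if healthy == 0:
        # No healthy instance at all. Assumption: this also covers the case
        # where no instance is registered.
        return DOWN
    if healthy == len(statuses):
        return HEALTHY
    return DEGRADED


if __name__ == "__main__":
    print(aggregate_status(["HEALTHY", "DOWN", "HEALTHY"]))
```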
>
> Per-Instance Diagnostic Breakdown
> We should also add a per-instance breakdown as a nested list that shows the
> following:
>
> 1. hostname
> 2. status: Individual status
> 3. latest_heartbeat
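
One way an individual entry could be derived from a heartbeat, as a sketch
only: JobRecord is a hypothetical stand-in for whatever row the endpoint
reads, and the 30-second liveness threshold is an assumed value rather than
something taken from the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobRecord:
    """Hypothetical stand-in for one scheduler's job row."""

    hostname: str
    latest_heartbeat: datetime


def instance_entry(
    job: JobRecord,
    now: datetime,
    threshold: timedelta = timedelta(seconds=30),  # assumed threshold
) -> dict:
    """Build one element of the proposed "instances" list."""
    alive = now - job.latest_heartbeat <= threshold
    return {
        "hostname": job.hostname,
        "status": "HEALTHY" if alive else "DOWN",
        "latest_heartbeat": job.latest_heartbeat.isoformat(),
    }


if __name__ == "__main__":
    now = datetime(2026, 6, 24, 23, 15, 10, tzinfo=timezone.utc)
    job = JobRecord(
        "scheduler-ha-instance-2",
        datetime(2026, 6, 24, 23, 10, 14, tzinfo=timezone.utc),
    )
    print(instance_entry(job, now))
```

Passing `now` in explicitly keeps the function easy to test; a real
implementation would presumably use the current UTC time.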
>
> Example
>
> {
>   "metadatabase": {
>     "status": "healthy"
>   },
>   "scheduler": {
>     "scheduler_status": "healthy",
>     "latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
>   },
>   "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-3",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:14:59+00:00"
>       }
>     ]
>   }
> }
>
> The response could end up looking roughly like the example above, giving a
> more meaningful health endpoint that makes it easier to diagnose scheduler
> issues. This is a LAZY CONSENSUS proposal.
>