... apparently I can heart-react emails in Outlook now?  If that went out to 
the dev list, I apologize, it was a mis-click.

Vincent - Do you have any ideas or suggestions for multi-team-ifying the layout 
other than just adding team_name to the individual components where it is 
available?   It might be cool to add a new top-level entry for each team and 
list their resources.

It might be nice to fold the entire existing dash into "global" but that would 
break everything and it would be an awkward extra layer for non-multi-team, so 
let's not:

  dag_processors/
  ├── status
  ├── heartbeat
  ├── dp_1/
  │   ├── status
  │   ├── heartbeat
  │   └── team_name
  └── dp_2/
      └── ...etc
  metadatabase/
  └── status
  schedulers/
  ├── status
  ├── heartbeat
  ├── sched_1/
  │   ├── status
  │   ├── heartbeat
  │   └── team_name
  ├── sched_2/
  │   ├── status
  │   ├── heartbeat
  │   └── team_name
  └── ...etc

  team_1/
  └── dag_processors/
      ├── status
      ├── heartbeat
      ├── dp_1/
      │   ├── status
      │   ├── heartbeat
      │   └── team_name  # redundant with the team tiering, but assuming
      │                  # we're reusing the same data from above
      └── ...etc

  team_2/
  └── dag_processors/
      ├── status
      ├── heartbeat
      ├── dp_2/
      │   ├── status
      │   ├── heartbeat
      │   └── team_name
      └── ...etc
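
A rough sketch of how the per-team entries could be derived from the same flat per-instance data. The record shape (component, name, status, team_name) and the build_team_view helper are made up for illustration; none of this is existing Airflow code.

```python
from collections import defaultdict

# Hypothetical flat records; the field names are assumptions, not Airflow's schema.
instances = [
    {"component": "dag_processors", "name": "dp_1", "status": "HEALTHY", "team_name": "team_1"},
    {"component": "dag_processors", "name": "dp_2", "status": "DOWN", "team_name": "team_2"},
    {"component": "schedulers", "name": "sched_1", "status": "HEALTHY", "team_name": "team_1"},
    {"component": "schedulers", "name": "sched_2", "status": "HEALTHY", "team_name": None},
]


def build_team_view(records):
    """Group records as team -> component -> instance name -> record.

    Records without a team_name are skipped, so a non-multi-team
    deployment gets an empty view and no extra layer.
    """
    view = defaultdict(lambda: defaultdict(dict))
    for rec in records:
        team = rec.get("team_name")
        if team is None:
            continue
        view[team][rec["component"]][rec["name"]] = rec
    # Convert to plain dicts so the result serializes cleanly.
    return {
        team: {comp: dict(items) for comp, items in comps.items()}
        for team, comps in view.items()
    }


team_view = build_team_view(instances)
```

The top-level entries stay exactly as they are; the team entries are only another view over the same records, which is why team_name ends up redundant inside them.
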
________________________________
From: Vincent Beck <[email protected]>
Sent: Thursday, June 25, 2026 10:24 AM
To: [email protected] <[email protected]>
Subject: RE: [EXT] [DISCUSS] /api/v2/monitor/health endpoint does not give 
meaningful information into Scheduler statuses

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



Hey,

Agreeing with Dennis here, the proposal looks good. I'd suggest extending it to 
cover the other components that can run as multiple instances in an Airflow 
environment: the dag processor and the triggerer.

The primary goal should be to support environments running multiple instances 
of the same component, regardless of whether the setup is multi-team. That 
said, in multi-team environments, it would be valuable to surface team 
ownership per component. For example, if a dag processor goes down, we should 
be able to immediately identify which team is affected.
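
To make the dag processor example concrete, here is a minimal sketch of mapping down instances back to teams. It assumes each instance record carries a team_name and a status of HEALTHY or DOWN, as in Dennis's example; the component field and the affected_teams helper are hypothetical.

```python
def affected_teams(instances):
    """Return {team_name: [component types with at least one DOWN instance]}."""
    result = {}
    for inst in instances:
        if inst["status"] != "DOWN":
            continue
        # Instances without a team (non-multi-team setups) are grouped under None.
        team = inst.get("team_name")
        components = result.setdefault(team, [])
        if inst["component"] not in components:
            components.append(inst["component"])
    return result


instances = [
    {"component": "dag_processor", "hostname": "dp-1", "status": "DOWN", "team_name": "team_1"},
    {"component": "triggerer", "hostname": "trig-1", "status": "HEALTHY", "team_name": "team_1"},
    {"component": "scheduler", "hostname": "sched-2", "status": "HEALTHY", "team_name": "team_2"},
]
print(affected_teams(instances))  # {'team_1': ['dag_processor']}
```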

Vincent

On 2026/06/25 16:53:13 "Ferruzzi, Dennis" wrote:
> Thanks for looking into this.   I'm a strong +1 with a caveat (see below)
>
> One alarming side effect of the way this is calculated and reported is that 
> in a multi-team environment, Team_1's scheduler may be down entirely and the 
> current dashboard will report HEALTHY as long as at least one other team is 
> live.  I'm not sure if we can squeeze this in as a bug-fix because that seems 
> like a gap we should fix.  Teghveer just confirmed last night that the same 
> calculation/reporting is being used for the Triggerers as well, so I am 
> amending your proposal; whatever we decide here will also be applied to 
> Triggerers. (Teghveer is willing to do that part of the work in 
> parallel once we have a consensus.)
>
> That said, here's my opinion on the proposal:
>
> Caveat:  I like the proposal, on the condition that the value reported by the 
> existing "scheduler" is unchanged.  We can (and should?) deprecate and remove 
> that in a future release with instructions to move to using "schedulers" 
> instead, but for now we can't break existing monitoring.
>
> Additional non-blocking suggestion:  Also, let's add the "team_name" in the 
> individual scheduler schema since it's available:
>
> "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "team_name": "team_1",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "team_name": "team_2",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>       ...etc
>     ]
> }
>
> - ferruzzi
> ________________________________
> From: Jung-Hyun Kim <[email protected]>
> Sent: Wednesday, June 24, 2026 5:01 PM
> To: [email protected] <[email protected]>
> Subject: [EXT] [DISCUSS] /api/v2/monitor/health endpoint does not give 
> meaningful information into Scheduler statuses
>
> The Problem
> In distributed Airflow environments running multiple schedulers, the current 
> health endpoint contains a significant monitoring blind spot.
> Currently, the health check determines the status of the scheduler by 
> querying the metadata database using the most_recent_job method found in 
> job.py:
>
> @provide_session
> def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job | None:
>     """
>     Return the most recent job of this type, if any, based on last heartbeat received.
>
>     Jobs in "running" state take precedence over others to make sure alive
>     job is returned if it is available.
>
>     :param job_type: job type to query for to get the most recent job for
>     :param session: Database session
>     :end_date: None
>     """
>     return session.scalar(
>         select(Job)
>         .where(Job.job_type == job_type)
>         .order_by(
>             # Put "running" jobs at the front.
>             case({JobState.RUNNING: 0}, value=Job.state, else_=1),
>             Job.latest_heartbeat.desc(),
>         )
>         .limit(1)
>     )
>
>
> This database query sorts RUNNING jobs first, then by latest heartbeat 
> (newest first), and applies .limit(1), so it returns only a single job record.
> This result is then processed in airflow_health.py via the get_airflow_health 
> endpoint method:
>
> def get_airflow_health() -> dict[str, Any]:
>     """Get the health for Airflow metadatabase, scheduler and triggerer."""
>     metadatabase_status = HEALTHY
>     latest_scheduler_heartbeat = None
>     latest_triggerer_heartbeat = None
>     latest_dag_processor_heartbeat = None
>
>     scheduler_status = UNHEALTHY
>     triggerer_status: str | None = UNHEALTHY
>     dag_processor_status: str | None = UNHEALTHY
>
>     try:
>         latest_scheduler_job = SchedulerJobRunner.most_recent_job()
>
>         if latest_scheduler_job:
>             if latest_scheduler_job.latest_heartbeat:
>                 latest_scheduler_heartbeat = latest_scheduler_job.latest_heartbeat.isoformat()
>             if latest_scheduler_job.is_alive():
>                 scheduler_status = HEALTHY
>     except Exception:
>         metadatabase_status = UNHEALTHY
>
>
> Because the health endpoint evaluates only the single job returned by 
> most_recent_job(), the check can only ever validate the health of one 
> scheduler at a time.
> In a distributed deployment with multiple active schedulers, if even one 
> instance is running cleanly, the endpoint will report healthy even if all 
> other parallel scheduler instances have gone down.
> To get meaningful scheduler status information from the health endpoint, it 
> should monitor every scheduler in the distributed environment instead of 
> just a single one.
>
> The Proposed Solution
> To deal with this problem we can add a new field called schedulers (plural 
> for multiple schedulers) in the health endpoint that returns a 3-tier 
> aggregated status that covers the following:
>
>   * HEALTHY: All registered scheduler instances are fully operational and 
>     actively heartbeating.
>   * DEGRADED: At least one scheduler instance is down or failing, but at 
>     least one remaining instance is still working.
>   * DOWN: All scheduler instances have failed or stopped working.
>
> Per-Instance Diagnostic Breakdown
> We should also add a per-instance breakdown as a nested list that will show 
> the following:
>
>   1. hostname
>   2. status: Individual status
>   3. latest_heartbeat
>
> Example
>
> {
>   "metadatabase": {
>     "status": "healthy"
>   },
>   "scheduler": {
>     "scheduler_status": "healthy",
>     "latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
>   },
>   "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-3",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:14:59+00:00"
>       }
>     ]
>   }
> }
>
> The response could end up looking roughly like the example above, giving a 
> more meaningful health endpoint that makes it easier to diagnose issues with 
> the scheduler. This is a LAZY CONSENSUS proposal.
>
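For reference, the 3-tier aggregation and per-instance breakdown from the proposal could be sketched roughly as follows. This is only an illustration under stated assumptions: the HEARTBEAT_THRESHOLD value, the instance_status and aggregate helpers, and the plain-dict job records are all hypothetical, and a real implementation would presumably reuse the existing liveness check rather than a fixed timedelta.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, chosen only for this example.
HEARTBEAT_THRESHOLD = timedelta(seconds=30)


def instance_status(latest_heartbeat, now):
    """HEALTHY if the heartbeat is recent enough, DOWN otherwise."""
    if latest_heartbeat is None:
        return "DOWN"
    return "HEALTHY" if now - latest_heartbeat <= HEARTBEAT_THRESHOLD else "DOWN"


def aggregate(jobs, now):
    """Build the proposed {"status": ..., "instances": [...]} structure."""
    instances = [
        {
            "hostname": job["hostname"],
            "status": instance_status(job["latest_heartbeat"], now),
            "latest_heartbeat": (
                job["latest_heartbeat"].isoformat() if job["latest_heartbeat"] else None
            ),
        }
        for job in jobs
    ]
    healthy = sum(1 for inst in instances if inst["status"] == "HEALTHY")
    if instances and healthy == len(instances):
        status = "HEALTHY"
    elif healthy > 0:
        status = "DEGRADED"
    else:
        status = "DOWN"  # also covers the case of no registered instances
    return {"status": status, "instances": instances}


def utc(*args):
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


now = utc(2026, 6, 24, 23, 15, 10)
jobs = [
    {"hostname": "scheduler-ha-instance-1", "latest_heartbeat": utc(2026, 6, 24, 23, 15, 2)},
    {"hostname": "scheduler-ha-instance-2", "latest_heartbeat": utc(2026, 6, 24, 23, 10, 14)},
    {"hostname": "scheduler-ha-instance-3", "latest_heartbeat": utc(2026, 6, 24, 23, 14, 59)},
]
summary = aggregate(jobs, now)
```

With these inputs the summary status is DEGRADED, matching the example in the proposal. Unlike the single most_recent_job() lookup, every instance contributes to the result, so one live scheduler cannot mask the others being down.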
