Hey,

I agree with Dennis: the proposal looks good. I'd suggest extending it to
cover the other components that can run as multiple instances in an Airflow
environment: the dag processor and the triggerer.

The primary goal should be to support environments running multiple instances 
of the same component, regardless of whether the setup is multi-team. That 
said, in multi-team environments, it would be valuable to surface team 
ownership per component. For example, if a dag processor goes down, we should 
be able to immediately identify which team is affected.
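
To make that concrete, here is a rough sketch of how per-instance health could
be rolled up per component and per team. This is not existing Airflow code:
InstanceHealth, rollup and status_by_component_and_team are names I made up
for illustration, and the "alive" flag stands in for whatever liveness check
we end up using.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceHealth:
    """Hypothetical per-instance record; not an Airflow model."""

    component: str  # e.g. "scheduler", "triggerer", "dag_processor"
    hostname: str
    team_name: str | None  # None when the setup is not multi-team
    alive: bool


def rollup(alive_flags: list[bool]) -> str:
    """Apply the 3-tier status from the proposal to a non-empty group."""
    if all(alive_flags):
        return "HEALTHY"
    return "DEGRADED" if any(alive_flags) else "DOWN"


def status_by_component_and_team(
    instances: list[InstanceHealth],
) -> dict[tuple[str, str | None], str]:
    """Group instances by (component, team) and roll each group up."""
    groups: dict[tuple[str, str | None], list[bool]] = defaultdict(list)
    for inst in instances:
        groups[(inst.component, inst.team_name)].append(inst.alive)
    return {key: rollup(flags) for key, flags in groups.items()}
```

With this grouping, a dag processor that is down for team_1 shows up as DOWN
under ("dag_processor", "team_1") even if team_2's dag processor is fine.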

Vincent

On 2026/06/25 16:53:13 "Ferruzzi, Dennis" wrote:
> Thanks for looking into this. I'm a strong +1, with a caveat (see below).
> 
> One alarming side effect of the way this is calculated and reported is that, 
> in a multi-team environment, Team_1's scheduler may be down entirely and the 
> current dashboard will still report HEALTHY as long as at least one other 
> team's scheduler is live. That seems like a gap we should fix, so I wonder 
> whether we can squeeze this in as a bug fix. Teghveer just confirmed last 
> night that the same calculation/reporting is used for the Triggerers, so I 
> am amending your proposal: whatever we decide here will also be applied to 
> Triggerers. (Teghveer is willing to do that part of the work in parallel 
> once we have a consensus.)
> 
> That said, here's my opinion on the proposal:
> 
> Caveat:  I like the proposal, on the condition that the value reported by the 
> existing "scheduler" is unchanged.  We can (and should?) deprecate and remove 
> that in a future release with instructions to move to using "schedulers" 
> instead, but for now we can't break existing monitoring.
> 
> Additional non-blocking suggestion: let's add "team_name" to the individual 
> scheduler schema, since it's available:
> 
> "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "team_name": "team_1",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "team_name": "team_2",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>      etc...
> }
> 
> - ferrruzi
> ________________________________
> From: Jung-Hyun Kim <[email protected]>
> Sent: Wednesday, June 24, 2026 5:01 PM
> To: [email protected] <[email protected]>
> Subject: [EXT] [DISCUSS] /api/v2/monitor/health endpoint does not give 
> meaningful information into Scheduler statuses
> 
> The Problem
> In distributed Airflow environments running multiple schedulers, the current 
> health endpoint contains a significant monitoring blind spot.
> Currently, the health check determines the status of the scheduler by 
> querying the metadata database using the most_recent_job method found in 
> job.py:
> 
> @provide_session
> def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job | None:
>     """
>     Return the most recent job of this type, if any, based on last heartbeat received.
> 
>     Jobs in "running" state take precedence over others to make sure alive
>     job is returned if it is available.
> 
>     :param job_type: job type to query for to get the most recent job for
>     :param session: Database session
>     :end_date: None
>     """
>     return session.scalar(
>         select(Job)
>         .where(Job.job_type == job_type)
>         .order_by(
>             # Put "running" jobs at the front.
>             case({JobState.RUNNING: 0}, value=Job.state, else_=1),
>             Job.latest_heartbeat.desc(),
>         )
>         .limit(1)
>     )
> 
> 
> This database query sorts RUNNING jobs first, then orders by latest 
> heartbeat (newest first), and applies .limit(1), so it returns only a single 
> job record.
> This result is then processed in airflow_health.py via the get_airflow_health 
> endpoint method:
> 
> def get_airflow_health() -> dict[str, Any]:
>     """Get the health for Airflow metadatabase, scheduler and triggerer."""
>     metadatabase_status = HEALTHY
>     latest_scheduler_heartbeat = None
>     latest_triggerer_heartbeat = None
>     latest_dag_processor_heartbeat = None
> 
>     scheduler_status = UNHEALTHY
>     triggerer_status: str | None = UNHEALTHY
>     dag_processor_status: str | None = UNHEALTHY
> 
>     try:
>         latest_scheduler_job = SchedulerJobRunner.most_recent_job()
> 
>         if latest_scheduler_job:
>             if latest_scheduler_job.latest_heartbeat:
>                 latest_scheduler_heartbeat = latest_scheduler_job.latest_heartbeat.isoformat()
>             if latest_scheduler_job.is_alive():
>                 scheduler_status = HEALTHY
>     except Exception:
>         metadatabase_status = UNHEALTHY
> 
> 
> Because the health endpoint evaluates only the single job returned by 
> most_recent_job(), the check can only ever validate the health of one 
> scheduler at a time.
> In a distributed deployment with multiple active schedulers, as long as one 
> instance is running cleanly, the endpoint reports healthy even if every 
> other scheduler instance has gone down.
> To give meaningful information about scheduler status, the health endpoint 
> should monitor every scheduler in the distributed environment instead of 
> just one.
> The Proposed Solution
> To deal with this problem we can add a new field called schedulers (plural 
> for multiple schedulers) in the health endpoint that returns a 3-tier 
> aggregated status that covers the following:
> 
>   * HEALTHY: All registered scheduler instances are fully operational and 
>     actively heartbeating.
>   * DEGRADED: At least one scheduler instance is down or failing, but at 
>     least one remaining instance is still working.
>   * DOWN: All scheduler instances have failed or stopped working.
> 
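
Inline note: the three tiers above could be computed along these lines. This
is only a sketch under my own assumptions, not code from the proposal:
aggregate_status is a hypothetical name, and treating "no registered
instances" as DOWN is a choice I made for illustration.

```python
from collections.abc import Iterable


def aggregate_status(alive_flags: Iterable[bool]) -> str:
    """Roll per-instance liveness up into HEALTHY / DEGRADED / DOWN."""
    flags = list(alive_flags)
    # Assumption: with no registered instances nothing is working, so
    # report DOWN rather than HEALTHY.
    if not flags or not any(flags):
        return "DOWN"
    if all(flags):
        return "HEALTHY"
    # Some instances are alive and some are not.
    return "DEGRADED"
```

For the three-instance example later in this message (two alive, one down),
this returns "DEGRADED".
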
> Per-Instance Diagnostic Breakdown
> We should also add a per instance breakdown as a nested list that will show 
> the following:
> 
>   1. hostname
>   2. status: Individual status
>   3. latest_heartbeat
> 
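
Inline note: a single entry in that list could be built roughly as follows.
Again a sketch, not the proposed implementation: instance_entry is a
hypothetical helper, and the fixed 30-second threshold is an assumption made
for illustration (the quoted code relies on the job's own is_alive() check).

```python
from __future__ import annotations

from datetime import datetime, timedelta


def instance_entry(
    hostname: str,
    latest_heartbeat: datetime,
    now: datetime,
    threshold: timedelta = timedelta(seconds=30),  # assumed value
) -> dict[str, str]:
    """Build one per-instance record with the three proposed fields."""
    # An instance counts as alive if its last heartbeat is recent enough.
    alive = now - latest_heartbeat <= threshold
    return {
        "hostname": hostname,
        "status": "HEALTHY" if alive else "DOWN",
        "latest_heartbeat": latest_heartbeat.isoformat(),
    }
```

Passing "now" explicitly keeps the helper easy to test, since the result does
not depend on the wall clock.
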
> Example
> 
> {
>   "metadatabase": {
>     "status": "healthy"
>   },
>   "scheduler": {
>     "scheduler_status": "healthy",
>     "latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
>   },
>   "schedulers": {
>     "status": "DEGRADED",
>     "instances": [
>       {
>         "hostname": "scheduler-ha-instance-1",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:15:02+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-2",
>         "status": "DOWN",
>         "latest_heartbeat": "2026-06-24T23:10:14+00:00"
>       },
>       {
>         "hostname": "scheduler-ha-instance-3",
>         "status": "HEALTHY",
>         "latest_heartbeat": "2026-06-24T23:14:59+00:00"
>       }
>     ]
>   }
> }
> 
> The response could end up looking roughly like the example above, resulting 
> in a more meaningful health endpoint that makes it easier to diagnose issues 
> with the scheduler. This is a LAZY CONSENSUS proposal.
> 
