[ https://issues.apache.org/jira/browse/IMPALA-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenzhe Zhou reassigned IMPALA-10476:
------------------------------------
Assignee: (was: Wenzhe Zhou)
> Remove executor node with faulty disks from executor group
> ----------------------------------------------------------
>
> Key: IMPALA-10476
> URL: https://issues.apache.org/jira/browse/IMPALA-10476
> Project: IMPALA
> Issue Type: Sub-task
> Components: Distributed Exec
> Reporter: Wenzhe Zhou
> Priority: Major
>
> If an executor node frequently hits disk I/O failures when reading from or
> writing to local disk, it should report its unhealthy state to the
> statestore so that the node can be marked as down and removed from the
> executor group, avoiding repeated query failures in the cluster. This gives
> an executor node a mechanism to remove itself from scheduling.
> The two major components of Impala that read from and write to local disk
> are the spill-to-disk and data caching features. We need to add stats that
> count such local disk failures over a period of time, such as the last x
> seconds, and then use these stats to determine whether a node is healthy
> enough to execute query fragment instances.
> The health state of an executor node should be shown on the debug WebUI. We
> should also allow users to override the node's health state. Once its
> health state is overridden, the node will restart to register itself with
> the statestore.
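
The failure-counting idea in the description above (count local disk failures over the last x seconds, then judge node health from that count) can be illustrated with a small sliding-window counter. This is only a sketch under stated assumptions, not Impala's implementation: Impala itself is written in C++, and the class name DiskFailureWindow, the window_seconds and max_failures parameters, and the "healthy while the count stays at or below the threshold" rule are all hypothetical choices made for illustration.

```python
import time
from collections import deque


class DiskFailureWindow:
    """Hypothetical helper: counts disk I/O failures in the last N seconds."""

    def __init__(self, window_seconds, max_failures, clock=time.monotonic):
        self.window_seconds = window_seconds  # the "last x seconds" window
        self.max_failures = max_failures      # assumed health threshold
        self._clock = clock                   # injectable clock for testing
        self._failures = deque()              # failure timestamps, oldest first

    def _expire(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()

    def record_failure(self):
        # Called whenever a local disk read or write fails.
        now = self._clock()
        self._expire(now)
        self._failures.append(now)

    def failure_count(self):
        # Number of failures still inside the window.
        self._expire(self._clock())
        return len(self._failures)

    def is_healthy(self):
        # Assumed rule: healthy while failures stay at or below the threshold.
        return self.failure_count() <= self.max_failures


# Usage with a fake clock so the example is deterministic.
now = [0.0]
window = DiskFailureWindow(window_seconds=60, max_failures=2,
                           clock=lambda: now[0])
for t in (0.0, 10.0, 20.0):
    now[0] = t
    window.record_failure()
print(window.is_healthy())  # False: three failures within the last 60 seconds
now[0] = 75.0
print(window.is_healthy())  # True: only the failure at t=20 is still in window
```

In a design along these lines, a node whose is_healthy() check fails would report its unhealthy state to the statestore. Keeping raw timestamps makes the count exact at the cost of memory proportional to the failure rate; a fixed set of time buckets would bound memory if failures can be very frequent.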
--
This message was sent by Atlassian Jira
(v8.20.10#820010)