Xiaoyu Yao created HDFS-9723:
--------------------------------
Summary: Improve Namenode Throttling Against Bad Jobs with FCQ and
CallerContext
Key: HDFS-9723
URL: https://issues.apache.org/jira/browse/HDFS-9723
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Xiaoyu Yao
Assignee: Xiaoyu Yao
The HDFS namenode handles RPC requests from DFS clients and internal processing
from datanodes. It has been a recurring pain that some bad jobs overwhelm the
namenode and bring the whole cluster down. FCQ (Fair Call Queue), introduced by
HADOOP-9640, is one of the existing efforts added since Hadoop 2.4 to address
this issue.
In the current FCQ implementation, incoming RPC calls are scheduled based on
the number of recent RPC calls (1000) of different users with a time-decayed
scheduler. This works well when there is a clear mapping between users and
their RPC calls from different jobs. However, it may not work effectively
when calls are hard to trace back to a specific caller in a chain of
operations from a workflow (e.g. Oozie -> Hive -> YARN). It is not feasible
for operators/administrators to throttle all Hive jobs because of one “bad”
query.
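
To make the scheduling idea above concrete, here is a minimal sketch of
time-decayed call counting with share-based priority levels. It is not the
actual Hadoop scheduler: the class name DecayingCallCounter, the decay factor,
and the threshold values are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of time-decayed per-identity call counting (not Hadoop code). */
final class DecayingCallCounter {
    private final Map<String, Double> counts = new HashMap<>();
    private final double decayFactor;  // assumed: fraction of each count kept per sweep
    private final double[] thresholds; // assumed: ascending share-of-traffic cutoffs

    DecayingCallCounter(double decayFactor, double[] thresholds) {
        this.decayFactor = decayFactor;
        this.thresholds = thresholds.clone();
    }

    /** Record one RPC call for the given identity (for example, a user name). */
    void record(String identity) {
        counts.merge(identity, 1.0, Double::sum);
    }

    /** Periodic sweep: scale every count down so that older calls matter less. */
    void decay() {
        counts.replaceAll((id, c) -> c * decayFactor);
    }

    /** Priority 0 is the highest; identities with a larger share get larger numbers. */
    int priorityOf(String identity) {
        double total = 0.0;
        for (double c : counts.values()) {
            total += c;
        }
        if (total == 0.0) {
            return 0;
        }
        double share = counts.getOrDefault(identity, 0.0) / total;
        int level = 0;
        for (double t : thresholds) {
            if (share >= t) {
                level++;
            }
        }
        return level;
    }
}
```

With this sketch, every Hive query submitted under the same user name adds to
a single count, so one heavy query pushes all of that user's calls into a
lower-priority level, which is the limitation described above.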
This JIRA proposes to leverage RPC caller context information (such as
callerType: caller Id from TEZ-2851), available with HDFS-9184, as an
alternative to the existing Identity Provider based on UGI (or user name when
a delegation token is not available), to improve the effectiveness of the
Hadoop RPC Fair Call Queue (HADOOP-9640) for better namenode throttling in
multi-tenant cluster deployments.
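
The proposal can be sketched as two interchangeable identity providers, one
keyed on the user name and one keyed on the caller context. This is only an
illustration of the idea, not the patch itself: CallInfo, CallIdentityProvider,
UserBasedProvider, and CallerContextProvider are hypothetical stand-ins, and
the fallback to the user name when no context is set is an assumption.

```java
/** Hypothetical stand-in for the metadata of one RPC call (not a Hadoop class). */
final class CallInfo {
    final String user;          // user name derived from the UGI
    final String callerContext; // assumed form "callerType:callerId"; may be null

    CallInfo(String user, String callerContext) {
        this.user = user;
        this.callerContext = callerContext;
    }
}

/** Hypothetical interface: maps a call to the identity the scheduler counts. */
interface CallIdentityProvider {
    String makeIdentity(CallInfo call);
}

/** Existing behavior as described in this issue: all calls of a user share one identity. */
final class UserBasedProvider implements CallIdentityProvider {
    @Override
    public String makeIdentity(CallInfo call) {
        return call.user;
    }
}

/** Proposed alternative: use the caller context so that each job or query is counted separately. */
final class CallerContextProvider implements CallIdentityProvider {
    @Override
    public String makeIdentity(CallInfo call) {
        String ctx = call.callerContext;
        // Assumption: fall back to the user name when no caller context was set.
        if (ctx == null || ctx.isEmpty()) {
            return call.user;
        }
        return ctx;
    }
}
```

For example, two calls from user "hive" carrying the made-up contexts
"HIVE_QUERY_ID:q1" and "HIVE_QUERY_ID:q2" map to the single identity "hive"
under UserBasedProvider but to two distinct identities under
CallerContextProvider, so only the "bad" query would be deprioritized.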