Jin Xing created FLINK-22672:
--------------------------------
Summary: Some enhancements for pluggable shuffle service framework
Key: FLINK-22672
URL: https://issues.apache.org/jira/browse/FLINK-22672
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: Jin Xing
"Pluggable shuffle service" in Flink provides an architecture which are unified
for both streaming and batch jobs, allowing user to customize the process of
data transfer between shuffle stages according to scenarios.
There are already a number of implementations of "remote shuffle service" on
Spark like [1][2][3]. Remote shuffle enables to shuffle data from/to a remote
cluster and achieves benefits like :
# The lifecycle of computing resource can be decoupled with shuffle data, once
computing task is finished, idle computing nodes can be released with its
completed shuffle data accormadated on remote shuffle cluster.
# There is no need to reserve disk capacity for shuffle on computing nodes.
Remote shuffle cluster serves shuffling request with better scaling ability and
alleviates the local disk pressure on computing nodes when data skew.
Based "pluggable shuffle service", we build our own "remote shuffle service" on
Flink -- Lattice, which targets to provide functionalities and improve
performance for batch processing jobs. Basically it works as below:
# Lattice cluster works as an independent service for shuffling request;
# LatticeShuffleMaster extends ShuffleMaster, works inside JM and talks with
remote Lattice cluster for shuffle resouce application and shuffle data
lifecycle management;
# LatticeShuffleEnvironmente extends ShuffleEnvironment, works inside TM and
provides an environment for shuffling data from/to remote Lattice cluster;
During the process of building Lattice we find some potential enhancements on
"pluggable shuffle service". I will enumerate and create some sub JIRAs under
this umbrella
[1]
[https://www.alibabacloud.com/blog/emr-remote-shuffle-service-a-powerful-elastic-tool-of-serverless-spark_597728]
[2] [https://bestoreo.github.io/post/cosco/cosco/]
[3] [https://github.com/uber/RemoteShuffleService]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)