[ https://issues.apache.org/jira/browse/FLINK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Walther closed FLINK-23402.
--------------------------------
    Release Note: The default DataStream API shuffle mode for batch executions has been changed to blocking exchanges for all edges of the stream graph. A new option `execution.shuffle-mode` allows changing it to pipelined behavior if necessary.
      Resolution: Fixed

Fixed in 1.14.0:

commit a78f34a735c4619cfef882f9b9a2057c507a4bca
[streaming-java][table-planner] Add ShuffleMode option

commit 0139222030d5e3dac2b9ffe7200c758ab6153fff
[streaming-java] Default to GlobalStreamExchangeMode.ALL_EDGES_BLOCKING in batch mode

commit 313718466d15b473bd5bf1dcf0d9d988e0fd5979
[streaming-java] Mark GlobalStreamExchangeMode as @Internal

commit 156f517d387202ac292bde5bfac423a23908b7a2
[streaming-java] Refactor GlobalDataExchangeMode to GlobalStreamExchangeMode

commit 86f54c89c7866647e50d3957026bd0d28869ea8d
[streaming-java] Fix minor code issues around 'shuffle mode'

commit 4e65322dc1b5f80a7f3a42f0f205f978357daa40
[streaming-java] Refactor ShuffleMode to StreamExchangeMode

> Expose a consistent GlobalDataExchangeMode
> ------------------------------------------
>
>                 Key: FLINK-23402
>                 URL: https://issues.apache.org/jira/browse/FLINK-23402
>             Project: Flink
>          Issue Type: Sub-task
>          Components: API / DataStream
>            Reporter: Timo Walther
>            Assignee: Timo Walther
>            Priority: Major
>              Labels: pull-request-available
>
> The Table API makes the {{GlobalDataExchangeMode}} configurable via
> {{table.exec.shuffle-mode}}.
> In Table API batch mode the StreamGraph is configured with
> {{ALL_EDGES_BLOCKING}} and in DataStream API batch mode with
> {{FORWARD_EDGES_PIPELINED}}.
> I would vote for unifying the exchange mode of both APIs so that complex SQL
> pipelines behave identically in {{StreamTableEnvironment}} and
> {{TableEnvironment}}. Also, the feedback I got so far suggests that
> {{ALL_EDGES_BLOCKING}} is the safer option for running pipelines successfully
> with limited resources.
> [~lzljs3620320]
> {quote}
> The previous history was like this:
> - The default value is pipeline, and we find that many times due to
> insufficient resources, the deployment will hang. And the typical use of
> batch jobs is small resources running large parallelisms, because in batch
> jobs, the granularity of failover is related to the amount of data processed
> by a single task. The smaller the amount of data, the faster the fault
> tolerance. So most of the scenarios are run with small resources and large
> parallelisms, little by little slowly running.
> - Later, we switched the default value to blocking. We found that the better
> blocking shuffle implementation would not slow down the running speed much.
> We tested tpc-ds and it took almost the same time.
> {quote}
> [~dwysakowicz]
> {quote}
> I don't see a problem with changing the default value for DataStream batch
> mode if you think ALL_EDGES_BLOCKING is the better default option.
> {quote}
> In any case, we should make this configurable for DataStream API users and
> make the specific Table API option obsolete.
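
To make the difference between the exchange modes discussed in this issue concrete, here is a minimal, self-contained Java sketch of how a global mode could map to a per-edge decision. It is an illustrative model only, not Flink's implementation: the class `ExchangeModeSketch`, the enums `EdgeKind` and `Exchange`, and the method `resolve` are hypothetical names, and the `ALL_EDGES_PIPELINED` constant is assumed here to stand in for the "pipelined behavior" mentioned in the release note.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of exchange modes; not Flink's actual classes.
class ExchangeModeSketch {

    /** How an edge connects producer and consumer subtasks (simplified). */
    enum EdgeKind { FORWARD, RESCALE, ALL_TO_ALL }

    /** Per-edge outcome: stream records through, or materialize them first. */
    enum Exchange { PIPELINED, BLOCKING }

    /** Global modes; each one lists the edge kinds it keeps pipelined. */
    enum GlobalMode {
        // New DataStream batch default according to the release note.
        ALL_EDGES_BLOCKING(EnumSet.noneOf(EdgeKind.class)),
        // Previous DataStream batch default according to the issue description.
        FORWARD_EDGES_PIPELINED(EnumSet.of(EdgeKind.FORWARD)),
        // Assumed name for the fully pipelined alternative.
        ALL_EDGES_PIPELINED(EnumSet.allOf(EdgeKind.class));

        private final Set<EdgeKind> pipelinedKinds;

        GlobalMode(Set<EdgeKind> pipelinedKinds) {
            this.pipelinedKinds = pipelinedKinds;
        }
    }

    /** Resolves the exchange type of one edge under the given global mode. */
    static Exchange resolve(GlobalMode mode, EdgeKind kind) {
        return mode.pipelinedKinds.contains(kind) ? Exchange.PIPELINED : Exchange.BLOCKING;
    }

    public static void main(String[] args) {
        // Print the full decision table for every mode and edge kind.
        for (GlobalMode mode : GlobalMode.values()) {
            for (EdgeKind kind : EdgeKind.values()) {
                System.out.println(mode + " / " + kind + " -> " + resolve(mode, kind));
            }
        }
    }
}
```

Under this model, a pipelined edge requires producer and consumer to run at the same time, while a blocking edge lets the consumer start after the producer has finished. That is consistent with the comment quoted above that pipelined defaults can hang at deployment when resources are insufficient.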