[ https://issues.apache.org/jira/browse/FLINK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacob Jona Fahlenkamp updated FLINK-36553: ------------------------------------------ Description: This would be useful for the same reasons as in Table API. AggregateFunctions already have a merge method, which could be used to merge the local aggregates. Batch jobs where you read a large amount of raw data -> key by -> aggregate currently write all the raw data to disk before the shuffle. This makes it infeasible to run a batch job over a large span of data, because of large disk requirements. If we had local global-aggregation the data could be preaggregated already, before being written to disk. For stream jobs it would help with network throughput and data skew. was: This would be useful for the same reasons as in Table API. AggregateFunctions already have a merge method, which could be used to merge the local aggregates. Batch jobs where you read a large amount of raw data -> key by -> aggregate currently write all the raw data to disk. This makes it infeasible to run a batch job over a large span of data, because of large disk requirements. If we had local global-aggregation the data could be preaggregated already, before being written to disk. For stream jobs it would help with network throughput and data skew. > Local Global aggregation in Datasream API > ----------------------------------------- > > Key: FLINK-36553 > URL: https://issues.apache.org/jira/browse/FLINK-36553 > Project: Flink > Issue Type: New Feature > Components: API / DataStream > Reporter: Jacob Jona Fahlenkamp > Priority: Major > > This would be useful for the same reasons as in Table API. AggregateFunctions > already have a merge method, which could be used to merge the local > aggregates. > Batch jobs where you read a large amount of raw data -> key by -> aggregate > currently write all the raw data to disk before the shuffle. This makes it > infeasible to run a batch job over a large span of data, because of large > disk requirements. If we had local global-aggregation the data could be > preaggregated already, before being written to disk. > For stream jobs it would help with network throughput and data skew. -- This message was sent by Atlassian Jira (v8.20.10#820010)