[jira] [Updated] (FLINK-27696) Add bin-pack strategy to split the whole bucket data files into several small splits

Yu Li (Jira) Wed, 24 Aug 2022 19:31:13 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yu Li updated FLINK-27696:
--------------------------
    Component/s: Table Store

> Add bin-pack strategy to split the whole bucket data files into several small 
> splits
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-27696
>                 URL: https://issues.apache.org/jira/browse/FLINK-27696
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table Store
>            Reporter: Zheng Hu
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: table-store-0.2.0
>
>
> We don't have to assign all of a bucket's data files to a single task. 
> Instead, we can use an algorithm (such as bin-packing) to split the bucket's 
> data files into multiple fragments and improve job parallelism.
> For a merge tree table:
> Suppose there are files with the following key ranges: [1, 2] [3, 4] 
> [5, 180] [5, 190] [200, 600] [210, 700]
> Files whose ranges do not intersect are unrelated, so we do not need to put 
> all of them into one split. We can slice them into multiple splits, and 
> reading those splits in parallel is faster. We should not slice too finely 
> either: each split should be as large as possible within a 128 MB target. 
> Using BinPack to slice, the final result will be:
>  * split1: [1, 2] [3, 4]
>  * split2: [5, 180] [5, 190]
>  * split3: [200, 600] [210, 700]
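
The slicing described above can be sketched in two steps: first group files
whose key ranges intersect, then pack the groups in order into splits of
roughly the target size. This is only an illustrative sketch, not the Table
Store implementation. The names DataFile, group_overlapping and bin_pack are
hypothetical, the file sizes are invented for the example (the issue does not
give any), and the simple ordered "next-fit" packing is an assumption about
what "BinPack" means here.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file's metadata."""
    min_key: int
    max_key: int
    size: int  # bytes (invented for this example)


def group_overlapping(files: List[DataFile]) -> List[List[DataFile]]:
    """Group files whose key ranges intersect; such files must stay together."""
    sections: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_max = 0
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        # A file starting after the running maximum key begins a new section.
        if current and f.min_key > current_max:
            sections.append(current)
            current = []
        current_max = f.max_key if not current else max(current_max, f.max_key)
        current.append(f)
    if current:
        sections.append(current)
    return sections


def bin_pack(sections: List[List[DataFile]], target_size: int) -> List[List[DataFile]]:
    """Ordered next-fit packing: close a split once adding a section would exceed the target."""
    splits: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for section in sections:
        section_size = sum(f.size for f in section)
        if current and current_size + section_size > target_size:
            splits.append(current)
            current, current_size = [], 0
        # A section is never broken up, so a split may exceed the target.
        current.extend(section)
        current_size += section_size
    if current:
        splits.append(current)
    return splits


# Key ranges from the issue; sizes are assumptions chosen for illustration.
files = [
    DataFile(1, 2, 40 * MB),
    DataFile(3, 4, 50 * MB),
    DataFile(5, 180, 70 * MB),
    DataFile(5, 190, 60 * MB),
    DataFile(200, 600, 80 * MB),
    DataFile(210, 700, 90 * MB),
]

splits = bin_pack(group_overlapping(files), target_size=128 * MB)
for i, split in enumerate(splits, start=1):
    print(f"split{i}:", [(f.min_key, f.max_key) for f in split])
```

With these assumed sizes the sketch yields the same three splits as listed
above. Note that in this sketch intersecting files are never separated, so a
split holding one large group can end up bigger than the 128 MB target.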



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
