[ 
https://issues.apache.org/jira/browse/KUDU-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881546#comment-17881546
 ] 

ASF subversion and git services commented on KUDU-3606:
-------------------------------------------------------

Commit e5d11b2c9c09688ea1a260aa1c8f5a25e23d8b12 in kudu's branch 
refs/heads/master from Yingchun Lai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e5d11b2c9 ]

KUDU-3606 [tools] Show column ids when using 'kudu table describe'

Column id is an internal concept. This patch
adds a new flag --show_column_id to indicate
whether to show it when using 'kudu table describe'.

It's helpful for comparing whether two tables'
schemas are completely the same, taking the
internal column ids into account as well.

Change-Id: Ib9b28bd8f879bb6cace1683bc366c72772caebdc
Reviewed-on: http://gerrit.cloudera.org:8080/21717
Reviewed-by: Yifan Zhang <chinazhangyi...@163.com>
Reviewed-by: Wang Xixu <1450306...@qq.com>
Tested-by: Yingchun Lai <laiyingc...@apache.org>
Reviewed-by: Marton Greber <greber...@gmail.com>
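For illustration, the flag added by this commit could be used as below. The master addresses and table name are placeholders, not values from the commit:

```shell
# Describe a table, including the internal column ids introduced
# by the --show_column_id flag from this patch.
# <master-addresses> and <table-name> are placeholders.
kudu table describe <master-addresses> <table-name> --show_column_id
```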


> Copy historical data between different Kudu clusters though files
> -----------------------------------------------------------------
>
>                 Key: KUDU-3606
>                 URL: https://issues.apache.org/jira/browse/KUDU-3606
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tablet copy
>            Reporter: Yingchun Lai
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h1. Motivation
> There are use cases where we need to copy Kudu data from one cluster to 
> another.
> For example, we deployed a number of small-scale Kudu clusters that are 
> managed and used by separate applications. Cost and management become 
> difficult as the number of small clusters grows.
> An alternative is to consolidate the data and applications into a bigger 
> Kudu cluster: hardware resources can be shared and costs may be reduced, 
> application access can be unified, and security can be ensured by 
> fine-grained authorization.
> If a maintenance time window is allowed, we can use `kudu table copy` or 
> Impala/Spark SQL `insert into ... select * from ...` to copy data from one 
> cluster to another. But this would be very slow, prone to timeouts, and 
> would cause high CPU, disk, and network load if the data size is huge.
> h1. Solution
> We can draw on the experience of the tablet copy approach used inside a 
> Kudu cluster: the disk rowset blocks and WAL segments are read, 
> transferred, and written as files.
> Since there are no row-by-row read and write operations, the efficiency 
> would be much higher than `table copy` or similar approaches.
> h1. Limitation
>  * A maintenance time window is needed to stop write operations to the 
> source table.
>  * The destination table must have the same schema as the source table. 
> This can be achieved by using `kudu table copy ... -write_type=""`.
>  * The destination table must have a replication factor of 1, by 
> specifying -create_table_replication_factor=1. This is necessary to ensure 
> data consistency. We can change the replication factor by using `kudu table 
> set_replication_factor` after all data has been copied completely.
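Given the limitations above, the overall workflow might be sketched as follows. The cluster addresses and table name are placeholders, and step 2 is the new file-level copy functionality this issue proposes, not an existing command:

```shell
# 1. Create an empty destination table with the same schema and a single
#    replica; -write_type="" means no rows are actually written.
kudu table copy <src-masters> <table-name> <dst-masters> \
  -write_type="" -create_table_replication_factor=1

# 2. (Proposed by this issue) copy the disk rowset blocks and WAL segments
#    between clusters as files -- no row-by-row reads or writes.

# 3. Restore the desired replication factor once the copy is complete.
kudu table set_replication_factor <dst-masters> <table-name> 3
```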



--
This message was sent by Atlassian Jira
(v8.20.10#820010)