[ https://issues.apache.org/jira/browse/KUDU-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yingchun Lai updated KUDU-3606:
-------------------------------
    Description:
h1. Motivation
There are use cases where we need to copy Kudu data from one cluster to another.

For example, we have deployed several small-scale Kudu clusters that are managed and used separately by their applications. The cost and management burden becomes considerable if the number of small clusters grows quickly.

An alternative is to consolidate the data and applications into a larger Kudu cluster: hardware resources can be shared and costs reduced, application access can be unified, and security can be ensured by fine-grained authorization.

If a maintenance time window is allowed, we can use `kudu table copy` or Impala/Spark SQL `insert into ... select * from ...` to copy data from one cluster to another. But these approaches are very slow, prone to timeouts, and cause high CPU, disk, and network load when the data size is large.
h1. Solution
We can draw on the tablet copy mechanism used between replicas inside a Kudu cluster: the disk rowset blocks and WAL segments are read, transferred, and written as files. Since there are no row-by-row read and write operations, the efficiency would be much higher than `table copy` or similar approaches.

> Copy historical data between different Kudu clusters through files
> -----------------------------------------------------------------
>
>                 Key: KUDU-3606
>                 URL: https://issues.apache.org/jira/browse/KUDU-3606
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tablet copy
>            Reporter: Yingchun Lai
>            Priority: Minor

-- This message was sent by Atlassian Jira (v8.20.10#820010)
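For reference, the existing row-by-row copy paths that the description says are too slow can be invoked roughly as below. This is a sketch only: the master addresses, database, and table names are placeholders, and the exact flags may vary by Kudu version.

```shell
# Row-by-row table copy with the Kudu CLI (the slow path this issue
# wants to improve on). Addresses and table names are placeholders.
kudu table copy src-master-1:7051 my_table dst-master-1:7051 \
  -dst_table=my_table \
  -write_type=upsert

# Or via Impala SQL, assuming the destination table already exists:
#   INSERT INTO dst_db.my_table SELECT * FROM src_db.my_table;
```

Both paths scan and rewrite every row, which is why the proposed file-level transfer of rowset blocks and WAL segments should be much faster for large tables.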