Re: [APE] Support COPY TO

Ian Maxon Tue, 24 Oct 2023 11:50:52 -0700

 +1, this will be great. Long ago we had WRITE TO but it was very limited
and troublesome. This is a great improvement over that older facility.


On Oct 24, 2023 at 09:53:40, Wail Alkowaileet <[email protected]> wrote:

> Currently, AsterixDB does not have a clean way to extract query results or
> dump a dataset to a storage device. The only channel provided currently is
> the Query Service (i.e., running the query and writing it somehow at the
> client side). We need to support a way to write query results (or dump a
> dataset) in parallel to a storage device.
>
> To illustrate, say we want to do the following:
>
> USE CopyToDataverse;
>
>
> COPY ColumnDataset
>
> TO localfs
>
> PATH("localhost:///media/backup/CopyToResult")
>
> WITH {
>
>     "format" : "json"
>
> };
>
>
> In this example, the data in ColumnDataset will be written in each node at
> the provided path localhost:///media/backup/CopyToResult. Simply, each node
> will write its own partitions for the data stored in ColumnDataset locally.
> The written files will be in raw JSON format.
>
> Another example:
>
> USE CopyToDataverse;
>
> COPY (SELECT cd.uid uid,
>
>                             cd.sensor_info.name name,
>
>                             to_bigint(cd.sensor_info.battery_status)
>
> battery_status
>
>              FROM ColumnDataset cd
>
> ) toWrite
>
> TO s3
>
> PATH("CopyToResult/" || to_string(b))
>
> OVER (
>
>    PARTITION BY toWrite.battery_status b
>
>    ORDER BY toWrite.name
>
> )
>
> WITH {
>
>     "format" : "json",
>
>     "compression": "gzip",
>
>     "max-objects-per-file": 100,
>
>     "container": "myBucket",
>
>     "accessKeyId": "<access-key>",
>
>     "secretAccessKey": "<secret-key>",
>
>     "region": "us-west-2"
>
> };
>
>
> The second example shows how to write the result of a query and also
> partition the result so that each partition will be written to a certain
> path. In this example, we partition by the battery_status (say an integer
> value from 0 to 100). The final result will be written to myBucke in Amazon
> S3.
>
> Each partition will have the path CopyToResult/<battery_status>. For
> example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99,
> CopyToResult/100). This partitioning scheme can be useful if a user wants
> to exploit dynamic prefixes (external filters) (see ASTERIXDB-3073
> <
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/ASTERIXDB-3073__;!!CzAuKJ42GuquVTTmVmPViYEvSg!JX_fnvdEqKOBNcNltYJtuNHqGGHc4fyOkgw7JbMQu97xtsGlYfO-FZYaApn8kneml74lriyZ0Vdm4QA$
> >).
>
> Additionally, the records in each partition will be ordered by the
> sensor_name (toWrite.name). Note that this ordering isn't global but per
> partition.
>
> Also, the written files will be compressed using *gzip* and each file
> should have at most 100 records max (*max-objects-per-file*).
>
> EPIC: ASTERIXDB-3286 <
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/ASTERIXDB-3286__;!!CzAuKJ42GuquVTTmVmPViYEvSg!JX_fnvdEqKOBNcNltYJtuNHqGGHc4fyOkgw7JbMQu97xtsGlYfO-FZYaApn8kneml74lriyZQEc5iKM$
> >
> --
>
> *Regards,*
> Wail Alkowaileet
>

Re: [APE] Support COPY TO

Reply via email to