Re: [APE] Support COPY TO

Mike Carey Wed, 01 Nov 2023 20:57:07 -0700

Excellent!

On Wed, Nov 1, 2023 at 2:27 PM Wail Alkowaileet <wael....@gmail.com> wrote:


> AsterixDB now supports COPY TO.
>
> On Thu, Oct 26, 2023 at 10:05 AM Till Westmann <ti...@apache.org> wrote:
>
> > Agreed (+1) - this functionality will significantly reduce the pain of
> > moving data in and out of object storage.
> >
> > > On Oct 25, 2023, at 10:04 PM, Mike Carey <dtab...@gmail.com> wrote:
> > >
> > > +1  -- Now maybe users will stop trying to retrieve huge results and
> > wondering why the UI is choking! :-) This capability is actually long
> > overdue.
> > >
> > > On 10/24/23 9:53 AM, Wail Alkowaileet wrote:
> > >> Currently, AsterixDB does not have a clean way to extract query
> results
> > or
> > >> dump a dataset to a storage device. The only channel provided
> currently
> > is
> > >> the Query Service (i.e., running the query and writing it somehow at
> the
> > >> client side). We need to support a way to write query results (or
> dump a
> > >> dataset) in parallel to a storage device.
> > >>
> > >> To illustrate, say we want to do the following:
> > >>
> > >>> USE CopyToDataverse;
> > >> COPY ColumnDataset
> > >>> TO localfs
> > >>> PATH("localhost:///media/backup/CopyToResult")
> > >>> WITH {
> > >>>     "format" : "json"
> > >>> };
> > >> In this example, the data in ColumnDataset will be written in each
> node
> > at
> > >> the provided path localhost:///media/backup/CopyToResult. Simply, each
> > node
> > >> will write its own partitions for the data stored in ColumnDataset
> > locally.
> > >> The written files will be in raw JSON format.
> > >>
> > >> Another example:
> > >>
> > >>> USE CopyToDataverse;
> > >>> COPY (SELECT cd.uid uid,
> > >>>                             cd.sensor_info.name name,
> > >>>                             to_bigint(cd.sensor_info.battery_status)
> > >>> battery_status
> > >>>              FROM ColumnDataset cd
> > >>> ) toWrite
> > >>> TO s3
> > >>> PATH("CopyToResult/" || to_string(b))
> > >>> OVER (
> > >>>    PARTITION BY toWrite.battery_status b
> > >>>    ORDER BY toWrite.name
> > >>> )
> > >>> WITH {
> > >>>     "format" : "json",
> > >>>     "compression": "gzip",
> > >>>     "max-objects-per-file": 100,
> > >>>     "container": "myBucket",
> > >>>     "accessKeyId": "<access-key>",
> > >>>     "secretAccessKey": "<secret-key>",
> > >>>     "region": "us-west-2"
> > >>> };
> > >> The second example shows how to write the result of a query and also
> > >> partition the result so that each partition will be written to a
> certain
> > >> path. In this example, we partition by the battery_status (say an
> > integer
> > >> value from 0 to 100). The final result will be written to myBucke in
> > Amazon
> > >> S3.
> > >>
> > >> Each partition will have the path CopyToResult/<battery_status>. For
> > >> example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99,
> > >> CopyToResult/100). This partitioning scheme can be useful if a user
> > wants
> > >> to exploit dynamic prefixes (external filters) (see ASTERIXDB-3073
> > >> <https://issues.apache.org/jira/browse/ASTERIXDB-3073>).
> > >>
> > >> Additionally, the records in each partition will be ordered by the
> > >> sensor_name (toWrite.name). Note that this ordering isn't global but
> per
> > >> partition.
> > >>
> > >> Also, the written files will be compressed using *gzip* and each file
> > >> should have at most 100 records max (*max-objects-per-file*).
> > >>
> > >> EPIC: ASTERIXDB-3286<
> > https://issues.apache.org/jira/browse/ASTERIXDB-3286>
> >
> >
>
> --
>
> *Regards,*
> Wail Alkowaileet
>

Re: [APE] Support COPY TO

Reply via email to