Hi Daniel,

I'm not quite sure about this, but does the Glue Data Catalog support bucketing yet?
You might want to find that out first.


Regards,
Gourav

On Sat, Jun 15, 2019 at 1:30 PM Daniel Mateus Pires <dmate...@gmail.com>
wrote:

> Hi there!
>
> I am trying to optimize joins on data created by Spark, so I'd like to
> bucket the data to avoid shuffling.
>
> I write to immutable partitions every day by writing the data to a local
> HDFS and then copying it to S3. Is there a combination of bucketBy
> options and DDL that I can use so that Presto/Athena JOINs leverage the
> special layout of the data?
>
> e.g.
> CREATE EXTERNAL TABLE ...(on Presto/Athena)
> df.write.bucketBy(...).partitionBy(...)... (in Spark)
> then copy this data to S3 with s3-dist-cp
> then MSCK REPAIR TABLE (on Presto/Athena)
>
> Daniel
>
>
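
As a rough illustration of why bucketing can let a join skip the shuffle, here is a toy sketch in plain Python. It is not Spark, Presto or Athena code: the `bucket_of` helper, the CRC32 hash, the bucket count and the sample rows are all assumptions made for illustration. The point is only that when both sides are bucketed on the join key with the same hash function and the same number of buckets, bucket i on one side needs to be compared only with bucket i on the other.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; must be identical for both tables


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Toy bucket assignment: stable hash of the key modulo the bucket count.

    CRC32 is an assumption for this sketch, not the hash any real engine uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows by bucket id, similar to one file per bucket."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left_rows, right_rows, num_buckets: int = NUM_BUCKETS):
    """Inner join that pairs bucket i of the left side only with bucket i of the right."""
    left = bucketize(left_rows, num_buckets)
    right = bucketize(right_rows, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a per-bucket lookup of the right side; no other bucket is touched.
        lookup = defaultdict(list)
        for key, value in right.get(b, []):
            lookup[key].append(value)
        for key, value in left.get(b, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined
```

Calling `bucketed_join(users, orders)` on two lists of `(key, value)` pairs returns the same rows as an ordinary inner join on the key. If the two sides were bucketed with different hash functions or different bucket counts, matching keys could land in different buckets and this shortcut would no longer be valid, which is why the writer and the query engine have to agree on the bucketing scheme.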
