Hi Daniel,

I'm not sure about this, but does the AWS Glue Data Catalog support bucketing yet? It would be worth checking that first.

Regards,
Gourav

On Sat, Jun 15, 2019 at 1:30 PM Daniel Mateus Pires <dmate...@gmail.com> wrote:

> Hi there!
>
> I am trying to optimize joins on data created by Spark, so I'd like to
> bucket the data to avoid shuffling.
>
> I am writing to immutable partitions every day by writing data to a local
> HDFS and then copying this data to S3, is there a combination of bucketBy
> options and DDL that I can use so that Presto/Athena JOINs leverage the
> special layout of the data?
>
> e.g.
> CREATE EXTERNAL TABLE ...(on Presto/Athena)
> df.write.bucketBy(...).partitionBy(...). (in spark)
> then copy this data to S3 with s3-dist-cp
> then MSCK REPAIR TABLE (on Presto/Athena)
>
> Daniel
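
To make the bucketing idea in the question concrete, here is a minimal pure-Python sketch of a bucket-wise join. It is not Spark, Presto, or Athena code: `bucket_id`, `write_bucketed`, `bucketed_join`, the bucket count, and the sample rows are all hypothetical, and CRC32 is only a stand-in for whatever hash a real engine uses. It shows that rows can be joined bucket by bucket, with no shuffle, only when both sides were bucketed with the same hash function and the same number of buckets.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, must match on both tables


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in hash (CRC32); NOT the hash Spark or Hive/Presto actually use.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    # Group (key, value) rows by bucket, as a bucketed writer would
    # group them into one file per bucket.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right):
    # Join bucket i of the left side only against bucket i of the right side.
    # This is valid only because both sides share hash function and count.
    out = []
    for b in left.keys() & right.keys():
        lookup = defaultdict(list)
        for k, v in right[b]:
            lookup[k].append(v)
        for k, v in left[b]:
            for rv in lookup.get(k, []):
                out.append((k, v, rv))
    return sorted(out)


orders = [("u1", "order-a"), ("u2", "order-b"), ("u3", "order-c")]
users = [("u1", "Ann"), ("u2", "Bob"), ("u4", "Dee")]
joined = bucketed_join(write_bucketed(orders), write_bucketed(users))
print(joined)
```

If the reader assigned keys to buckets with a different hash or bucket count than the writer, matching keys could sit in different buckets and the bucket-wise join would silently miss them. That is why it matters whether the engine querying the data buckets keys the same way Spark wrote them.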