Yes, I have identified the rename as the problem, which is why I think the extra bandwidth of the larger instances might not help. There is also a consistency issue with S3 because of how the rename works, so I could even lose data.
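For reference, a minimal sketch of one common mitigation, assuming the default Hadoop FileOutputCommitter is in use; the app name, DataFrame and bucket path below are placeholders. Algorithm version 2 moves task output into place at task commit instead of doing a second rename pass at job commit, which cuts down the rename work, though on s3a each rename is still a copy under the hood and this does not address the consistency issue:

    from pyspark.sql import SparkSession

    # Sketch: enable FileOutputCommitter algorithm version 2 so task output is
    # committed directly, skipping the extra rename pass at job commit.
    spark = (
        SparkSession.builder
        .appName("parquet-to-s3")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )

    # Placeholder DataFrame standing in for the real 12-15 GB result.
    df = spark.range(1000)
    df.write.mode("overwrite").parquet("s3a://bucket/parquet")
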
On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
> How many files do you produce? I believe it spends a lot of time on renaming
> the files because of the output committer.
> Also, instead of 5x c3.2xlarge, try using 2x c3.8xlarge because they
> have 10GbE and you can get good throughput for S3.
>
> On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <
> alexander.cz...@googlemail.com> wrote:
>
>> I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
>> parquet files to S3. But the S3 performance is bad for various reasons when
>> I access S3 through the parquet write method:
>>
>> df.write.parquet('s3a://bucket/parquet')
>>
>> Now I want to set up a small cache for the parquet output. One output is
>> about 12-15 GB in size. Would it be enough to set up an NFS directory on
>> the master, write the output to it and then move it to S3? Or should I set
>> up HDFS on the master? Or should I even opt for an additional cluster
>> running an HDFS solution on more than one node?
>>
>> thanks!
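
A minimal sketch of the staging idea from the original question, assuming HDFS is running on the cluster and hadoop distcp is available for the final copy; all paths and the bucket name are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stage-to-hdfs").getOrCreate()

    # Placeholder DataFrame standing in for the 12-15 GB parquet output.
    df = spark.range(1000)

    # Write to HDFS first, so the output committer's renames happen on HDFS,
    # where a rename is a cheap metadata operation rather than a copy.
    df.write.mode("overwrite").parquet("hdfs:///staging/parquet")

    # Then move the finished output to S3 in a single pass, for example from
    # the master node (shell command, not PySpark):
    #   hadoop distcp hdfs:///staging/parquet s3a://bucket/parquet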