On 29 Sep 2017, at 15:59, Alexander Czech <alexander.cz...@googlemail.com> wrote:

Yes, I have identified the rename as the problem, which is why I think the extra
bandwidth of the larger instances might not help. There is also a consistency
issue with S3 because of how the rename works, so I could probably lose data.

Correct.

rename() is mimicked with a COPY + DELETE; the copy happens inside S3, and your
copy bandwidth appears to be 6-10 MB/s.
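
If you're on Hadoop 2.7 or later, one partial mitigation is the v2 file output
committer algorithm, which skips the final job-level rename pass (it does not
remove the per-task copy, and it is not a fix for S3's consistency issues on its
own). A minimal sketch, assuming PySpark with placeholder paths and bucket name:

    from pyspark.sql import SparkSession

    # v2 committer: tasks commit straight into the destination directory,
    # so the job-commit rename of the whole output tree is avoided.
    spark = (SparkSession.builder
             .appName("parquet-to-s3")
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())

    df = spark.read.parquet("hdfs:///input")      # placeholder input
    df.write.parquet("s3a://bucket/parquet")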


On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
How many files do you produce? I believe it spends a lot of time renaming the
files because of the output committer.
Also, instead of 5x c3.2xlarge, try 2x c3.8xlarge, because they have 10GbE
networking and you can get good throughput to S3.
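
If the job produces many small files, coalescing before the write cuts down the
number of objects that have to be copied and deleted at commit time. A hedged
sketch, assuming your DataFrame is df as in your snippet (16 partitions is just
an illustration):

    # Fewer output files -> fewer COPY + DELETE operations during the S3 commit.
    df.coalesce(16).write.parquet('s3a://bucket/parquet')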

On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet
files to S3. But for various reasons the S3 performance is poor when I write
through the parquet write method:

df.write.parquet('s3a://bucket/parquet')

Now I want to set up a small cache for the parquet output. One output is about
12-15 GB in size. Would it be enough to set up an NFS directory on the master,
write the output to it, and then move it to S3? Or should I set up HDFS on the
master? Or should I even opt for an additional cluster running an HDFS solution
on more than one node?
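
For illustration, what I have in mind is roughly this (hdfs://master:9000 and the
temporary path are placeholders, and I'm assuming the hadoop CLI is available on
the driver node):

    import subprocess

    # Write to HDFS first, where the committer's rename is a cheap metadata operation.
    df.write.parquet('hdfs://master:9000/tmp/parquet_out')

    # Then bulk-copy the finished output to S3 in one pass.
    subprocess.run(
        ['hadoop', 'distcp',
         'hdfs://master:9000/tmp/parquet_out',
         's3a://bucket/parquet'],
        check=True)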

thanks!


