Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage to process approx. 1 TB of strings inside gzipped Parquet in about 50 minutes on a 20-node cluster (8 cores, 60 GB RAM). That's about 17 MB/sec per node. The processing is very basic: simple field extraction from the strings and a groupBy.
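Roughly what the job looks like (simplified; the bucket paths, the "raw" column name and the extraction regex are made up, the real ones don't matter):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.functions.{col, regexp_extract}

  val sc = new SparkContext(new SparkConf().setAppName("s3n-parquet-groupby"))
  val sqlContext = new SQLContext(sc)

  // ~1 TB of gzip-compressed Parquet, essentially one big string column
  val df = sqlContext.read.parquet("s3n://my-bucket/input/")

  // pull a key out of each string and aggregate on it
  val counts = df
    .select(regexp_extract(col("raw"), "^(\\S+)", 1).as("key"))
    .groupBy("key")
    .count()

  // action just to materialise the job; the actual output step doesn't matter here
  counts.write.parquet("s3n://my-bucket/output/")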
This seems suboptimal. Watching Aaron's talk from the Spark EU Summit (https://youtu.be/GzG9RTRTFck?t=863), it seems I am hitting the same suboptimal S3 throughput issues he mentions there.

I tried different numbers of files for the input data set (more smaller files vs. fewer larger files), combined with various settings for fs.s3n.block.size, thinking that might help if each mapper streams larger chunks. It didn't! It actually seems that many small files give better performance than fewer larger ones (of course with an oversubscribed number of tasks/threads). Similar to what Aaron mentions with oversubscribed tasks/threads, we also become CPU bound (reach 100% CPU utilisation).

Has anyone seen similar behaviour? How can we optimise this? Are the improvements mentioned in Aaron's talk now part of the S3N or S3A driver, or are they only available on Databricks Cloud? How can we benefit from those improvements?

Thanks,
Martin

P.S. Have not tried S3A.
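P.P.S. For reference, the tuning I mention above looks roughly like this (values are purely illustrative, not a recommendation):

  // sc and df as in the sketch above

  // larger S3N block size, set before the read, hoping each mapper
  // would stream bigger chunks from S3 (it didn't help)
  sc.hadoopConfiguration.setLong("fs.s3n.block.size", 128L * 1024 * 1024)

  // oversubscribe tasks vs. the 8 physical cores, either via
  // --conf spark.executor.cores=16 at submit time, or by repartitioning
  // to many more partitions than total cores before the groupBy
  val repartitioned = df.repartition(2000)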