Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I
manage to process approximately 1 TB of strings inside gzipped Parquet in
about 50 minutes on a 20-node cluster (8 cores, 60 GB RAM per node). That's
about 17 MB/sec per node.
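
(Back-of-envelope: 1 TB / 3,000 s / 20 nodes ≈ 17 MB/sec per node.)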

This seems suboptimal.

The processing is very basic: simple field extraction from the strings and
a groupBy.
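
For context, the job is roughly of this shape (Scala, Spark 1.6; the bucket
paths, column name and regex are placeholders, not the real schema):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    // sc is the SparkContext provided by spark-shell / spark-submit
    val sqlContext = new SQLContext(sc)

    // read the gzipped parquet straight off S3 via the s3n:// scheme
    val df = sqlContext.read.parquet("s3n://some-bucket/some/path/")

    // simple field extraction from the string column, then a groupBy + count
    val result = df
      .select(regexp_extract(col("rawLine"), "^(\\S+)", 1).as("key"))
      .groupBy("key")
      .count()

    result.write.parquet("s3n://some-bucket/some/output/")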

Watching Aaron's talk from the Spark EU Summit:
https://youtu.be/GzG9RTRTFck?t=863

it seems I am hitting the same suboptimal S3 throughput issues he mentions
there.

I tried different numbers of files for the input data set (more smaller
files vs. fewer larger files) combined with various settings
for fs.s3n.block.size, thinking that might help if each mapper streams
larger chunks. It didn't! It actually seems that many small files give
better performance than fewer larger ones (of course with an oversubscribed
number of tasks/threads).
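
For concreteness, the block-size experiments were along these lines (the
values are just examples I cycled through):

    // set on the Hadoop configuration before reading
    sc.hadoopConfiguration.set("fs.s3n.block.size", (512L * 1024 * 1024).toString)  // 512 MB

    // or equivalently at submit time via Spark's spark.hadoop.* passthrough:
    // spark-submit --conf spark.hadoop.fs.s3n.block.size=536870912 ...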

Similar to what Aaron mentions with oversubscribed tasks/threads, we also
become CPU bound (reaching 100% CPU utilisation).
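
By "oversubscribed" I mean something along these lines (illustrative
numbers for the 8-core nodes, not a tuned recipe):

    // 1) ask YARN for more task slots than physical cores at submit time, e.g.
    //    spark-submit --num-executors 20 --executor-cores 16 --executor-memory 40g ...
    //
    // 2) and/or bump the partition count so several tasks queue per core:
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val oversubscribed = sqlContext.read
      .parquet("s3n://some-bucket/some/path/")
      .repartition(20 * 8 * 4)   // ~4 partitions per physical core across the cluster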


Has anyone seen similar behaviour? How can we optimise this?

Are the improvements mentioned in Aaron's talk now part of the S3n or S3a
driver, or are they only available on Databricks Cloud? How can we benefit
from those improvements?

Thanks,
Martin

P.S. Have not tried S3a.
