Hi All,

We have a Spark cluster on AWS EC2 consisting of 60 i3.4xlarge instances.

The Spark job running on that cluster reads from an S3 bucket and writes back to
the same bucket.

The bucket and the EC2 instances are in the same region.

As part of our efforts to reduce the runtime of our Spark jobs, we found
serious latency when reading from S3.

When the job:

   - reads the Parquet files from S3 and also writes to S3, it takes 22 min
   - reads the Parquet files from S3 and writes to its local HDFS, it takes
   about the same amount of time (±22 min)
   - reads the Parquet files from its local HDFS (after they were copied
   there from S3) and writes to its local HDFS, it takes 7 min

The Spark job has the following S3-related configuration:

   - spark.hadoop.fs.s3a.connection.establish.timeout=5000
   - spark.hadoop.fs.s3a.connection.maximum=200

When reading from S3 we tried increasing the
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 and then
900, but it didn't reduce the S3 latency.
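For context, here is a sketch of the kind of submit command we could use to experiment with additional s3a read-path settings (the jar name is a placeholder, and the fadvise/readahead/thread values below are untested starting points, not known fixes; fs.s3a.experimental.input.fadvise requires hadoop-aws 2.8+):

```shell
# Hypothetical spark-submit fragment, not our exact command.
# - fadvise=random targets columnar formats like Parquet, which seek
#   within objects rather than streaming them end to end
# - readahead.range and threads.max values are guesses to tune from
spark-submit \
  --conf spark.hadoop.fs.s3a.connection.establish.timeout=5000 \
  --conf spark.hadoop.fs.s3a.connection.maximum=200 \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.readahead.range=1M \
  --conf spark.hadoop.fs.s3a.threads.max=64 \
  our-job.jar
```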

Do you have any idea what might be causing the read latency from S3?

I saw this post
<https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
about improving transfer speed. Is anything there relevant to our case?


Thanks,
Tzahi
