Oh, Tzahi, I misread the metrics in the first reply. It's indeed about reads, 
not writes.

From: Tzahi File <tzahi.f...@ironsrc.com>
Sent: Wednesday, 7 April 2021 16:02
To: Hariharan <hariharan...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark performance over S3

Hi Hariharan,

Thanks for your reply.

In both cases we are writing the data to S3. The difference is that in the 
first case we read the data from S3 and in the second we read from HDFS.
We are using the ListObjectsV2 API in 
S3A<https://issues.apache.org/jira/browse/HADOOP-13421>.

The S3 bucket and the cluster are located in the same AWS region.



On Wed, Apr 7, 2021 at 2:12 PM Hariharan 
<hariharan...@gmail.com<mailto:hariharan...@gmail.com>> wrote:
Hi Tzahi,

Comparing the first two cases:

  *   > reads the parquet files from S3 and also writes to S3, it takes 22 min
  *   > reads the parquet files from S3 and writes to its local hdfs, it takes 
the same amount of time (±22 min)

It looks like most of the time is spent reading, and the time spent writing is 
likely negligible (perhaps you're not writing much output?).

Can you clarify the difference between these two?

> reads the parquet files from S3 and writes to its local hdfs, it takes the 
> same amount of time (±22 min)?
> reads the parquet files from S3 (they were copied into the hdfs before) and 
> writes to its local hdfs, the job took 7 min

In the second case, was the data read from HDFS or S3?

Regarding the points from the post you linked to:
1. Enhanced networking does make a 
difference<https://laptrinhx.com/hadoop-with-enhanced-networking-on-aws-1893465489/>,
 but it should be enabled automatically if you're using a compatible instance 
type and an AWS AMI. However, if you're using a custom AMI, you might want to 
check that it's enabled for you.
2. VPC endpoints can also make a difference in performance - at least that used 
to be the case a few years ago. Maybe that has changed now.

A couple of other things you might want to check:
1. If your bucket is versioned, you may want to check if you're using the 
ListObjectsV2 API in S3A<https://issues.apache.org/jira/browse/HADOOP-13421>.
2. Also check these recommendations from 
Cloudera<https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
 for optimal use of S3A.
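
If it helps, here's a sketch of how those S3A options could be passed via 
spark-submit. The values are illustrative starting points, not tuned 
recommendations, and the job name is hypothetical: `fs.s3a.list.version=2` 
selects ListObjectsV2 (per HADOOP-13421), and 
`fs.s3a.experimental.input.fadvise=random` is often suggested for seek-heavy 
columnar reads like Parquet.

```shell
# Illustrative S3A read-tuning flags (values are examples, not recommendations).
# fs.s3a.list.version=2        -> use ListObjectsV2 for listings (HADOOP-13421)
# input.fadvise=random         -> avoids aborting/reopening streams on Parquet seeks
# fs.s3a.readahead.range       -> bytes read ahead on each seek (default is 64K)
spark-submit \
  --conf spark.hadoop.fs.s3a.list.version=2 \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.readahead.range=1M \
  your_job.py
```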

Thanks,
Hariharan


On Wed, Apr 7, 2021 at 12:15 AM Tzahi File 
<tzahi.f...@ironsrc.com<mailto:tzahi.f...@ironsrc.com>> wrote:

Hi All,

We have a Spark cluster on AWS EC2 with 60 i3.4xlarge instances.

The Spark job running on that cluster reads from an S3 bucket and writes to 
that bucket.

The bucket and the EC2 instances are in the same region.

As part of our efforts to reduce the runtime of our Spark jobs, we found 
there's serious latency when reading from S3.

When the job:
·         reads the parquet files from S3 and also writes to S3, it takes 22 min
·         reads the parquet files from S3 and writes to its local hdfs, it 
takes the same amount of time (±22 min)
·         reads the parquet files from S3 (they were copied into the hdfs 
before) and writes to its local hdfs, the job took 7 min

The Spark job has the following S3-related configuration:
·         spark.hadoop.fs.s3a.connection.establish.timeout=5000
·         spark.hadoop.fs.s3a.connection.maximum=200

When reading from S3, we tried increasing the 
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 and then 
900, but it didn't reduce the S3 latency.
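
For reference, this is roughly how we pass these settings (sketch only; the 
job script name is a placeholder):

```shell
# The S3-related settings listed above, passed on spark-submit (illustrative).
spark-submit \
  --conf spark.hadoop.fs.s3a.connection.establish.timeout=5000 \
  --conf spark.hadoop.fs.s3a.connection.maximum=200 \
  job.py
```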

Do you have any idea what could cause the read latency from S3?

I saw this 
post<https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
 about improving transfer speed - is anything there relevant?


Thanks,
Tzahi


--
Tzahi File
Data Engineers Team Lead