Hi everyone,
We are migrating our ETL tasks from Spark 3.2.1 (Java 11) to Spark 3.5.2 (Java
17).
One of these applications, which works fine on 3.2.1, completely kills our
cluster on 3.5.2.
The clusters consist of five 256GB workers and a 256GB master.
The task is run with "--executor-memory 200G" and completes in about 15
minutes on 3.2.1.
However, when I run with "--executor-memory 200G" on 3.5.2, the workers all
eventually die, apparently because they are unable to allocate more shared
memory (as far as I can tell, since they have to be rebooted). I then tried
with "--executor-memory 100G". This chugs along for about half an hour and then
runs out of disk space for shared memory (/tmp/ has about 125GB).
The 3.2.1 Physical Plan is 11268 lines.
The 3.5.2 Physical Plan is 12923 lines.
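In case it helps with comparing the two plans, here is a minimal sketch of how the plan dumps could be summarised by operator count rather than by line count. It assumes each physical plan has been saved to a plain-text file; the class name, the file arguments, and the choice of operators to count are only illustrative, and the matching is a simple substring test on each line.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares two saved physical plan dumps by operator count.
class PlanOperatorCount {
    // Operator names as they appear in Spark physical plan text; extend as needed.
    // Matching is by substring, so "Exchange" also matches e.g. "BroadcastExchange".
    static final List<String> OPERATORS = List.of(
            "SortMergeJoin", "BroadcastHashJoin", "Exchange",
            "HashAggregate", "InMemoryTableScan");

    // For each operator, counts the plan lines that mention it.
    static Map<String, Long> count(List<String> planLines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String op : OPERATORS) {
            counts.put(op, planLines.stream().filter(line -> line.contains(op)).count());
        }
        return counts;
    }

    // Per-operator difference: count in the newer plan minus count in the older plan.
    static Map<String, Long> diff(List<String> older, List<String> newer) {
        Map<String, Long> oldCounts = count(older);
        Map<String, Long> newCounts = count(newer);
        Map<String, Long> delta = new LinkedHashMap<>();
        for (String op : OPERATORS) {
            delta.put(op, newCounts.get(op) - oldCounts.get(op));
        }
        return delta;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: PlanOperatorCount <older-plan.txt> <newer-plan.txt>");
            return;
        }
        List<String> older = Files.readAllLines(Path.of(args[0]));
        List<String> newer = Files.readAllLines(Path.of(args[1]));
        diff(older, newer).forEach((op, d) -> System.out.println(op + ": " + d));
    }
}
```

A large swing in one operator (for example, broadcast joins turning into sort-merge joins) would be a more useful lead than the raw difference in line count.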
All the consumed data consists of parquet files that live on S3 and are
accessed using the s3a protocol configured as:
    spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
    # Enables the hadoop s3a committer
    spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
    spark.hadoop.fs.s3a.threads.max 40
    spark.hadoop.fs.s3a.connection.maximum 40
The query itself is basically:
    final var catalogParts = partsSelectorA.selectParts()
        .union(partsSelectorB.selectParts())
        .union(partsSelectorC.selectParts())
        .union(partsSelectorD.selectParts())
        .distinct()
        .persist();
This is followed by some further "lightweight" unions that can be ignored, as I
have tried excluding them with no effect.
Each "selectParts()" method is a select statement on a huge table (~156M rows)
combined with half a dozen or more left joins with large (~3M rows) tables.
I'm considering trying the 3.5.3 RC, which resolves some left join issues.
Any ideas?
I can share more details privately if that can help.
Regards,
Steve Coy