Hi everyone,

We are migrating our ETL tasks from Spark 3.2.1 (Java 11) to Spark 3.5.2 (Java 17).
One of these applications, which works fine on 3.2.1, completely kills our cluster on 3.5.2.

The cluster consists of five 256GB workers and a 256GB master.

The task is run with "--executor-memory 200G" and completes in about 15 minutes on 3.2.1.

However, when I run it with "--executor-memory 200G" on 3.5.2, the workers all die eventually because the worker is unable to allocate more shared memory (as far as I can tell, because they have to be rebooted).

I then tried with "--executor-memory 100G". This chugs along for about half an hour and then runs out of disk space for shared memory (/tmp/ has about 125GB).

The 3.2.1 Physical Plan is 11268 lines.
The 3.5.2 Physical Plan is 12923 lines.

All the consumed data consists of parquet files that live on S3 and are accessed using the s3a protocol, configured as:

    spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
    # Enables the hadoop s3a committer
    spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
    spark.hadoop.fs.s3a.threads.max 40
    spark.hadoop.fs.s3a.connection.maximum 40

The query itself is basically:

    final var catalogParts = partsSelectorA.selectParts()
        .union(partsSelectorB.selectParts())
        .union(partsSelectorC.selectParts())
        .union(partsSelectorD.selectParts())
        .distinct()
        .persist();

This is followed by some further "lightweight" unions that can be ignored, as I have tried excluding them with no effect.

Each "selectParts()" method is a select statement on a huge table (~156M rows) combined with half a dozen or more left joins with large (~3M rows) tables.

I'm considering trying the 3.5.3RC, which resolves some left join issues.

Any ideas? I can share more details privately if that can help.

Regards,
Steve Coy
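
The physical plan sizes quoted above (11268 lines on 3.2.1 versus 12923 lines on 3.5.2) say little on their own about what changed. One way to narrow it down would be to count how often each physical operator appears in the two plan dumps. The following is only a minimal sketch under assumptions: it presumes both plans have been saved as plain text files, the class name PlanOperatorCounts is made up for illustration, and the operator shortlist is a guess at what is worth looking at, not a complete list.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts selected operator names in a saved physical plan.
public class PlanOperatorCounts {

    // Assumed shortlist of operators to look for; extend as needed.
    static final List<String> OPERATORS = List.of(
            "SortMergeJoin", "BroadcastHashJoin", "ShuffledHashJoin",
            "Exchange", "Sort", "HashAggregate");

    // Counts whole-word occurrences of each operator name, so that "Sort"
    // does not also match "SortMergeJoin" and "Exchange" does not also
    // match "BroadcastExchange".
    static Map<String, Integer> countOperators(String planText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String op : OPERATORS) {
            Matcher m = Pattern.compile("\\b" + Pattern.quote(op) + "\\b").matcher(planText);
            int n = 0;
            while (m.find()) {
                n++;
            }
            counts.put(op, n);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PlanOperatorCounts <old-plan.txt> <new-plan.txt>");
            return;
        }
        Map<String, Integer> oldCounts = countOperators(Files.readString(Path.of(args[0])));
        Map<String, Integer> newCounts = countOperators(Files.readString(Path.of(args[1])));
        // Print the two counts side by side, one operator per row.
        for (String op : OPERATORS) {
            System.out.printf("%-20s %8d %8d%n", op, oldCounts.get(op), newCounts.get(op));
        }
    }
}
```

A large shift between join types or a jump in the number of Exchange nodes would at least point at which part of the plan to inspect more closely.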