spark.sql.shuffle.partitions might be a place to start. When the parquet is read back in, how does the number of partitions compare to spark.sql.shuffle.partitions? Is it much higher?
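For reference, a minimal sketch of how you could compare the two (the HDFS path here is a placeholder, not from your pipeline):

import org.apache.spark.sql.SparkSession

// Sketch only: compare the partition count of the parquet read
// against the configured shuffle partition count.
val spark = SparkSession.builder().appName("partition-check").getOrCreate()

val df = spark.read.parquet("hdfs:///path/to/output")  // hypothetical path
val readPartitions = df.rdd.getNumPartitions
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt
println(s"read partitions = $readPartitions, shuffle partitions = $shufflePartitions")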
On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, <liruijin...@gmail.com> wrote:
> Hi all,
>
> I have encountered a strange executor OOM error. I have a data pipeline
> using Spark 2.3 and Scala 2.11.12. This pipeline writes its output to one
> HDFS location as parquet, then reads the files back in and writes them to
> multiple Hadoop clusters (all co-located in the same datacenter). It
> should be a very simple task, but executors are being killed for
> exceeding container thresholds. From the logs, they are exceeding the
> given memory (we use Mesos as the cluster manager).
>
> The ETL process works perfectly fine with the given resources, doing
> joins and adding columns. The output is written successfully the first
> time. *Only when the pipeline at the end reads the output from HDFS and
> writes it to different HDFS cluster paths does it fail.* (It does a
> spark.read.parquet(source).write.parquet(dest).)
>
> This doesn't really make sense, and I'm wondering what configurations I
> should start looking at.
>
> --
> Cheers,
> Ruijing Li
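If the read does produce far more (or far fewer, hence larger) partitions than the write expects, one option is to set the partition count explicitly between the read and the write so each task handles a predictable amount of data. A sketch under that assumption; the paths and the partition count of 200 are placeholders you would tune to your data volume:

import org.apache.spark.sql.SparkSession

// Sketch only: repartition between the read and the write to bound
// per-task memory. Paths and the count are hypothetical.
val spark = SparkSession.builder().appName("copy-parquet").getOrCreate()

spark.read.parquet("hdfs://source-cluster/path")  // hypothetical source
  .repartition(200)                               // tune to your data volume
  .write
  .parquet("hdfs://dest-cluster/path")            // hypothetical destination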