spark.sql.shuffle.partitions might be a start.

Is there a difference between the number of partitions Spark creates when
the parquet is read back and spark.sql.shuffle.partitions? Is the read-side
partition count much higher than spark.sql.shuffle.partitions?
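
Something like this rough Scala sketch can surface that before the final
write (source/dest and the app name are placeholders, not your actual values):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-check").getOrCreate()
  val source = "hdfs:///path/to/etl/output"    // placeholder path
  val dest = "hdfs://other-cluster/some/path"  // placeholder path

  val df = spark.read.parquet(source)

  // Partitions created by the parquet scan.
  val readPartitions = df.rdd.getNumPartitions
  // Configured shuffle partition count (defaults to 200).
  val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt
  println(s"read: $readPartitions, shuffle: $shufflePartitions")

  // If the scan yields far more partitions than expected, coalescing
  // before the write cuts per-task overhead; use repartition(n) instead
  // if you need to increase parallelism.
  df.coalesce(shufflePartitions).write.parquet(dest)

(Worth noting: a plain read-then-write has no shuffle stage, so the read-side
count, driven mostly by file sizes and spark.sql.files.maxPartitionBytes, is
the one to watch.)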

On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, <liruijin...@gmail.com> wrote:

> Hi all,
>
> I have encountered a strange executor OOM error. I have a data pipeline
> using Spark 2.3 with Scala 2.11.12. This pipeline writes its output to one
> HDFS location as parquet, then reads the files back in and writes them to
> multiple Hadoop clusters (all co-located in the same datacenter). It should
> be a very simple task, but executors are being killed off for exceeding
> container thresholds. From the logs, they are exceeding the memory they
> were given (we use Mesos as the cluster manager).
>
> The ETL process works perfectly fine with the given resources, doing joins
> and adding columns. The output is written successfully the first time. *Only
> when the pipeline, at the end, reads that output back from HDFS and writes
> it to the other HDFS cluster paths does it fail.* (It does a
> spark.read.parquet(source).write.parquet(dest).)
>
> This doesn't really make sense and I'm wondering what configurations I
> should start looking at.
>
> --
> Cheers,
> Ruijing Li
>
