It is just a goal… however, I would not tune the number of regions or the region size yet.
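
Just to illustrate the arithmetic behind that goal (a rough sketch; the 100 GB heap is the example from your mail):

  // G1 picks a region size between 1 MB and 32 MB, aiming at roughly 2048 regions;
  // above a ~64 GB heap the 32 MB cap wins and you simply end up with more regions,
  // which G1 handles fine.
  val heapBytes   = 100L * 1024 * 1024 * 1024   // 100 GB executor heap (example from this thread)
  val regionSize  = 32L * 1024 * 1024           // 32 MB, the largest region size G1 chooses on its own
  val regionCount = heapBytes / regionSize      // = 3200 regions, above the 2048 goal but not a problem
  println(s"regions: $regionCount")

If you ever did want to pin the region size, it would be via -XX:G1HeapRegionSize, but as said, I would not start there.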

Simply specify the GC algorithm and the maximum heap size.

Try to tune other options only if there is a need, and only one at a time (otherwise it is difficult to attribute cause and effect), and have a performance testing framework in place so you can measure the differences.
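
For example, a minimal sketch of what "specify the GC algorithm and the max heap size" can look like (the values are placeholders, not recommendations):

  import org.apache.spark.sql.SparkSession

  // Only the GC algorithm and the executor heap size are set explicitly; everything else
  // stays at the defaults. Driver-side memory and JVM options normally go on the spark-submit
  // command line or in spark-defaults.conf, because the driver JVM is already running
  // when this code executes.
  val spark = SparkSession.builder()
    .appName("g1gc-example")
    .config("spark.executor.memory", "8g")                      // max executor heap (placeholder)
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // GC algorithm for the executors
    .getOrCreate()

(On Java 11 and newer, G1 is the default collector anyway, so the flag mainly makes the choice explicit.)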

Do you need those large heaps in Spark? Why not split the work further, so you have more tasks that each need less memory?
I understand that each job is different and there can be reasons for large heaps, but I often try to start with the defaults and then tune individual options. I also try to avoid extreme values (of course there are cases where they are needed). Especially when upgrading from one Spark version to another, I often find it is better to start with a Spark job on default settings, because Spark itself has improved or changed how it works.

To reduce the heap you need, you can try to increase the number of tasks (see https://spark.apache.org/docs/latest/configuration.html):

spark.executor.cores (set it to a few) and spark.sql.shuffle.partitions (the default is 200; you can try 400 and so on and measure how much it helps),

and reduce

spark.executor.memory
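
For example, a rough sketch of what I mean (the concrete numbers are only placeholders to illustrate the direction, not a recommendation):

  import org.apache.spark.sql.SparkSession

  // More, smaller tasks: fewer cores and less memory per executor, more shuffle partitions.
  val spark = SparkSession.builder()
    .appName("smaller-executors-example")
    .config("spark.executor.cores", "4")            // a few cores per executor (placeholder)
    .config("spark.executor.memory", "16g")         // reduced executor heap (placeholder)
    .config("spark.sql.shuffle.partitions", "400")  // default is 200; try 400 etc. and measure
    .getOrCreate()

Then change one value at a time and compare run times and GC times before and after.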

On 10.12.2023 at 02:33, Faiz Halde <haldef...@gmail.com> wrote:


Thanks, I'll check them out.

Curious though, the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and that the region size is limited to at most 32 MB.

That's strange, because our heaps go up to 100 GB, and that would require a 64 MB region size to stay under 2048 regions.

Thanks
Faiz

On Sat, Dec 9, 2023, 10:33 Luca Canali <luca.can...@cern.ch> wrote:

Hi Faiz,

 

We find that G1GC works well for some of our workloads that are Parquet-read intensive, and we were already using G1GC with Spark on Java 8 (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions set to "-XX:+UseG1GC"); currently we are mostly running Spark (3.3 and higher) on Java 11.

However, it is always best to rely on measurements of your specific workloads; let me know if you find something different.
BTW, besides the Web UI, I typically also measure GC time with a couple of custom tools: https://github.com/cerndb/spark-dashboard and https://github.com/LucaCanali/sparkMeasure
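
For example, with sparkMeasure you can measure the GC time of a specific piece of code like this (a minimal sketch from spark-shell; the query is just a placeholder and the artifact version is an example, check the project README for the current one):

  // spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.24
  import ch.cern.sparkmeasure.StageMetrics

  val stageMetrics = StageMetrics(spark)
  stageMetrics.runAndMeasure {
    // placeholder workload: replace with your own Parquet-reading query
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
  }
  // The printed report includes, among other stage metrics, the aggregated jvmGCTime.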

A few microbenchmarks of Spark reading Parquet with different JDKs can be found at: https://db-blog.web.cern.ch/node/192

 

Best,

Luca

 

 

From: Faiz Halde <haldef...@gmail.com>
Sent: Thursday, December 7, 2023 23:25
To: user@spark.apache.org
Subject: Spark on Java 17

 

Hello,

 

We are planning to switch to Java 17 for Spark and were wondering if anybody has any obvious learnings related to JVM tuning.

 

We've been running on Java 8 for a while now and have been using the Parallel GC, as that used to be the general recommendation for high-throughput systems. How has the default G1GC worked out with Spark?

 

Thanks

Faiz
