Actually, I didn't have any of the GC tuning in the beginning, and adding it
later didn't make any difference either. As mentioned earlier, I tried a small
number of larger executors and vice versa; nothing helps. As for the code, it
is simple logistic regression, nothing with explicit broadcasts or anything
like that. The data is stored as Parquet files in a GCS bucket.
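
To give a rough idea, the job is essentially along these lines (a minimal
sketch; the GCS path, column names and iteration count below are placeholders,
not the actual ones):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lr-training").getOrCreate()

// Placeholder bucket path and feature/label column names
val training = spark.read.parquet("gs://<bucket>/path/to/training-data")

val lr = new LogisticRegression()
  .setMaxIter(100)            // illustrative value only
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = lr.fit(training)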

*The question is: how does it work fine with Spark 2.2 but have this issue
with Spark 2.3 or higher? As I mentioned before, it's the same code, same
cluster configuration and size, and same data in both cases.*


Regards,
Dhrub

On Mon, Jul 29, 2019 at 12:37 PM Jörn Franke <jornfra...@gmail.com> wrote:

> I would remove all the GC tuning and add it back later once you have found
> the underlying root cause. Usually more GC means you need to provide more
> memory, because something has changed (your application, Spark version, etc.).
>
> We don't have your full code to give exact advice, but you may want to
> rethink the one core per executor approach and have fewer executors with
> more cores per executor. That can sometimes lead to more heap usage per
> executor (especially if you broadcast). Keep in mind that using more cores
> per executor usually also requires more memory per executor, but fewer
> executors overall. Similarly, the executor instances might be too many, so
> that each one does not have enough heap. You can also increase the memory of
> the executor.
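>
> As an illustration only (these numbers are made up, not tuned to your job),
> a sizing with fewer but larger executors could look like:
>
> spark.executor.instances=5
> spark.executor.cores=4
> spark.executor.memory=36g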
>
> On 29.07.2019, at 08:22, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
>
> Hi,
>
> We were running Logistic Regression in Spark 2.2.x and then tried to see how
> it does in Spark 2.3.x. Now we are facing an issue while running a Logistic
> Regression model in Spark 2.3.x on top of YARN (GCP Dataproc). The
> treeAggregate step takes a huge amount of time due to very high GC activity.
> I have tuned the GC, created clusters of different sizes, tried a higher
> Spark version (2.4.x) and smaller data, but nothing helps. The GC time is
> 100 to 1000 times the processing time on average per iteration.
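>
> (For reference, besides the Spark UI, one way to pull those numbers is a
> small driver-side listener -- a sketch, not what we actually used; jvmGCTime
> and executorRunTime are per-task metrics in milliseconds:)
>
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> spark.sparkContext.addSparkListener(new SparkListener {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     val m = taskEnd.taskMetrics
>     if (m != null) {
>       // Compare per-task GC time with total task run time
>       println(s"stage=${taskEnd.stageId} gcMs=${m.jvmGCTime} runMs=${m.executorRunTime}")
>     }
>   }
> })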
>
> The strange part is that in *Spark 2.2 this doesn't happen at all*: same
> code, same cluster sizing, same data in both cases.
>
> I was wondering if someone could explain this behaviour and help me resolve
> it. How can the same code behave so differently in two Spark versions,
> especially the higher ones?
>
> Here are the configs I used:
>
>
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> #GC Tuning
>
> spark.executor.extraJavaOptions= -XX:+UseG1GC -XX:+PrintFlagsFinal
> -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
> -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms9000m
> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>
>
> spark.executor.instances=20
>
> spark.executor.cores=1
>
> spark.executor.memory=9010m
>
>
> Regards,
> Dhrub
>
>
