Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Actually, the original data is around 120 GB. If we provide more memory, we might need an even bigger cluster to finish training the whole model within the planned time, and this will affect the cost of operations. Please correct me if I am wrong here. Nevertheless, can you point out how much

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
Could be lots of things. Implementations change, caching may have changed, etc. The size of the input doesn't directly translate to heap usage. Here you just need a bit more memory. On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati wrote: > > Hi Sean, > > Yeah I checked the heap, it's almost
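To make "a bit more memory" concrete, here is a minimal sketch in plain Python that builds `spark-submit` `--conf` arguments from an executor heap size and an executor count, and reports the total heap the cluster would have to supply. The helper names (`executor_conf_args`, `total_heap_gb`) and the figures are hypothetical, not settings from this thread; only the `spark.executor.memory` and `spark.executor.instances` property names are real Spark configuration keys.

```python
from typing import List


def executor_conf_args(executor_memory_gb: int, num_executors: int) -> List[str]:
    """Build spark-submit --conf arguments for executor heap size and count."""
    if executor_memory_gb <= 0 or num_executors <= 0:
        raise ValueError("memory and executor count must be positive")
    return [
        "--conf", f"spark.executor.memory={executor_memory_gb}g",
        "--conf", f"spark.executor.instances={num_executors}",
    ]


def total_heap_gb(executor_memory_gb: int, num_executors: int) -> int:
    """Total executor heap requested across the cluster (excludes memory overhead)."""
    return executor_memory_gb * num_executors


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(" ".join(executor_conf_args(executor_memory_gb=16, num_executors=10)))
    print(total_heap_gb(16, 10), "GB of executor heap in total")
```

Keeping the total constant while trading executor count against per-executor heap is one way to test whether the job needs more heap per executor rather than a bigger cluster.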

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Hi Sean, Yeah, I checked the heap, and it's almost full. I checked the GC logs in the executors, where I found that GC cycles are kicking in frequently. The Executors tab shows red in the "Total Time/GC Time" column. Also, the data I am dealing with is quite small (~4 GB) and the cluster is quite big for t

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
-dev@ Yep, high GC activity means '(almost) out of memory'. I don't see that you've checked heap usage - is it nearly full? The answer isn't tuning but more heap. (Sometimes with really big heaps the problem is big pauses, but that's not the case here.) On Mon, Jul 29, 2019 at 1:26 AM Dhrubajyoti
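One rough way to put a number on "high GC activity" is the share of task time an executor spends in garbage collection, which is the ratio behind the red highlight in the Executors tab. The sketch below is plain Python with hypothetical helper names (`gc_fraction`, `is_gc_heavy`); the 10% threshold is an assumption for illustration, not a value stated in this thread.

```python
def gc_fraction(total_task_ms: int, gc_ms: int) -> float:
    """Return the share of task time spent in GC (0.0 if no task time is recorded)."""
    if total_task_ms < 0 or gc_ms < 0:
        raise ValueError("durations must be non-negative")
    if total_task_ms == 0:
        return 0.0
    return gc_ms / total_task_ms


def is_gc_heavy(total_task_ms: int, gc_ms: int, threshold: float = 0.10) -> bool:
    """Flag an executor whose GC share exceeds the assumed threshold."""
    return gc_fraction(total_task_ms, gc_ms) > threshold


if __name__ == "__main__":
    # Hypothetical durations, not measurements from this thread.
    print(is_gc_heavy(total_task_ms=600_000, gc_ms=240_000))
```

A high ratio on its own only says the collector is busy; checking that the heap is nearly full after collection, as suggested above, is what points to insufficient memory.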

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Actually, I didn't have any of the GC tuning in the beginning, and adding it didn't make any difference either. As mentioned earlier, I tried a small number of executors with a larger configuration and vice versa. Nothing helps. As for the code, it's simple logistic regression, nothing with explicit broadcast o

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Jörn Franke
I would remove all the GC tuning and add it back later, once you have found the underlying root cause. Usually more GC means you need to provide more memory, because something has changed (your application, Spark version, etc.). We don’t have your full code to give exact advice, but you may want to rethink

Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-28 Thread Dhrubajyoti Hati
Hi, We were running Logistic Regression in Spark 2.2.X and then tried to see how it does in Spark 2.3.X. Now we are facing an issue while running a Logistic Regression model in Spark 2.3.X on top of YARN (GCP Dataproc). In the TreeAggregate method it takes a very long time due to very High GC Acti