On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> have you deactivated the spark.ui ?
> I have read several threads explaining the UI can lead to OOM because it
> stores 1000 DAGs by default
>
>
> On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> > Dear List,
> >
> > I've observed some sort of memory leak when using pyspark to run ~100
> > jobs in local mode. Each job is essentially a create RDD -> create DF
> > -> write DF sort of flow. The RDDs and DFs go out of scope after each
> > job completes, hence I call this issue a "memory leak." Here's
> > pseudocode:
> >
> > ```
> > row_rdds = []
> > for i in range(100):
> >     # use `_` for the inner variable so it does not shadow the job index `i`
> >     row_rdd = spark.sparkContext.parallelize([{'a': i} for _ in range(1000)])
> >     row_rdds.append(row_rdd)
> >
> > for row_rdd in row_rdds:
> >     df = spark.createDataFrame(row_rdd)
> >     df.persist()
> >     print(df.count())
> >     df.write.save(...)  # Save parquet
> >     df.unpersist()
> >
> >     # Does not help:
> >     # del df
> >     # del row_rdd
> > ```

The connection between Python GC/del and JVM GC is perhaps a bit weaker
than we might like. There certainly could be a problem here, but it still
shouldn't be getting to the OOM state.

> > In my real application:
> > * rows are much larger, perhaps 1MB each
> > * row_rdds are sized to fit available RAM
> >
> > I observe that after 100 or so iterations of the second loop (each of
> > which creates a "job" in the Spark WebUI), the following happens:
> > * pyspark workers have fairly stable resident and virtual RAM usage
> > * the java process eventually approaches the resident RAM cap (8GB standard)
> >   but its virtual RAM usage keeps ballooning.

Can you share what flags the JVM is launching with? Also, which JVM(s) are
ballooning?

> > Eventually the machine runs out of RAM and the Linux OOM killer kills
> > the java process, resulting in an "IndexError: pop from an empty
> > deque" error from py4j/java_gateway.py .
> >
> > Does anybody have any ideas about what's going on? Note that this is
> > local mode. I have personally run standalone masters and submitted a
> > ton of jobs and never seen something like this over time. Those were
> > very different jobs, but perhaps this issue is specific to local mode?
> >
> > Emphasis: I did try to del the pyspark objects and run Python GC.
> > That didn't help at all.
> >
> > pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> >
> > 12-core i7 with 16GB of RAM and a 22GB swap file (swap is *on*).
> >
> > Cheers,
> > -Paul
>
> --
> nicolas

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
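
For reference, a minimal sketch of what "deactivating the spark.ui" (or shrinking the history it retains) can look like. The config keys are standard Spark settings; the values of 50 are arbitrary illustrations, not recommendations:

```python
from pyspark.sql import SparkSession

# Either disable the UI entirely or cap how much history it retains.
# spark.ui.retainedJobs / spark.ui.retainedStages default to 1000 each,
# which is the "1000 dags" figure mentioned above.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.ui.enabled", "false")              # option 1: no UI at all
    # option 2: keep the UI but retain far less history
    .config("spark.ui.retainedJobs", "50")
    .config("spark.ui.retainedStages", "50")
    .config("spark.sql.ui.retainedExecutions", "50")
    .getOrCreate()
)
```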
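
To make the "weak coupling" point concrete: `del` and `unpersist()` on the Python side only drop the py4j reference, and the driver JVM frees the underlying memory on its own GC schedule. A rough diagnostic sketch along the lines of the pseudocode above (`out_path` is a hypothetical output directory standing in for the elided `save(...)` target, and `_jvm` is an internal py4j handle, so treat this as a debugging aid rather than a supported API):

```python
import gc

# Reworked second loop from the pseudocode above.
for i, row_rdd in enumerate(row_rdds):
    df = spark.createDataFrame(row_rdd)
    df.persist()
    print(df.count())
    df.write.parquet(f"{out_path}/part_{i}")  # out_path is hypothetical
    df.unpersist(blocking=True)  # block until the cached blocks are dropped
    del df

gc.collect()                          # collect the Python-side py4j wrappers
spark.sparkContext._jvm.System.gc()   # internal handle; only *requests* a JVM GC
```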
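
On the question of which flags the JVM launched with: in local mode the driver JVM is the only Spark JVM, so its heap cap comes from `spark.driver.memory` plus anything in `spark.driver.extraJavaOptions`, both of which have to be set before that JVM starts (e.g. via spark-submit). A sketch for inspecting a running driver from the PySpark shell (again relying on the internal `_jvm` handle, so a diagnostic rather than an API):

```python
# Inspect the running driver JVM through py4j.
runtime = spark.sparkContext._jvm.java.lang.management.ManagementFactory.getRuntimeMXBean()
print(list(runtime.getInputArguments()))  # the -Xmx / -XX:... flags the JVM was started with

# With no separate executor JVMs in local mode, the heap cap is
# spark.driver.memory (1g by default if nothing was passed at launch time).
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set (default 1g)"))
```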