Well, dumb question: given the workflow outlined above, should local mode be able to keep running indefinitely, or is this leak a known issue? I just wanted to check, because I can't recall seeing this problem with a non-local master, though it's possible that task failures hid it there.
If this issue looks new, what's the easiest way to record memory dumps or do profiling? Can I put something in my spark-defaults.conf? (I've sketched a couple of guesses below, after the quoted thread.) The code is open source and the run is reproducible, although the specific test currently requires a rather large (but public) dataset.

On Sun, Oct 20, 2019 at 6:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Honestly, I'd recommend you spend your time looking into the issue yourself, by taking a memory dump at some interval and comparing the differences (and at the very least sharing those dump files with the community, redacted if necessary). Otherwise someone has to try to reproduce it without a reproducer, and may fail to reproduce it at all despite the time spent. A memory leak is not easy to reproduce unless it leaks objects unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais <paulw...@gmail.com> wrote:
>>
>> Dear List,
>>
>> I've observed some sort of memory leak when using pyspark to run ~100
>> jobs in local mode. Each job is essentially a create RDD -> create DF
>> -> write DF sort of flow. The RDDs and DFs go out of scope after each
>> job completes, hence I call this issue a "memory leak." Here's
>> pseudocode:
>>
>> ```
>> row_rdds = []
>> for i in range(100):
>>     row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
>>     row_rdds.append(row_rdd)
>>
>> for row_rdd in row_rdds:
>>     df = spark.createDataFrame(row_rdd)
>>     df.persist()
>>     print(df.count())
>>     df.write.save(...)  # Save parquet
>>     df.unpersist()
>>
>> # Does not help:
>> # del df
>> # del row_rdd
>> ```
>>
>> In my real application:
>> * rows are much larger, perhaps 1MB each
>> * row_rdds are sized to fit available RAM
>>
>> I observe that after 100 or so iterations of the second loop (each of
>> which creates a "job" in the Spark WebUI), the following happens:
>> * the pyspark workers have fairly stable resident and virtual RAM usage
>> * the java process eventually approaches its resident RAM cap (8GB
>>   standard), but its virtual RAM usage keeps ballooning
>>
>> Eventually the machine runs out of RAM, and the Linux OOM killer kills
>> the java process, resulting in an "IndexError: pop from an empty
>> deque" error from py4j/java_gateway.py.
>>
>> Does anybody have any ideas about what's going on? Note that this is
>> local mode. I have personally run standalone masters and submitted a
>> ton of jobs and never seen anything like this over time. Those were
>> very different jobs, but perhaps this issue is specific to local mode?
>>
>> Emphasis: I did try to del the pyspark objects and run the Python GC.
>> That didn't help at all.
>>
>> pyspark 2.4.4 on java 1.8 on Ubuntu Bionic (tensorflow docker image)
>>
>> 12-core i7 with 16GB of RAM and a 22GB swap file (swap is *on*)
>>
>> Cheers,
>> -Paul
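To make the spark-defaults.conf question concrete, here is the sort of thing I was considering; it's only a sketch, the dump path is a placeholder, and I'm not certain these are the right knobs. Since local mode runs the executors inside the driver JVM, I'm assuming the driver options are the ones that matter:

```
# spark-defaults.conf (sketch only; /tmp/spark-dumps is a placeholder path)
# In local mode the executors live inside the driver JVM, so driver options
# should cover the whole process.
spark.driver.memory             8g
spark.driver.extraJavaOptions   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-dumps -XX:NativeMemoryTracking=summary
```

Given that it's the virtual (not resident) footprint that balloons, I suspect Native Memory Tracking may tell us more than a heap dump alone, but that's a guess.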
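And for the "dump per interval and compare" suggestion, is something like the following what you had in mind? The pid lookup and output paths are placeholders:

```
# Grab the pid of the local-mode JVM (launched via spark-submit).
JVM_PID=$(pgrep -f org.apache.spark.deploy.SparkSubmit | head -n 1)

# Heap dump of live objects; successive dumps can be compared in a heap
# analyzer such as Eclipse MAT.
jmap -dump:live,format=b,file=/tmp/heap-$(date +%s).hprof "$JVM_PID"

# Lighter weight: class histograms diff easily as plain text.
jcmd "$JVM_PID" GC.class_histogram > /tmp/histo-$(date +%s).txt

# If NativeMemoryTracking is enabled (see the conf sketch above), this
# snapshots off-heap usage so growth between intervals is visible.
jcmd "$JVM_PID" VM.native_memory summary > /tmp/nmt-$(date +%s).txt
```

If one of these looks right, I can run it on a schedule during the repro and share the (redacted) output.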