My setup: using Pyspark; Mongodb to retrieve and store final results; Spark is in standalone cluster mode, on a single desktop. Spark v.2.4.4. Openjdk 8.
My spark application (using pyspark) uses all available system memory. This seems to be unrelated to the data being processed. I tested with 32GB and 64GB of RAM. In both cases memory use was about 1-2GB less than the total system memory. From searching, other people facing memory issues have large datasets, but the amount of data I am dealing with is much smaller. I have a dataframe that gets processed/aggregated into a smaller dataframe. It begins with a dataframe of about 1 million rows, ~100MB as a zipped csv file, then aggregates the results into a smaller dataframe of about 20 columns, 200 rows. The size of this smaller dataframe is 15KB (from Spark's UI page, under storage). I am using regular pyspark commands (including aggregating over several window functions) on dataframes. After a series of computations the end result is a 15KB dataframe. Then a series of Python UDFs and joins to calculate further results from this 15KB dataframe. This produces a similarly small dataframe. There is a particularly big spike in memory usage during these steps, even though the data processed is trivially small. Using the Spark UI page’s display of storage the RDDs shown use trivial (each RDD shown only KB in size) amounts of memory. This is the output of the “executors” tab from the Spark UI, under Storage Memory: 1.7 MB / 15.5 GB. The amount of memory that htop shows the JVMs are using is unrelated to the number of executors * RAM per executor setting in spark-submit. Why is Spark’s memory usage unrelated to any settings in spark-submit (standalone cluster mode)? What is the memory being used for? I can confirm from the Spark UI page that almost none of that is dataframe data? If I no longer have a reference to a dataframe in Python does that mean the corresponding dataframe has been garbage collected from Spark? Do dataframes in pyspark need to be explicitly garbage collected in some way?
signature.asc
Description: OpenPGP digital signature