Hi,

I have a Spark SQL DataFrame (read from Parquet) with some 20 columns. The data is divided into chunks of about 50 million rows each. Among the columns is a "GROUP_ID", which is basically a string of 32 hexadecimal characters.

Following the tuning guide [0], I figured I could improve performance and memory consumption by converting this column to some other format. I came up with two possibilities (see the sketch below):

- cut the string in half and convert it to a "long" (losing some information, which is fine for within-chunk processing)
- convert it to a byte array
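
Roughly what I mean, as a simplified sketch (not my actual code; the path and the "hexToLong" name are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.unhex;

public class GroupIdConversion {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("group-id-conversion").getOrCreate();

        // placeholder path; in reality one ~50M-row chunk of the data
        Dataset<Row> df = spark.read().parquet("/path/to/chunk.parquet");

        // variant 1: keep only the first 16 hex characters, parse as unsigned long
        // (assumes GROUP_ID is never null)
        spark.udf().register("hexToLong",
                (UDF1<String, Long>) s -> Long.parseUnsignedLong(s.substring(0, 16), 16),
                DataTypes.LongType);
        Dataset<Row> asLong = df.withColumn("GROUP_ID", callUDF("hexToLong", col("GROUP_ID")));

        // variant 2: keep all 32 hex characters, but as a 16-byte binary column
        Dataset<Row> asBytes = df.withColumn("GROUP_ID", unhex(col("GROUP_ID")));

        asLong.printSchema();
        asBytes.printSchema();
    }
}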

Doing some math, and ignoring per-object overhead: 32 hexadecimal characters take about 32 bytes as string data, versus 16 bytes as a byte array and 8 bytes as a long. So this should cut the memory usage of the "GROUP_ID" column roughly in half (byte array) or to a quarter (long). However, neither the Spark UI (Storage tab) nor SizeEstimator.estimate() showed such an effect; memory consumption stayed essentially the same.
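
For completeness, this is more or less how I look at the sizes (again a simplified sketch; the sample size is arbitrary, and "converted" stands for the DataFrame with the new column from the sketch above):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.util.SizeEstimator;

// cache and materialise, so the Storage tab reports the in-memory size
Dataset<Row> converted = asLong;  // or asBytes
converted.persist(StorageLevel.MEMORY_ONLY());
converted.count();

// rough driver-side estimate on a sample (collecting all ~50M rows would not fit)
List<Row> sample = converted.limit(100000).collectAsList();
long bytesPerRow = SizeEstimator.estimate(sample) / sample.size();
System.out.println("approx. " + bytesPerRow + " bytes per row");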

In case lineage (the DataFrame's "history") mattered, I also tried writing the new DataFrame to Parquet and reading it back in; the result was still the same.
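
That is, something along these lines (continuing the sketch above, with a placeholder path):

// write the converted DataFrame out and read it back, so the re-read
// DataFrame has no lineage back to the original string column
converted.write().mode("overwrite").parquet("/tmp/group_id_converted.parquet");
Dataset<Row> reread = spark.read().parquet("/tmp/group_id_converted.parquet");
reread.persist(StorageLevel.MEMORY_ONLY());
reread.count();  // reported size essentially unchanged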

I usually use Python, but in order to use SizeEstimator I did this one in Java. My code for the "long" conversion is at [1].

Any idea what I am missing?

Thanks!



[0] http://spark.apache.org/docs/latest/tuning.html#tuning-data-structures
[1] https://gist.github.com/TwUxTLi51Nus/2aba16163bf01a4d6417bb65f6fe5b38

--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co
