Hi,

I have a Spark SQL DataFrame (read from Parquet) with some 20 columns. The data is divided into chunks of about 50 million rows each. Among the columns is a "GROUP_ID", which is basically a string of 32 hexadecimal characters.

Following the tuning guide [0], I figured I could improve performance and memory consumption by converting this column to some other format. I came up with two possibilities (see the sketch below):

- cut the string in half and convert it to a "long" (losing some information, which is fine for within-chunk processing)
- convert it to a byte array
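
Roughly what I mean, as a simplified sketch (not my actual code; the path and the "hexToLong" name are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.unhex;

public class GroupIdConversion {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("group-id-conversion").getOrCreate();

        // placeholder path; in reality one ~50M-row chunk of the data
        Dataset<Row> df = spark.read().parquet("/path/to/chunk.parquet");

        // variant 1: keep only the first 16 hex characters, parse as unsigned long
        // (assumes GROUP_ID is never null)
        spark.udf().register("hexToLong",
                (UDF1<String, Long>) s -> Long.parseUnsignedLong(s.substring(0, 16), 16),
                DataTypes.LongType);
        Dataset<Row> asLong = df.withColumn("GROUP_ID", callUDF("hexToLong", col("GROUP_ID")));

        // variant 2: keep all 32 hex characters, but as a 16-byte binary column
        Dataset<Row> asBytes = df.withColumn("GROUP_ID", unhex(col("GROUP_ID")));

        asLong.printSchema();
        asBytes.printSchema();
    }
}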

Doing some math, and ignoring per-object overhead: 32 hexadecimal characters take about 32 bytes as string data, versus 16 bytes as a byte array and 8 bytes as a long. So this should cut the memory usage of the "GROUP_ID" column roughly in half (byte array) or to a quarter (long). However, neither the Spark UI (Storage tab) nor SizeEstimator.estimate() showed such an effect; memory consumption stayed essentially the same.
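
For completeness, this is more or less how I look at the sizes (again a simplified sketch; the sample size is arbitrary, and "converted" stands for the DataFrame with the new column from the sketch above):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.util.SizeEstimator;

// cache and materialise, so the Storage tab reports the in-memory size
Dataset<Row> converted = asLong;  // or asBytes
converted.persist(StorageLevel.MEMORY_ONLY());
converted.count();

// rough driver-side estimate on a sample (collecting all ~50M rows would not fit)
List<Row> sample = converted.limit(100000).collectAsList();
long bytesPerRow = SizeEstimator.estimate(sample) / sample.size();
System.out.println("approx. " + bytesPerRow + " bytes per row");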

In case lineage (the DataFrame's "history") mattered, I also tried writing the new DataFrame to Parquet and reading it back in; the result was still the same.
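
That is, something along these lines (continuing the sketch above, with a placeholder path):

// write the converted DataFrame out and read it back, so the re-read
// DataFrame has no lineage back to the original string column
converted.write().mode("overwrite").parquet("/tmp/group_id_converted.parquet");
Dataset<Row> reread = spark.read().parquet("/tmp/group_id_converted.parquet");
reread.persist(StorageLevel.MEMORY_ONLY());
reread.count();  // reported size essentially unchanged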

I usually use Python, but in order to use SizeEstimator I did this one in Java. My code for the "long" conversion is at [1].

Any idea what I am missing?

Thanks!



[0] http://spark.apache.org/docs/latest/tuning.html#tuning-data-structures
[1] https://gist.github.com/TwUxTLi51Nus/2aba16163bf01a4d6417bb65f6fe5b38

--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co
