Hi,
I have a Spark SQL DataFrame (read from Parquet) with some 20
columns. The data is divided into chunks of about 50 million rows each.
Among the columns is a "GROUP_ID", which is basically a string of 32
hexadecimal characters.
Following the tuning guide [0], I figured I could improve performance and
memory consumption by converting this column to some other format. I came
up with two possibilities (rough sketches below):
- cut it in half and convert it to a "long" (losing some information, but
that is fine for within-chunk processing)
- convert it to a byte array
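
Sketched out and simplified, the two conversions look roughly like this
(assuming a SparkSession named "spark" and the original DataFrame as "df";
the new column names are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Variant 1: keep only the first 16 hex characters and parse them as an
// unsigned 64-bit value (drops the second half of the ID).
spark.udf().register("hexToLong",
    (UDF1<String, Long>) s -> Long.parseUnsignedLong(s.substring(0, 16), 16),
    DataTypes.LongType);

// Variant 2: decode all 32 hex characters into a 16-byte array.
spark.udf().register("hexToBytes",
    (UDF1<String, byte[]>) s -> {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    },
    DataTypes.BinaryType);

Dataset<Row> withLong = df
    .withColumn("GROUP_ID_LONG", callUDF("hexToLong", col("GROUP_ID")))
    .drop("GROUP_ID");
Dataset<Row> withBytes = df
    .withColumn("GROUP_ID_BYTES", callUDF("hexToBytes", col("GROUP_ID")))
    .drop("GROUP_ID");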
Doing some math, and ignoring per-object overhead: 32 hexadecimal
characters encode 16 raw bytes, so this should cut the memory usage of the
"GROUP_ID" column roughly in half (16-byte array) or to a quarter (8-byte
"long").
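
Just to illustrate the per-value numbers, this is how one could point
SizeEstimator at single sample values on the driver (the values are made
up, and these estimates do include the per-object overhead ignored above):

import org.apache.spark.util.SizeEstimator;

// Rough per-value sizes on the driver, including the JVM object
// overhead that the back-of-the-envelope numbers leave out.
String hexId  = "0123456789abcdef0123456789abcdef"; // 32 hex characters
byte[] rawId  = new byte[16];                        // same payload as raw bytes
Long   halfId = 0x0123456789abcdefL;                 // first 16 hex chars as a long

System.out.println(SizeEstimator.estimate(hexId));
System.out.println(SizeEstimator.estimate(rawId));
System.out.println(SizeEstimator.estimate(halfId));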
However, neither the Spark UI (Storage tab) nor "SizeEstimator.estimate()"
showed such an effect; memory consumption stayed essentially the same.
In case lineage (the DataFrame's "history") mattered, I also tried writing
the new DataFrame to Parquet and reading it back in; the result was still
the same.
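
Concretely, the roundtrip was roughly the following (paths are
placeholders, "withLong" stands for the converted DataFrame from the
sketch above, and the sampled collect at the end is only meant to show one
way of pointing SizeEstimator at the data):

// Write the converted DataFrame out and read it back, so nothing in the
// cached copy's lineage still refers to the original string column.
withLong.write().mode("overwrite").parquet("/tmp/group_id_long.parquet");
Dataset<Row> reloaded = spark.read().parquet("/tmp/group_id_long.parquet");

// Materialize a cached copy and compare its size in the Storage tab.
reloaded.cache();
reloaded.count();

// Driver-side spot check with SizeEstimator on a small sample of rows.
java.util.List<Row> sample = reloaded.limit(100000).collectAsList();
System.out.println(org.apache.spark.util.SizeEstimator.estimate(sample));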
I usually use Python, but in order to use SizeEstimator I did this one in
Java. My code for the "long" conversion is at [1].
Any idea about what I am missing?
Thanks!
[0] http://spark.apache.org/docs/latest/tuning.html#tuning-data-structures
[1] https://gist.github.com/TwUxTLi51Nus/2aba16163bf01a4d6417bb65f6fe5b38
--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co