I would like to add that the compression schemes built into the
in-memory columnar storage only support primitive column types (int,
string, etc.); complex types like array, map, and struct are not
supported.
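
A rough sketch of what that implies for a cached table, assuming a
SQLContext named sqlContext is already in scope and a hypothetical
Parquet data set whose schema mixes primitive and complex columns (the
name and path are made up):

    // Hypothetical schema: id INT, country STRING, tags ARRAY<STRING>,
    //                      geo STRUCT<lat: DOUBLE, lon: DOUBLE>
    val logs = sqlContext.parquetFile("/warehouse/logs")
    logs.registerTempTable("logs")

    sqlContext.cacheTable("logs")
    sqlContext.sql("SELECT COUNT(*) FROM logs").collect() // scan once to materialize the cache

    // Only the primitive columns (id, country) go through the RLE/delta/
    // dictionary codecs when the in-memory format is built; the ARRAY and
    // STRUCT columns are stored without them.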
On 12/20/14 6:17 AM, Sadhan Sood wrote:
Hey Michael,
Thank you for clarifying that. Is Tachyon the right way to get
compressed data in memory, or should we explore the option of adding
compression to cached data? I ask because our uncompressed data set
is too big to fit in memory right now. I see the benefit of Tachyon
not just in storing compressed data in memory, but also in not having
to create a separate table for caching some partitions, like 'cache
table table_cached as select * from table where date = 201412XX',
the way we are doing right now.
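
For reference, a sketch of that workaround as it would be run
programmatically, assuming a SQLContext (or HiveContext) named
sqlContext is already in scope; the table name and date value are the
placeholders from the mail above.

    // Cache one day's partition under a separate name.
    // 201412XX is the placeholder date from the mail.
    sqlContext.sql(
      """CACHE TABLE table_cached AS
        |SELECT * FROM table WHERE date = 201412XX""".stripMargin)

    // Queries then have to be rewritten to hit the cached copy:
    sqlContext.sql("SELECT COUNT(*) FROM table_cached").collect()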
On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust
<mich...@databricks.com> wrote:
There is only column-level encoding (run-length encoding, delta
encoding, dictionary encoding) and no generic compression.
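
To make that concrete, here is a toy illustration of two of those
encodings on a small string column; this is plain Scala for intuition
only, not Spark's actual in-memory columnar code.

    // A small column of values, as it might appear in one batch of cached rows.
    val column = Seq("US", "US", "US", "CA", "CA", "US")

    // Run-length encoding: collapse consecutive repeats into (value, runLength) pairs.
    val rle = column.foldLeft(List.empty[(String, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse
    // rle == List(("US", 3), ("CA", 2), ("US", 1))

    // Dictionary encoding: store each distinct value once, keep small integer ids.
    val dict = column.distinct.zipWithIndex.toMap // Map("US" -> 0, "CA" -> 1)
    val ids  = column.map(dict)                   // 0, 0, 0, 1, 1, 0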
On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood
<sadhan.s...@gmail.com> wrote:
Hi All,
Wondering whether, when caching a table backed by lzo-compressed
Parquet data, Spark also compresses it (using lzo/gzip/snappy) on top
of the column-level encoding, or only does the column-level encoding,
when "spark.sql.inMemoryColumnarStorage.compressed" is set to true.
I ask because when I try to cache the data, I notice the memory being
used is almost as much as the uncompressed size of the data.
Thanks!
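
As a point of reference, a minimal sketch of the setup being asked
about, using the Spark 1.x-era SQLContext API; the application name,
table name, and path are made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("cache-size-check"))
    val sqlContext = new SQLContext(sc)

    // Enables the column-level encodings (RLE, delta, dictionary) for cached
    // data; per the reply above, no extra lzo/gzip/snappy pass is applied on top.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Hypothetical path to the lzo-compressed Parquet data set.
    val events = sqlContext.parquetFile("/warehouse/events")
    events.registerTempTable("events")

    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect() // scan once so the cache is materialized

    // The resulting in-memory size is what shows up under the Storage tab of
    // the web UI; that is the number being compared to the uncompressed size.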