Thanks Michael, that makes sense.

On Fri, Dec 19, 2014 at 3:13 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Yeah, Tachyon does sound like a good option here. Especially if you have
> nested data, it's likely that Parquet in Tachyon will always be better
> supported.
>
> On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
>>
>> Hey Michael,
>>
>> Thank you for clarifying that. Is Tachyon the right way to get compressed
>> data in memory, or should we explore the option of adding compression to
>> cached data? Our uncompressed data set is too big to fit in memory right
>> now. I see the benefit of Tachyon not just in storing compressed data in
>> memory, but also in that we wouldn't have to create a separate table for
>> caching some partitions - e.g. 'cache table table_cached as select * from
>> table where date = 201412XX' - the way we are doing right now.
>>
>> On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>> There is only column-level encoding (run-length encoding, delta
>>> encoding, dictionary encoding) and no generic compression.
>>>
>>> On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> When caching a table backed by lzo-compressed Parquet data, does Spark
>>>> also compress it (using lzo/gzip/snappy) along with the column-level
>>>> encoding, or does it only do the column-level encoding when
>>>> "spark.sql.inMemoryColumnarStorage.compressed" is set to true? I ask
>>>> because when I try to cache the data, I notice the memory being used is
>>>> almost as much as the uncompressed size of the data.
>>>>
>>>> Thanks!
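For reference, a minimal sketch of the partition-caching workaround Sadhan describes, with the in-memory columnar compression flag enabled. It assumes a HiveContext and a Hive table named my_table partitioned by date; the table name, column name, and partition value 20141218 are placeholders, not values from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("cache-partition-sketch"))
    val sqlContext = new HiveContext(sc)

    // Enable column-level encodings (RLE, delta, dictionary) for the in-memory
    // columnar store; per the thread, this is not generic compression such as
    // lzo/gzip/snappy.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Cache just one partition under a separate name, as described above.
    // `my_table`, `date`, and 20141218 are hypothetical placeholders.
    sqlContext.sql(
      "CACHE TABLE table_cached AS SELECT * FROM my_table WHERE `date` = 20141218")

    // Queries against table_cached are served from the in-memory columnar store.
    sqlContext.sql("SELECT count(*) FROM table_cached").collect()

    // Drop the cached copy when done.
    sqlContext.uncacheTable("table_cached")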
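And a rough sketch of the Tachyon route Michael suggests: keep the Parquet files (with their own compression and encodings) in Tachyon's memory tier and query them directly, instead of re-caching them in Spark's columnar store. The Tachyon host, port, and paths below are assumptions for illustration only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("tachyon-parquet-sketch"))
    val sqlContext = new HiveContext(sc)

    // Write one partition of the source table out as Parquet on Tachyon
    // (hypothetical master host/port and path).
    sqlContext.sql("SELECT * FROM my_table WHERE `date` = 20141218")
      .saveAsParquetFile("tachyon://tachyon-master:19998/warehouse/my_table_20141218")

    // Read it back and register it for SQL queries; the Parquet files stay
    // compressed in Tachyon, so nothing is duplicated in Spark's cache.
    val fromTachyon =
      sqlContext.parquetFile("tachyon://tachyon-master:19998/warehouse/my_table_20141218")
    fromTachyon.registerTempTable("table_cached")
    sqlContext.sql("SELECT count(*) FROM table_cached").collect()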