On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT <fnoth...@berkeley.edu> wrote:
> RE:
>
>> Given that our agg sizes will exceed memory, we expect to cache them to
>> disk, so save-as-object (assuming there are no out-of-the-ordinary
>> performance issues) may solve the problem, but I was hoping to store data
>> in a column-oriented format. However, I think this is in general not
>> possible - Spark can *read* Parquet, but I think it cannot write Parquet
>> as a disk-based RDD format.
>
> Spark can write Parquet, via the ParquetOutputFormat which is distributed
> with Parquet. If you'd like example code for writing out to Parquet, please
> see the adamSave function in
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala,
> starting at line 62. There is a bit of setup necessary for the Parquet
> write codec, but otherwise it is fairly straightforward.

Thank you, Frank. My thought is to generate an aggregated RDD from our full
data set; the aggregated RDD will be about 10% of the size of the full data
set and will be stored to disk in a column store, to be loaded by future
jobs. When a future job loads the aggregated RDD, will Spark load only the
columns accessed by the query, or will it load everything, convert it into
an internal representation, and then execute the query?
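For concreteness, here is a minimal sketch of the write setup Frank describes:
saving an RDD of Avro records to Parquet through ParquetOutputFormat with
Spark's saveAsNewAPIHadoopFile, following the general pattern of adamSave.
This is not the exact ADAM code; MyAvroRecord and saveAsParquet are
placeholder names, and the "parquet.*" package names assume the 2014-era
Parquet artifacts:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._ // implicit pair-RDD functions
    import org.apache.spark.rdd.RDD
    import parquet.avro.AvroParquetOutputFormat
    import parquet.hadoop.ParquetOutputFormat
    import parquet.hadoop.metadata.CompressionCodecName

    // MyAvroRecord stands in for an Avro-generated record class
    // (the role ADAMRecord plays in ADAM).
    def saveAsParquet(rdd: RDD[MyAvroRecord], outputPath: String) {
      val job = new Job(rdd.context.hadoopConfiguration)

      // Set up the Parquet write codec and file layout.
      ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP)
      ParquetOutputFormat.setEnableDictionary(job, true)
      ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024)
      ParquetOutputFormat.setPageSize(job, 1 * 1024 * 1024)

      // Tell the Avro write support which schema the records use.
      AvroParquetOutputFormat.setSchema(job, new MyAvroRecord().getSchema)

      // ParquetOutputFormat writes (Void, record) pairs, so attach a null
      // key to each record before handing it to the Hadoop output format.
      rdd.map(record => (null, record))
        .saveAsNewAPIHadoopFile(outputPath,
          classOf[java.lang.Void],
          classOf[MyAvroRecord],
          classOf[AvroParquetOutputFormat],
          job.getConfiguration)
    }

The authoritative version, including the extra write-codec setup Frank
mentions, is the adamSave function at the GitHub link above.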