i wrote a proof of concept to automatically store any RDD of tuples or case classes in columar format using arrays (and strongly typed, so you get the benefit of primitive arrays). see: https://github.com/tresata/spark-columnar
On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <[email protected]> wrote: > Shark's in-memory code was ported to Spark SQL and is used by default when > you run .cache on a SchemaRDD or CACHE TABLE. > > I'd also look at parquet which is more efficient and handles nested data > better. > > On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <[email protected]> > wrote: > >> Hi all, >> >> I'd like to build/use column oriented RDDs in some of my Spark code. A >> normal Spark RDD is stored as row oriented object if I understand >> correctly. >> >> I'd like to leverage some of the advantages of a columnar memory format. >> Shark (used to) and SparkSQL uses a columnar storage format using primitive >> arrays for each column. >> >> I'd be interested to know more about this approach and how I could build >> my own custom columnar-oriented RDD which I can use outside of Spark SQL. >> >> Could anyone give me some pointers on where to look to do something like >> this, either from scratch or using whats there in the SparkSQL libs or >> elsewhere. I know Evan Chan in a presentation made mention of building a >> custom RDD of column-oriented blocks of data. >> >> Cheers, >> ~N >> > >
