Why would you expect the footprint of the DataFrame to be lower when it contains more information (RDD + schema)?
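Also note that df.cache() is lazy: nothing is materialized until the first action (df.show here). If you want to compare the two cached footprints programmatically rather than eyeballing the UI, something like this minimal sketch should work (it uses getRDDStorageInfo, a developer API on SparkContext in Spark 1.x; memSize is the in-memory size in bytes):

  // Print every cached RDD's name, cached partition count and memory size.
  sc.getRDDStorageInfo.foreach { info =>
    println(s"name=${info.name} " +
      s"cachedPartitions=${info.numCachedPartitions} memSize=${info.memSize}B")
  }

The cached DataFrame shows up there as its own entry, separate from the raw RDD cache.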
On Sat, Aug 15, 2015 at 6:35 PM, Todd <bit1...@163.com> wrote:
> Hi,
> With the following code snippet, I cached the raw RDD (which is already in
> memory, but just for illustration) and its DataFrame.
> I thought that the df cache would take less space than the rdd cache, which
> is wrong: from the UI I see that the rdd cache takes 168B, while the df
> cache takes 272B.
> What data is cached when df.cache is called, and when is the data actually
> cached? It looks like the df only cached the avg(age), which should be much
> smaller in size.
>
> val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
> rdd.cache
> rdd.toDF().registerTempTable("TBL_STUDENT")
> val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
> df.cache()
> df.show
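For reference, here is a self-contained version of the quoted snippet that compiles as-is against Spark 1.x. The Student case class is an assumed definition, since the original mail elides it; note it must live outside the method for toDF() to find the implicit encoder.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed definition; the original mail does not show it.
case class Student(name: String, age: Int)

object SparkSQLCache {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
    rdd.cache()                        // raw RDD cache: Java objects
    rdd.toDF().registerTempTable("TBL_STUDENT")

    val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
    df.cache()                         // lazy: filled on the first action
    df.show()                          // triggers both caches to materialize
  }
}

Running df.show computes the query over the temp table, which evaluates the underlying RDD (filling the 168B rdd cache as a side effect) and then stores the single aggregated avg(age) row in the df cache.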