Why would you expect the footprint of the DataFrame to be lower when it
contains more information (RDD + schema)?
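
For what it's worth, you can read those numbers from the driver instead of
the UI. A minimal sketch, assuming Spark 1.x (getRDDStorageInfo is a
DeveloperApi, so its output may change between releases):

    // Lists every cached RDD, including the one backing a cached DataFrame,
    // once an action has materialized it.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory")
    }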

On Sat, Aug 15, 2015 at 6:35 PM, Todd <bit1...@163.com> wrote:

> Hi,
> With the following code snippet, I cached the raw RDD (which is already in
> memory, but just for illustration) and a DataFrame derived from it.
> I expected the df cache to take less space than the rdd cache, but that
> turns out to be wrong: in the UI I see that the rdd cache takes 168 B,
> while the df cache takes 272 B.
> What data is actually cached when df.cache is called? It looks like the
> df only caches avg(age), which should be much smaller in size.
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
>
> case class Student(name: String, age: Int)
>
> val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> // Cache the raw RDD of Student objects.
> val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
> rdd.cache()
> rdd.toDF().registerTempTable("TBL_STUDENT")
>
> // Cache the aggregation result as a DataFrame and materialize both caches.
> val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
> df.cache()
> df.show()
>
>
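
As for what df.cache actually stores: printing the extended plan shows it;
a rough sketch:

    // After df.cache() and an action such as df.show(), the plan should
    // contain an InMemoryRelation (scanned via InMemoryColumnarTableScan).
    df.explain(true)

As far as I understand, the cached data really is just the aggregation
output (a single avg(age) row here), but it is stored in Spark SQL's
in-memory columnar format, which keeps schema information and per-column
statistics alongside the values; for such a tiny result, that bookkeeping
dominates the footprint.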
