I thought that the df contains only one column, and in fact only one
resulting row (select avg(age) from theTable).
So I would think that it would take less space. It looks like my understanding is wrong?
At 2015-08-16 12:34:31, "Rishi Yadav" <ri...@infoobjects.com> wrote:

Why are you expecting the footprint of the DataFrame to be lower when it contains more
information (RDD + schema)?


On Sat, Aug 15, 2015 at 6:35 PM, Todd <bit1...@163.com> wrote:

Hi,
With the following code snippet, I cached the raw RDD (which is already in memory,
but just for illustration) and its DataFrame.
I thought the df cache would take less space than the rdd cache, which turned out to
be wrong: in the UI I see that the rdd cache takes 168B, while the df cache takes 272B.
What data is cached when df.cache is called and the data is actually materialized? It
looks like the df only caches the avg(age), which should be much smaller in size.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Student case class (definition assumed; it was not shown in the original snippet)
case class Student(name: String, age: Int)

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache()
rdd.toDF().registerTempTable("TBL_STUDENT")
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()      // cache is lazy; the data is materialized by the action below
df.show()


