Hello,

I'm using Spark for a project these days, and I noticed that when loading data stored in Hadoop HDFS, there is a huge difference in JVM memory footprint between the DataFrame and RDD formats. Below is my spark-shell session. The original file (testData) is an ordinary text file of about 11GB on disk; each line has the format "Id1,Id2", where Id1 and Id2 are both random int32 numbers.

// code segment, run in spark-shell
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Dataset, Row}

// this text file's size is 11GB on disk
var filePath = "hdfs://10.10.23.105:9000/testData"
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load(filePath)

// the first format, the DataFrame itself, which turns out to be 5.5GB in memory
df.cache()
df.count()

// the second format, df.rdd, which turns out to be 95GB in memory
df.rdd.cache()
df.rdd.count()

// the third format, a plain RDD, which turns out to be 88GB in memory
val pureRDD = spark.sparkContext.textFile(filePath)
pureRDD.cache()
pureRDD.count()

// the line below goes wrong when I use collect(), even though the driver
// has 200GB and the executor 300GB of memory allocated
df.collect()

So I have run into two problems:

Q1: I loaded and cached the identical raw file in the three formats shown above: DataFrame, DataFrame.rdd, and plain RDD. I found that the DataFrame used just 5.5GB of JVM memory, while df.rdd used nearly 95GB and the plain RDD about 69GB. Why do the RDD and DataFrame.rdd versions take so much memory space when the original file is so small on disk?
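(For reference, here is a sketch of how these cached sizes can also be read programmatically; I'm assuming getRDDStorageInfo, which is marked as a developer API, reports the same memSize that the web UI's Storage tab shows.)

// Sketch: print the in-memory size of every cached RDD/DataFrame in this session
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  val mb = info.memSize / (1024L * 1024L)
  println(s"RDD ${info.id} (${info.name}): ${mb} MB cached, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions")
}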
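One variant I plan to try next, to see how much of the gap is plain Java-object overhead: caching the same text file in serialized form. This is only a sketch; I'm assuming MEMORY_ONLY_SER behaves as documented, storing each partition as a byte array instead of deserialized objects.

// Sketch: cache the same file in serialized form and compare its footprint
import org.apache.spark.storage.StorageLevel
val serRDD = spark.sparkContext.textFile(filePath)
serRDD.persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count()  // materializes the cache; size is then visible in the Storage tab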
Q2: I also noticed that when I call df.collect(), it keeps blocking without any exception or further information, while pureRDD.collect() does not have this problem and returns the result successfully. (P.S. My driver is allocated 200GB of JVM heap, along with a 300GB executor, which should be sufficient for such a collect action.)

Hoping for your attention and help.

Best regards and thanks!

Department of Engineering Mechanics
Zhejiang University
Hangzhou 310027, P.R. China
Mobile: (+86)15158859317
E-mail: lyx_z...@zju.edu.cn
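P.S. On Q2: in case it helps, the sketch below is the incremental alternative I intend to try instead of a full collect(). I haven't verified that it avoids the hang; toLocalIterator fetches one partition at a time, so the driver should only ever need memory for the largest single partition.

// Sketch: stream rows back to the driver partition by partition
import scala.collection.JavaConverters._
val rows = df.toLocalIterator().asScala
rows.take(5).foreach(println)  // e.g. inspect the first few Rows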