Hello,

I'm using Spark for a project these days, and I noticed that when loading data stored in Hadoop HDFS, there is a huge difference in JVM memory footprint between the DataFrame and RDD formats. Below is my spark-shell session. The original file (testData) is an ordinary text file of about 11GB on disk; each line has the format "Id1,Id2", where Id1 and Id2 are both random int32 numbers.

// code segment, run in spark-shell
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Dataset, Row}

// this text file's size is 11GB on disk
var filePath = "hdfs://10.10.23.105:9000/testData"
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load(filePath)

// the first format, the DataFrame itself, which turns out to be 5.5GB in memory
df.cache()
df.count()

// the second format, df.rdd, which turns out to be 95GB in memory
df.rdd.cache()
df.rdd.count()

// the third format, a plain RDD, which turns out to be 88GB in memory
val pureRDD = spark.sparkContext.textFile(filePath)
pureRDD.cache()
pureRDD.count()

// the line below goes wrong when I use collect(), even though the driver
// has 200GB and the executor 300GB of memory allocated
df.collect()

So I have run into two problems:

Q1: I loaded and cached the identical raw file in the three formats shown above: DataFrame, DataFrame.rdd, and plain RDD. I found that the DataFrame used just 5.5GB of JVM memory, while df.rdd used nearly 95GB and the plain RDD about 69GB. Why do the RDD and DataFrame.rdd versions take so much memory space when the original file is so small on disk?
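(For reference, here is a sketch of how these cached sizes can also be read programmatically; I'm assuming getRDDStorageInfo, which is marked as a developer API, reports the same memSize that the web UI's Storage tab shows.)

// Sketch: print the in-memory size of every cached RDD/DataFrame in this session
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  val mb = info.memSize / (1024L * 1024L)
  println(s"RDD ${info.id} (${info.name}): ${mb} MB cached, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions")
}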
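One variant I plan to try next, to see how much of the gap is plain Java-object overhead: caching the same text file in serialized form. This is only a sketch; I'm assuming MEMORY_ONLY_SER behaves as documented, storing each partition as a byte array instead of deserialized objects.

// Sketch: cache the same file in serialized form and compare its footprint
import org.apache.spark.storage.StorageLevel
val serRDD = spark.sparkContext.textFile(filePath)
serRDD.persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count()  // materializes the cache; size is then visible in the Storage tab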
Q2: I also noticed that when I call df.collect(), it keeps blocking without any exception or further information, while pureRDD.collect() does not have this problem and returns the result successfully. (P.S. My driver is allocated 200GB of JVM heap, along with a 300GB executor, which should be sufficient for such a collect action.)

Hoping for your attention and help.

Best regards and thanks!

Department of Engineering Mechanics
Zhejiang University
Hangzhou 310027, P.R. China
Mobile: (+86)15158859317
E-mail: lyx_z...@zju.edu.cn
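P.S. On Q2: in case it helps, the sketch below is the incremental alternative I intend to try instead of a full collect(). I haven't verified that it avoids the hang; toLocalIterator fetches one partition at a time, so the driver should only ever need memory for the largest single partition.

// Sketch: stream rows back to the driver partition by partition
import scala.collection.JavaConverters._
val rows = df.toLocalIterator().asScala
rows.take(5).foreach(println)  // e.g. inspect the first few Rows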