Hi,
I am reading a Hive ORC table into memory, with the storage level set to
StorageLevel.MEMORY_AND_DISK_SER.
The total size of the Hive table is 5 GB.
I started the spark-shell as below:
spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=512
I have a 10-node cluster, each node with 35 GB of memory and 4 cores, running HDP 2.5.
The SPARK_LOCAL_DIRS location has enough space.
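For reference, here is a rough sketch of the memory these launch flags request from YARN, compared with the cluster described above. The object name ContainerBudget is hypothetical. The sketch assumes the full 35 GB per node is available to YARN, which may not hold on a real cluster, and it ignores any rounding YARN applies to container sizes.

```scala
// Back-of-the-envelope memory budget for the spark-shell flags above.
// Assumption: all 35 GB on each node is usable by YARN containers.
object ContainerBudget {
  val numExecutors = 8              // --num-executors 8
  val executorMemoryMb = 7 * 1024   // --executor-memory 7G
  val overheadMb = 512              // spark.yarn.executor.memoryOverhead=512
  val nodes = 10
  val nodeMemoryMb = 35 * 1024

  // Memory requested per executor container: heap plus off-heap overhead.
  val containerMb: Int = executorMemoryMb + overheadMb        // 7680 MB

  // Memory requested for all executors together.
  val totalExecutorMb: Int = numExecutors * containerMb       // 61440 MB

  // Total memory in the cluster under the assumption above.
  val clusterMb: Int = nodes * nodeMemoryMb                   // 358400 MB

  // Upper bound on executor containers per node, by memory alone.
  val maxContainersPerNode: Int = nodeMemoryMb / containerMb  // 4

  def main(args: Array[String]): Unit = {
    println(s"Per-executor container: $containerMb MB")
    println(s"All executors: $totalExecutorMb MB of $clusterMb MB in the cluster")
    println(s"Max containers per node (memory only): $maxContainersPerNode")
  }
}
```

Under these assumptions the executors request about 60 GB in total, well below the cluster's memory.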
My concern is that the simple code below, which loads the data into memory, takes approx. 10-12 minutes.
If I change num-executors, driver-memory, executor-memory, or executor-cores
to values other than those mentioned above, I get a "No space left on device" error.
While the job is running, memory consumption varies from node to node, from 7 GB to 20 GB.
import org.apache.spark.storage.StorageLevel
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
val tab1 = sqlContext.sql("select * from xyz").repartition(150).persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.registerTempTable("AUDIT")
tab1.count()
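For reference, here is a rough sketch of the partition arithmetic implied by the code and launch flags above. The object name PartitionEstimate is hypothetical. The 5 GB figure is the table size on disk, so the per-partition size is only an estimate: ORC data is usually compressed, and the in-memory size of each partition may differ.

```scala
// Rough estimate of partition size and task scheduling for repartition(150).
// Assumption: data is spread evenly and 5 GB is the on-disk table size.
object PartitionEstimate {
  val tableSizeMb = 5 * 1024   // 5 GB Hive table
  val partitions = 150         // repartition(150)
  val numExecutors = 8         // --num-executors 8
  val coresPerExecutor = 2     // --executor-cores 2

  // Average on-disk-equivalent data per partition, about 34 MB.
  val mbPerPartition: Double = tableSizeMb.toDouble / partitions

  // Tasks that can run at the same time across all executors.
  val concurrentTasks: Int = numExecutors * coresPerExecutor   // 16

  // Number of scheduling rounds needed to process every partition.
  val taskWaves: Int =
    math.ceil(partitions.toDouble / concurrentTasks).toInt     // 10

  def main(args: Array[String]): Unit = {
    println(f"Approx. data per partition: $mbPerPartition%.1f MB")
    println(s"Concurrent tasks: $concurrentTasks, task waves: $taskWaves")
  }
}
```

Note that repartition(150) triggers a full shuffle of the table before it is persisted, so shuffle files are written to local disk on the worker nodes.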
Kindly advise how to improve the performance of loading a Hive table into Spark
memory and how to avoid the space issue.
Regards,
~Sri