I would try to track down the "no space left on device" error - find out where
it originates from, since you should be able to allocate 10 executors
with 4 cores and 15GB RAM each quite easily. In that case, you may want to
increase the memory overhead so YARN doesn't kill your executors.
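For example, keeping the rest of your command unchanged and only bumping the
overhead (1024 here is just a starting point, not a magic number):

spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=1024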
Check that no local drives are filling up with temporary data by running
a watch on df on all nodes.
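Something as simple as this on each node should do (check whatever
SPARK_LOCAL_DIRS / yarn.nodemanager.local-dirs resolve to on your cluster):

watch -n 5 df -h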
Also check that no quotas are being enforced, and that your log partitions
aren't overflowing.

Depending on your disk and network speed, as well as the time it takes YARN
to allocate resources and Spark to initialize the SparkContext, 10 minutes
doesn't sound too bad. Also, I don't think 150 partitions is a helpful
partition count if you have 7G RAM per executor and aren't doing any
joins or other memory-intensive calculations. Try again with 64
partitions and see if the reduced overhead helps.
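Roughly the same code you posted, just with fewer partitions (64 is only a
starting point, tune from there):

val tab1 = sqlContext.sql("select * from xyz")
  .repartition(64)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.count()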
Also, track which actions/tasks are running longer than expected in the Spark
UI. That should help identify where your bottleneck is located.

On Thu, May 11, 2017 at 5:46 PM, Anantharaman, Srinatha (Contractor) <
srinatha_ananthara...@comcast.com> wrote:

> Hi,
>
>
>
> I am reading a Hive Orc table into memory, StorageLevel is set to
> (StorageLevel.MEMORY_AND_DISK_SER)
>
> Total size of the Hive table is 5GB
>
> Started the spark-shell as below
>
>
>
> spark-shell --master yarn --deploy-mode client --num-executors 8
> --driver-memory 5G --executor-memory 7G --executor-cores 2
> --conf spark.yarn.executor.memoryOverhead=512
>
> I have 10 node cluster each with 35 GB memory and 4 cores running on HDP
> 2.5
>
> SPARK_LOCAL_DIRS location has enough space
>
>
>
> My concern is that the simple code below takes approx. 10-12 mins to load
> the data into memory.
>
> If I change the values for
> num-executors/driver-memory/executor-memory/executor-cores from those
> mentioned above, I get a “No space left on device” error
>
> While running, each node consumes a varying amount of memory, from 7 GB to 20 GB
>
>
>
> import org.apache.spark.storage.StorageLevel
>
>
>
>
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
>
> sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.
> recursive=true")
>
> val tab1 = sqlContext.sql("select * from xyz").repartition(150).persist(StorageLevel.MEMORY_AND_DISK_SER)
>
> tab1.registerTempTable("AUDIT")
>
> tab1.count()
>
>
>
> Kindly advise how to improve the performance of loading the Hive table into
> Spark memory and avoid the space issue
>
>
>
> Regards,
>
> ~Sri
>
