The on-disk size will not accurately reflect the amount of memory needed to
store that data. Memory usage depends heavily on the data structures you are
using; the objects that hold the data have their own overheads.
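For illustration (a rough sketch of mine, not anything from this thread), you
can use Spark's SizeEstimator to compare the on-disk size of a record with its
estimated in-memory size; the sample string below is made up:

  import org.apache.spark.util.SizeEstimator

  object OverheadCheck {
    def main(args: Array[String]): Unit = {
      // A typical short text record: ~41 bytes on disk as UTF-8.
      val line = "some,typical,csv,record,with,a,few,fields"
      // As a JVM String it also carries an object header, a char[] (2 bytes
      // per character), length/offset fields and padding, so the estimate is
      // usually several times the raw byte count.
      val inMemory = SizeEstimator.estimate(line)
      println(s"on-disk bytes ~ ${line.getBytes("UTF-8").length}, " +
        s"estimated in-memory bytes = $inMemory")
    }
  }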

Try again with a reduced input size or with more memory.
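Two things that sometimes help with a load like this, as a sketch (my
suggestion, with a placeholder app name, not the poster's actual code): cache
the RDD in serialized form and let partitions that still do not fit spill to
disk, rather than relying on deserialized MEMORY_ONLY caching.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  object LoadLarge {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("load-large"))
      val lines = sc.textFile("shared_dir/*")
      // Serialized storage trades some CPU for a much smaller per-record
      // footprint; MEMORY_AND_DISK_SER spills partitions that still do not
      // fit in memory to local disk instead of failing.
      lines.persist(StorageLevel.MEMORY_AND_DISK_SER)
      println(lines.count())
      sc.stop()
    }
  }

On the "more memory" side, --executor-memory (and --driver-memory) on
spark-submit is the usual knob, and switching spark.serializer to
org.apache.spark.serializer.KryoSerializer typically shrinks the serialized
records further.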



On 21/11/14 8:26 am, "SK" <[email protected]> wrote:

>Hi,
>
>I am using sc.textFile("shared_dir/*") to load all the files in a directory
>on a shared partition. The total size of the files in this directory is
>1.2 TB. We have a 16-node cluster with 3 TB of memory (1 node is the driver,
>15 nodes are workers). But the loading fails after around 1 TB of data is
>read (in the mapPartitions stage). Basically, there is no progress in
>mapPartitions after 1 TB of input. It seems that the cluster has sufficient
>memory, but I am not sure why the program gets stuck.
>
>1.2 TB of data divided across 15 worker nodes would require each node to
>hold about 80 GB in memory. Every node in the cluster is allocated around
>170 GB of memory. According to the Spark documentation, the default storage
>fraction for RDDs is 60% of the allocated memory. I have increased that to
>0.8 (by setting --conf spark.storage.memoryFraction=0.8), so each node
>should have around 136 GB of memory for storing RDDs. So I am not sure why
>the program is failing in the mapPartitions stage, where it seems to be
>loading the data.
>
>I don't have a good idea about the Spark internals and would appreciate any
>help in fixing this issue.
>
>thanks
>   
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-failing-when-loading-large-amount-of-data-tp19441.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
