The on-disk size will not accurately reflect the amount of memory needed to hold that data. Memory usage depends heavily on the data structures you are using; the objects that hold the data carry their own overheads.
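One rough way to see the gap between on-disk size and in-memory size is to materialize a small sample and measure it. This is only a sketch: it assumes a Spark version where org.apache.spark.util.SizeEstimator is publicly accessible, and "shared_dir/part-00000" is a placeholder for one of your input files.

    import org.apache.spark.util.SizeEstimator

    // Pull a small sample of one input file onto the driver.
    val sample = sc.textFile("shared_dir/part-00000").take(10000)

    // Approximate JVM heap footprint of the sampled lines as Java objects.
    val inMemoryBytes = SizeEstimator.estimate(sample)

    // Rough on-disk equivalent: total character count (ignores encoding details).
    val onDiskBytes = sample.map(_.length.toLong).sum

    println(s"memory/disk ratio ~ ${inMemoryBytes.toDouble / onDiskBytes}")

For plain text lines stored as Java String objects this ratio is often well above 2x, which is why 1.2 TB on disk can exceed the heap you have budgeted.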
Try with a reduced input size or more memory.

On 21/11/14 8:26 am, "SK" <[email protected]> wrote:

>Hi,
>
>I am using sc.textFile("shared_dir/*") to load all the files in a directory
>on a shared partition. The total size of the files in this directory is 1.2
>TB. We have a 16-node cluster with 3 TB of memory (1 node is the driver, 15
>nodes are workers). But the loading fails after around 1 TB of data is read
>(in the mapPartitions stage). Basically, there is no progress in
>mapPartitions after 1 TB of input. It seems that the cluster has sufficient
>memory, but I am not sure why the program gets stuck.
>
>1.2 TB of data divided across 15 worker nodes would require each node to
>have about 80 GB of memory. Every node in the cluster is allocated around
>170 GB of memory. According to the Spark documentation, the default storage
>fraction for RDDs is 60% of the allocated memory. I have increased that to
>0.8 (by setting --conf spark.storage.memoryFraction=0.8), so each node
>should have around 136 GB of memory for storing RDDs. So I am not sure why
>the program is failing in the mapPartitions stage, where it seems to be
>loading the data.
>
>I don't have a good idea about the Spark internals and would appreciate any
>help in fixing this issue.
>
>thanks
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-failing-when-loading-large-amount-of-data-tp19441.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
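If the data really has to be kept in memory, one option worth trying against the quoted job (a sketch, not tested on that cluster) is to persist the RDD in serialized form: serialized storage usually needs far less heap per node than the default deserialized Java objects, and the MEMORY_AND_DISK_SER level spills whatever still does not fit instead of failing.

    import org.apache.spark.storage.StorageLevel

    // Same input path as in the original post.
    val lines = sc.textFile("shared_dir/*")

    // Store partitions as serialized byte arrays; spill to disk when they
    // do not fit in the storage fraction of the heap.
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Force materialization so the caching actually happens.
    println(lines.count())

The trade-off is extra CPU to serialize/deserialize records, but it avoids the per-object overhead that makes the in-memory footprint much larger than the 1.2 TB on disk.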
