1) No, it is not required to have as much memory as the data.
2) By default, the number of partitions is equal to the number of HDFS blocks.
3) Yes, the read operation is lazy; see the sketch below.
4) It is fine to have more partitions than cores; the extra partitions simply wait as queued tasks.
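
For illustration, here is a minimal Scala sketch (the HDFS path is hypothetical, and it assumes an existing SparkContext named sc):

  // Nothing is read from HDFS here: textFile only records the lineage.
  val lines = sc.textFile("hdfs:///data/some-batch/")

  // Partition metadata (one partition per HDFS block by default) comes from
  // the NameNode; the block contents are still untouched.
  println("partitions = " + lines.partitions.length)

  // The actual read happens now, one block per task; tasks beyond the number
  // of available cores simply wait in the scheduler queue.
  val nonEmpty = lines.filter(_.nonEmpty).count()
  println("non-empty lines = " + nonEmpty)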

Mohammed

-----Original Message-----
From: davidkl [mailto:davidkl...@hotmail.com] 
Sent: Monday, September 28, 2015 1:40 AM
To: user@spark.apache.org
Subject: laziness in textFile reading from HDFS?

Hello,

I need to process a significant amount of data every day, about 4TB. This will 
be processed in batches of about 140GB. The cluster this will be running on 
doesn't have enough memory to hold the dataset at once, so I am trying to 
understand how this works internally.

When using textFile to read an HDFS folder (containing multiple files), I 
understand that the number of partitions created is equal to the number of 
HDFS blocks, correct? Are those created in a lazy way? I mean, if the number of 
blocks/partitions is larger than the number of cores/threads the Spark driver 
was launched with (N), are N partitions created initially and then the rest 
when required? Or are all of those partitions created up front?

I want to avoid reading the whole dataset into memory just to spill it out to 
disk if there is not enough memory.

Thanks! 






---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
