Is there any specific reason for caching the RDD? How many passes do you make over 
the dataset? 

Mohammed

-----Original Message-----
From: Matt Narrell [mailto:matt.narr...@gmail.com] 
Sent: Saturday, October 3, 2015 9:50 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?

Is there any more information, or are there best practices here?  I have the exact same 
issue when reading large data sets from HDFS (larger than available RAM): I cannot run 
the job without setting the RDD persistence level to MEMORY_AND_DISK_SER and using 
nearly all the cluster resources.

Should I repartition this RDD so that the number of partitions equals the number of cores?  

I notice that the job duration reported on the YARN UI is about 30 minutes longer than 
on the Spark UI.  When the job initially starts, no tasks are shown in the Spark UI?

All I'm doing is reading records from HDFS text files with sc.textFile and rewriting 
them back to HDFS grouped by a timestamp.
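
Roughly, the shape of the job is something like this (a simplified sketch, not my 
exact code; the timestamp parsing is made up):

    // One partition per HDFS block by default; reading happens lazily.
    val lines = sc.textFile("hdfs:///data/input")          // hypothetical path

    // Key each record by its timestamp, assumed here to be the first CSV field.
    val byTimestamp = lines.map { line =>
      val ts = line.split(",")(0)                          // made-up record layout
      (ts, line)
    }

    // Group records by timestamp and write each group back out to HDFS.
    byTimestamp.groupByKey()
      .map { case (ts, recs) => recs.mkString("\n") }
      .saveAsTextFile("hdfs:///data/output")               // hypothetical path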

Thanks,
mn

> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> 1) You do not need as much memory as you have data.
> 2) By default, the number of partitions is equal to the number of HDFS blocks.
> 3) Yes, the read operation is lazy.
> 4) It is fine to have more partitions than cores.
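> 
> A quick way to see 2) and 3) for yourself (a minimal sketch, assuming a 
> SparkContext `sc` and any HDFS path):
> 
>     // Returns immediately: textFile is lazy and reads no data here.
>     val rdd = sc.textFile("hdfs:///some/path")   // hypothetical path
> 
>     // One partition per HDFS block by default; this only inspects block metadata.
>     println(rdd.partitions.length)
> 
>     // The data itself is read only when an action runs, e.g.:
>     println(rdd.count())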
> 
> Mohammed
> 
> -----Original Message-----
> From: davidkl [mailto:davidkl...@hotmail.com]
> Sent: Monday, September 28, 2015 1:40 AM
> To: user@spark.apache.org
> Subject: laziness in textFile reading from HDFS?
> 
> Hello,
> 
> I need to process a significant amount of data every day, about 4TB. This 
> will be processed in batches of about 140GB. The cluster this will be running 
> on doesn't have enough memory to hold the dataset at once, so I am trying to 
> understand how this works internally.
> 
> When using textFile to read an HDFS folder (containing multiple files), I 
> understand that the number of partitions created is equal to the number of 
> HDFS blocks, correct? Are those created in a lazy way? I mean, if the number 
> of blocks/partitions is larger than the number of cores/threads the Spark 
> driver was launched with (N), are N partitions created initially and then the 
> rest when required? Or are all those partitions created up front?
> 
> I want to avoid reading the whole dataset into memory just to spill it out to 
> disk when there is not enough memory.
> 
> Thanks! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
