One pass.

1) Read LZO-compressed files from HDFS.
2) Perform a map operation.
3) Cache the results of the map operation.
4) Call saveAsHadoopFile to write LZO back to HDFS.

Without the cache, the job will stall.  
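
In code, the pipeline is roughly this (a sketch only; LzoTextInputFormat and LzopCodec come from the hadoop-lzo package, which I assume is on the classpath, and transform() stands in for the real map logic):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.storage.StorageLevel
    import com.hadoop.mapreduce.LzoTextInputFormat
    import com.hadoop.compression.lzo.LzopCodec

    // 1) Read the LZO-compressed text from HDFS
    val raw = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///in/path")

    // 2) + 3) Map to (String, String) pairs and cache; transform() is a placeholder
    val mapped = raw
      .map { case (_, text) => transform(text.toString) }
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // 4) Write LZO-compressed text back to HDFS; TextOutputFormat writes the pairs' toString
    mapped.saveAsHadoopFile(
      "hdfs:///out/path",
      classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]],
      classOf[LzopCodec])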

mn

> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> Is there any specific reason for caching the RDD? How many passes do you make 
> over the dataset? 
> 
> Mohammed
> 
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com] 
> Sent: Saturday, October 3, 2015 9:50 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
> 
> Is there any more information or are there best practices here? I have the exact 
> same issue when reading large datasets from HDFS (larger than the available RAM): 
> the job will not run unless I set the RDD persistence level to 
> MEMORY_AND_DISK_SER and use nearly all of the cluster's resources.
> 
> Should I repartition this RDD so that the number of partitions equals the number 
> of cores?
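> 
> For concreteness, this is roughly what I mean (a sketch; the path is a placeholder 
> and the repartition line is the option I am asking about):
> 
>     import org.apache.spark.storage.StorageLevel
> 
>     val rdd = sc.textFile("hdfs:///some/large/input")
>       .persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized, spills to disk when memory is short
> 
>     // e.g. match the partition count to the total number of cores?
>     val repartitioned = rdd.repartition(sc.defaultParallelism)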
> 
> I notice that the job duration reported in the YARN UI is about 30 minutes 
> longer than in the Spark UI. Also, when the job first starts, no tasks are 
> shown in the Spark UI. Is that expected?
> 
> All I'm doing is reading records from HDFS text files with sc.textFile and 
> rewriting them back to HDFS, grouped by a timestamp.
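> 
> Roughly (a sketch; extractTimestamp() is a placeholder for however the timestamp 
> is parsed from each record):
> 
>     sc.textFile("hdfs:///in/records")
>       .map(line => (extractTimestamp(line), line))
>       .groupByKey()                                   // shuffles the full dataset
>       .saveAsTextFile("hdfs:///out/grouped-by-timestamp")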
> 
> Thanks,
> mn
> 
>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>> 
>> 1) It is not required to have the same amount of memory as the data.
>> 2) By default, the number of partitions is equal to the number of HDFS blocks.
>> 3) Yes, the read operation is lazy.
>> 4) It is okay to have more partitions than cores.
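>> 
>> For example (the path is a placeholder):
>> 
>>     val rdd = sc.textFile("hdfs:///data/batch")   // no data is read yet; one partition per HDFS block
>>     println(rdd.partitions.length)                // computes the splits, still reads no data
>>     println(rdd.count())                          // only this action reads the data, one partition per task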
>> 
>> Mohammed
>> 
>> -----Original Message-----
>> From: davidkl [mailto:davidkl...@hotmail.com]
>> Sent: Monday, September 28, 2015 1:40 AM
>> To: user@spark.apache.org
>> Subject: laziness in textFile reading from HDFS?
>> 
>> Hello,
>> 
>> I need to process a significant amount of data every day, about 4TB. This 
>> will be processed in batches of about 140GB. The cluster this will be 
>> running on doesn't have enough memory to hold the dataset at once, so I am 
>> trying to understand how this works internally.
>> 
>> When using textFile to read an HDFS folder (containing multiple files), I 
>> understand that the number of partitions created is equal to the number of 
>> HDFS blocks, correct? Are those created lazily? I mean, if the number 
>> of blocks/partitions is larger than the number of cores/threads the Spark 
>> driver was launched with (N), are N partitions created initially and then 
>> the rest when required? Or are all those partitions created up front?
>> 
>> I want to avoid reading the whole dataset into memory just to spill it out to 
>> disk when there is not enough memory.
>> 
>> Thanks! 
>> 

