One. I read in LZO-compressed files from HDFS, perform a map operation, cache the results of that map operation, and call saveAsHadoopFile to write LZO back to HDFS.
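
Roughly, a minimal sketch of that pipeline (the paths and the map function are placeholders, and it assumes the hadoop-lzo classes LzoTextInputFormat and LzopCodec are on the classpath; this is not the actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat
import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo, assumed on the classpath
import com.hadoop.compression.lzo.LzopCodec      // from hadoop-lzo, assumed on the classpath

object LzoRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lzo-roundtrip"))

    // 1) Read LZO-compressed text from HDFS.
    val input = sc.newAPIHadoopFile(
      "hdfs:///data/in",                         // placeholder path
      classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

    // 2) Map (copying out of Hadoop's reused Text object), and 3) cache the mapped results.
    val mapped = input
      .map { case (_, line) => line.toString.trim }   // placeholder map operation
      .cache()

    // 4) Write LZO-compressed text back to HDFS with saveAsHadoopFile.
    mapped
      .map(line => (NullWritable.get(), new Text(line)))
      .saveAsHadoopFile(
        "hdfs:///data/out",                      // placeholder path
        classOf[NullWritable], classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        classOf[LzopCodec])

    sc.stop()
  }
}
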
Without the cache, the job will stall.

mn

> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> Is there any specific reason for caching the RDD? How many passes do you make over the dataset?
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Saturday, October 3, 2015 9:50 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
>
> Is there any more information or best practices here? I have the exact same issues when reading large data sets from HDFS (larger than available RAM), and I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER and using nearly all the cluster resources.
>
> Should I repartition this RDD to be equal to the number of cores?
>
> I notice that the job duration on the YARN UI is about 30 minutes longer than on the Spark UI. When the job initially starts, there are no tasks shown in the Spark UI..?
>
> All I'm doing is reading records from HDFS text files with sc.textFile, and rewriting them back to HDFS grouped by a timestamp.
>
> Thanks,
> mn
>
>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>
>> 1) It is not required to have the same amount of memory as data.
>> 2) By default, the number of partitions is equal to the number of HDFS blocks.
>> 3) Yes, the read operation is lazy.
>> 4) It is okay to have more partitions than cores.
>>
>> Mohammed
>>
>> -----Original Message-----
>> From: davidkl [mailto:davidkl...@hotmail.com]
>> Sent: Monday, September 28, 2015 1:40 AM
>> To: user@spark.apache.org
>> Subject: laziness in textFile reading from HDFS?
>>
>> Hello,
>>
>> I need to process a significant amount of data every day, about 4TB. This will be processed in batches of about 140GB. The cluster this will be running on doesn't have enough memory to hold the dataset at once, so I am trying to understand how this works internally.
>>
>> When using textFile to read an HDFS folder (containing multiple files), I understand that the number of partitions created is equal to the number of HDFS blocks, correct? Are those created in a lazy way? I mean, if the number of blocks/partitions is larger than the number of cores/threads the Spark driver was launched with (N), are N partitions created initially and then the rest when required? Or are all those partitions created up front?
>>
>> I want to avoid reading the whole dataset into memory just to spill it out to disk if there is not enough memory.
>>
>> Thanks!
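
For reference, a rough sketch of the kind of job described in the quoted message above (read with sc.textFile, persist at MEMORY_AND_DISK_SER, repartition to the core count, write back grouped by a timestamp key). The timestamp extraction, the paths, and the MultipleTextOutputFormat-based directory layout are placeholders and assumptions, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Write each record into a sub-directory named after its key (placeholder layout).
class KeyAsDirectoryTextOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name
  override def generateActualKey(key: String, value: String): String =
    null  // drop the key from the output lines
}

object RegroupByTimestamp {
  // Placeholder: pull a timestamp key (e.g. "2015-10-03") out of a record.
  def timestampKey(record: String): String = record.take(10)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("regroup-by-timestamp"))

    val records = sc.textFile("hdfs:///data/raw")      // one partition per HDFS block by default
      .repartition(sc.defaultParallelism)              // optional: match the number of cores
      .persist(StorageLevel.MEMORY_AND_DISK_SER)       // the persistence level mentioned above

    records
      .map(line => (timestampKey(line), line))
      .saveAsHadoopFile(
        "hdfs:///data/by-timestamp",                   // placeholder output root
        classOf[String], classOf[String],
        classOf[KeyAsDirectoryTextOutput])

    sc.stop()
  }
}
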