You can use SparkContext.wholeTextFiles(). Please note that the documentation suggests: "Small files are preferred, large file is also allowable, but may cause bad performance."
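A minimal, untested sketch of that approach (assuming sc is your existing SparkContext, the glob path is just a placeholder for yours, and, as in your single-file example, the first 4 lines of each file form its head):

val parsed = sc.wholeTextFiles("hdfs:///iis-logs/u_ex*.log").flatMap {
  case (path, content) =>
    val lines = content.split("\r?\n")
    val head  = lines.take(4)                       // this file's head
    lines.drop(4).map(line => (head.toSeq, line))   // pair each data line with its head
}

You would replace the (head, line) pairing with your actual parsing logic.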
--Hossein

On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> This is an interesting question. I’m curious to know as well how this
> problem can be approached.
>
> Is there a way, perhaps, to ensure that each input file matching the glob
> expression gets mapped to exactly one partition? Then you could probably
> get what you want using RDD.mapPartitions().
>
> Nick
>
>
> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com> wrote:
>
>> Hi, all
>>
>> We are migrating from MapReduce to Spark and have encountered a problem.
>>
>> Our input files are IIS logs with a file head. It's easy to get the file
>> head if we process only one file, e.g.
>>
>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>> val head = lines.take(4)
>>
>> Then we can write our map method using this head.
>>
>> However, if we input multiple files, each of which may have a different
>> file head, how can we get the file head for each partition?
>>
>> It seems we have two options:
>>
>> 1. Still use textFile() to get lines.
>>
>> Since each partition may have a different "head", we have to write a
>> mapPartitionsWithContext method. However, we can't find a way to get the
>> "head" for each partition.
>>
>> In our former MapReduce program, we could simply use
>>
>> Path path = ((FileSplit) context.getInputSplit()).getPath()
>>
>> but there seems to be no way in Spark, since HadoopPartition, which wraps
>> the InputSplit inside HadoopRDD, is a private class.
>>
>> 2. Use wholeTextFiles() to get whole contents.
>>
>> It's easy to get the file head for each file, but according to the
>> documentation, this API is better suited for small files.
>>
>> *Any suggestions on how to process these files with heads?*
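For what it's worth, a rough sketch of the mapPartitions() idea Nick floats above, assuming (and this is the key assumption, which plain textFile() does not guarantee for large or combined files) that each partition holds exactly one whole file, and that the head lines are the ones starting with '#', as in the W3C/IIS extended log format:

val lines = sc.textFile("hdfs:///iis-logs/u_ex*.log")

val parsed = lines.mapPartitions { iter =>
  val buffered = iter.buffered
  val head = scala.collection.mutable.ArrayBuffer[String]()
  // peel the leading '#' head lines off the front of this partition
  while (buffered.hasNext && buffered.head.startsWith("#")) {
    head += buffered.next()
  }
  buffered.map(line => (head.toSeq, line))   // pair each data line with its file's head
}

If a partition can contain more than one file, or a file can span partitions, this breaks, which is why wholeTextFiles() is the simpler route when the files are reasonably small.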