It will certainly cause bad performance, since it reads the whole content of a large file into a single value instead of splitting it into partitions. Typically one of our files is 1 GB. Suppose we have 3 such files: that way there would be only 3 key-value pairs, and thus at most 3 tasks.
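To make the concern concrete, here is a minimal sketch (the glob is hypothetical, and sc is an existing SparkContext, as elsewhere in this thread):

    // Each input file becomes exactly one (path, content) record:
    val files = sc.wholeTextFiles("hdfs:///logs/u_ex*.log")  // hypothetical glob

    // With 3 files there are only 3 records, so no stage gets more than
    // 3 useful tasks, and each 1 GB content String must fit on one executor.
    val heads = files.map { case (path, content) =>
      (path, content.split("\n").take(4))  // the head is easy; the cost is memory
    }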
2014-07-30 12:49 GMT+08:00 Hossein <fal...@gmail.com>:

> You can use SparkContext.wholeTextFiles().
>
> Please note that the documentation suggests: "Small files are preferred,
> large file is also allowable, but may cause bad performance."
>
> --Hossein
>
>
> On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is an interesting question. I'm curious to know as well how this
>> problem can be approached.
>>
>> Is there a way, perhaps, to ensure that each input file matching the glob
>> expression gets mapped to exactly one partition? Then you could probably
>> get what you want using RDD.mapPartitions().
>>
>> Nick
>>
>>
>> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com>
>> wrote:
>>
>>> Hi, all
>>>
>>> We are migrating from MapReduce to Spark and have run into a problem.
>>>
>>> Our input files are IIS logs with file heads. It's easy to get the file
>>> head if we process only one file, e.g.
>>>
>>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>>> val head = lines.take(4)
>>>
>>> Then we can write our map method using this head.
>>>
>>> However, if we input multiple files, each of which may have a different
>>> file head, how can we get the head for each partition?
>>>
>>> It seems we have two options:
>>>
>>> 1. Still use textFile() to get lines.
>>>
>>> Since each partition may have a different head, we would have to write a
>>> mapPartitionsWithContext method. However, we can't find a way to get the
>>> head for each partition.
>>>
>>> In our former MapReduce program, we could simply use
>>>
>>> Path path = ((FileSplit) context.getInputSplit()).getPath();
>>>
>>> but there seems to be no way to do this in Spark, since HadoopPartition,
>>> which wraps the InputSplit inside HadoopRDD, is a private class.
>>>
>>> 2. Use wholeTextFiles() to get whole file contents.
>>>
>>> It's easy to get the file head for each file, but according to the
>>> documentation, this API is better suited to small files.
>>>
>>> *Any suggestions on how to process these files with heads?*
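Coming back to Nick's suggestion: rather than forcing one partition per file, a variant that is easy to express is one RDD per file, reading each head on the driver with take(4) and closing over it in mapPartitions. Below is a rough, untested sketch; the file list and parseWithHead() are hypothetical, and note that take(4) launches one small job per file:

    // Assuming an existing SparkContext `sc`. parseWithHead() is a placeholder
    // for whatever per-line parsing the IIS head makes possible.
    def parseWithHead(head: Seq[String], lines: Iterator[String]): Iterator[String] =
      lines.filterNot(l => head.contains(l))  // placeholder: just drop header lines

    val paths = Seq("hdfs:///logs/u_ex14073011.log",  // hypothetical list; could
                    "hdfs:///logs/u_ex14073012.log")  // come from FileSystem.listStatus

    val perFile = paths.map { p =>
      val lines = sc.textFile(p)
      val head  = lines.take(4).toSeq                 // driver-side read of the head
      lines.mapPartitions(it => parseWithHead(head, it))
    }
    val records = sc.union(perFile)  // one combined RDD over all files

The head lines themselves still appear in the first partition of each file, which is why the parser has to recognize and skip them.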