It will certainly cause bad performance, since it reads the whole content
of each large file into a single value instead of splitting it into partitions.

Each of our files is typically about 1 GB. Suppose we have 3 such files:
there would be only 3 key-value pairs, and thus at most 3 tasks.
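
A minimal sketch of the limitation (with a hypothetical hdfs:///logs/ path
holding the 3 files):

val files = sc.wholeTextFiles("hdfs:///logs/*.log")
// Each file becomes a single (path, content) pair, so a 1 GB file turns
// into one 1 GB string value held by a single task.
println(files.count())  // 3 pairs -> at most 3 tasks do useful work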


2014-07-30 12:49 GMT+08:00 Hossein <fal...@gmail.com>:

> You can use SparkContext.wholeTextFiles().
>
> Please note that the documentation suggests: "Small files are preferred,
> large file is also allowable, but may cause bad performance."
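>
> For example, each element is a (path, content) pair, so the per-file head
> falls out directly (a sketch, with a hypothetical hdfs:///logs/ path):
>
> val files = sc.wholeTextFiles("hdfs:///logs/*.log")
> val heads = files.mapValues(_.split("\n").take(4))  // path -> first 4 lines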
>
> --Hossein
>
>
> On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is an interesting question. I’m curious to know as well how this
>> problem can be approached.
>>
>> Is there a way, perhaps, to ensure that each input file matching the glob
>> expression gets mapped to exactly one partition? Then you could probably
>> get what you want using RDD.mapPartitions().
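>>
>> Something along these lines might work (a sketch with hypothetical paths;
>> coalesce(1), rather than minPartitions, is what forces each file's lines
>> into a single partition, at the cost of one task per file):
>>
>> val paths = Seq("hdfs:///logs/u_ex14073011.log",
>>                 "hdfs:///logs/u_ex14073012.log")
>> val parsed = sc.union(paths.map { p =>
>>   sc.textFile(p).coalesce(1).mapPartitions { it =>
>>     val head = List.fill(4)(it.next())  // assumes at least 4 head lines
>>     it.map(line => (head, line))        // pair each record with its head
>>   }
>> })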
>>
>> Nick
>>
>>
>> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com>
>> wrote:
>>
>>> Hi, all
>>>
>>> We are migrating from MapReduce to Spark and have encountered a problem.
>>>
>>> Our input files are IIS logs, each beginning with a file head. It's easy
>>> to get the head if we process only one file, e.g.
>>>
>>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>>> val head = lines.take(4)  // the file's first 4 lines, as Array[String]
>>>
>>> Then we can write our map method using this head.
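>>>
>>> For instance, IIS logs in the W3C extended format carry a "#Fields:" line
>>> in the head that names the columns (a sketch assuming that format):
>>>
>>> val fields = head.find(_.startsWith("#Fields:")).get
>>>   .stripPrefix("#Fields:").trim.split(" ")
>>> val records = lines.filter(!_.startsWith("#"))
>>>   .map(line => fields.zip(line.split(" ")).toMap)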
>>>
>>> However, if we input multiple files, each of which may have a different
>>> head, how can we get the head for each partition?
>>>
>>> It seems we have two options:
>>>
>>> 1. still use textFile() to get lines.
>>>
>>> Since each partition may have a different "head", we would have to write a
>>> mapPartitionsWithContext method. However, we can't find a way to get the
>>> "head" for each partition.
>>>
>>> In our former MapReduce program, we could simply use
>>>
>>> Path path = ((FileSplit) context.getInputSplit()).getPath();
>>>
>>> but there seems to be no way in Spark, since HadoopPartition, which wraps
>>> the InputSplit inside HadoopRDD, is a private class.
>>>
>>> 2. use wholeTextFiles() to get whole contents.
>>>
>>> It's easy to get the head for each file, but according to the
>>> documentation, this API is better suited to small files.
>>>
>>>
>>> *Any suggestions on how to process these files with heads?*
>>>
>>
>>
>
