You can use SparkContext.wholeTextFiles().

Please note that the documentation suggests: "Small files are preferred,
large file is also allowable, but may cause bad performance."
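
For example, a rough sketch (assuming, as in your textFile() snippet below,
that the header is the first four lines of each file; parse() is just a
placeholder for your own record parser):

val files = sc.wholeTextFiles("hdfs://.../*.log")   // one (path, content) pair per file
val records = files.flatMap { case (path, content) =>
  val lines = content.split("\n")
  val head = lines.take(4)                           // per-file header lines
  lines.drop(4).map(line => parse(head, line))       // parse() is hypothetical
}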

--Hossein


On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> This is an interesting question. I’m curious to know as well how this
> problem can be approached.
>
> Is there a way, perhaps, to ensure that each input file matching the glob
> expression gets mapped to exactly one partition? Then you could probably
> get what you want using RDD.mapPartitions().
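>
> A rough sketch of that idea (assuming each file lands in exactly one
> partition, e.g. every file is smaller than an HDFS block and has at least
> four header lines; parse() is a made-up placeholder for your own parser):
>
> val lines = sc.textFile("hdfs://.../*.log")
> val records = lines.mapPartitions { iter =>
>   val head = List.fill(4)(iter.next())       // this partition's file header
>   iter.map(line => parse(head, line))        // parse() is hypothetical
> }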
>
> Nick
>
>
> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com>
> wrote:
>
>> Hi, all
>>
>> We are migrating from MapReduce to Spark and have encountered a problem.
>>
>> Our input files are IIS logs with a file head. It's easy to get the file
>> head if we process only one file, e.g.
>>
>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>> val head = lines.take(4)
>>
>> Then we can write our map method using this head.
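>>
>> For instance, a rough sketch of such a map, assuming the W3C-style
>> "#Fields:" line is present in the head and values are space-separated:
>>
>> val fields = head.find(_.startsWith("#Fields:"))
>>   .map(_.stripPrefix("#Fields:").trim.split(" "))
>>   .getOrElse(Array[String]())
>> val records = lines.filter(!_.startsWith("#"))
>>   .map(line => fields.zip(line.split(" ")).toMap)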
>>
>> However, if we input multiple files, each of which may have a different
>> file head, how can we get the file head for each partition?
>>
>> It seems we have two options:
>>
>> 1. still use textFile() to get lines.
>>
>> Since each partition may have a different "head", we would have to write a
>> mapPartitionsWithContext method. However, we can't find a way to get the
>> "head" for each partition.
>>
>> In our former MapReduce program, we could simply use
>>
>> Path path = ((FileSplit) context.getInputSplit()).getPath();
>>
>> but there seems to be no way to do this in Spark, since HadoopPartition,
>> which wraps the InputSplit inside HadoopRDD, is a private class.
>>
>> 2. use wholeTextFiles() to get whole file contents.
>>
>> It's easy to get the file head for each file, but according to the
>> documentation, this API is better suited to small files.
>>
>>
>> *Any suggestions on how to process these files with heads?*
>>
>
>
