Not that I know of. We were discussing it on another thread and it came up.

I think if you look up the Hadoop FileInputFormat API (which Spark uses)
you'll see it mentioned there in the docs.

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html

But that's not obvious.
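For the archives, here's a minimal sketch of the pattern Pat asked about: walk a directory tree, keep the file names matching a regex, and hand the whole list to sc.textFile() as one comma-delimited string. The directory name and regex below are placeholders, and `sc` is assumed to be an existing SparkContext.

```python
import os
import re

def matching_paths(root, pattern):
    # Walk the directory tree under `root` and collect every file
    # whose name matches the given regex.
    regex = re.compile(pattern)
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if regex.search(name):
                paths.append(os.path.join(dirpath, name))
    return paths

# Join the paths with commas and load them all into a single RDD:
# rdd = sc.textFile(','.join(matching_paths('/path/to/data', r'\.log$')))
```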

Nick

On Monday, April 28, 2014, Pat Ferrel <pat.fer...@gmail.com> wrote:

> Perfect.
>
> BTW just so I know where to look next time, was that in some docs?
>
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
> Yep, as I just found out, you can also provide sc.textFile() with a
> comma-delimited string of all the files you want to load.
>
> For example:
>
> sc.textFile('/path/to/file1,/path/to/file2')
>
> So once you have your list of files, concatenate their paths like that and
> pass the single string to textFile().
>
> Nick
>
>
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel
> <pat.fer...@gmail.com> wrote:
>
>> sc.textFile(URI) supports reading multiple files in parallel but only
>> with a wildcard. I need to walk a dir tree, match a regex to create a list
>> of files, then I’d like to read them into a single RDD in parallel. I
>> understand these could go into separate RDDs then a union RDD can be
>> created. Is there a way to create a single RDD from a URI list?
>
