Hi,
You can also use any path pattern as defined here:
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
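A minimal sketch of building that brace pattern from a Python list (the `sc.textFile` call is left commented out, since it assumes a live SparkContext named `sc` and real files at those paths):

```python
# Build a Hadoop-style brace glob from a list of paths.
paths = ["/path/to/file1", "/path/to/file2"]
pattern = "{" + ",".join(paths) + "}"
print(pattern)  # {/path/to/file1,/path/to/file2}
# rdd = sc.textFile(pattern)  # assumes a live SparkContext `sc`
```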
Christophe.
On 29/04/2014 05:07, Nicholas Chammas wrote:
Not that I know of. We were discussing it on another thread and it came up.
I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll
see it mentioned there in the docs.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
But that's not obvious.
Nick
On Monday, April 28, 2014, Pat Ferrel <pat.ferrel@gmail.com> wrote:
Perfect.
BTW just so I know where to look next time, was that in some docs?
On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <nick.chammas@gmail.com> wrote:
Yep, as I just found out, you can also provide sc.textFile() with a
comma-delimited string of all the files you want to load.
For example:
sc.textFile('/path/to/file1,/path/to/file2')
So once you have your list of files, concatenate their paths like that and pass
the single string to textFile().
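As a small sketch of that concatenation step (pure Python; the `sc.textFile` call is commented out because it assumes a live SparkContext named `sc`):

```python
# Join a list of file paths into one comma-delimited string for textFile().
files = ["/path/to/file1", "/path/to/file2", "/path/to/file3"]
combined = ",".join(files)
print(combined)  # /path/to/file1,/path/to/file2,/path/to/file3
# rdd = sc.textFile(combined)  # assumes a live SparkContext `sc`
```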
Nick
On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
sc.textFile(URI) supports reading multiple files in parallel, but only via a wildcard. I need to walk a directory tree, match a regex to build a list of files, and then read them into a single RDD in parallel. I understand these could go into separate RDDs that are then combined with a union, but is there a way to create a single RDD directly from a list of URIs?
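One hedged sketch of the walk-and-match step described above, in Python: collect matching paths with `os.walk` and a regex, then hand the comma-joined result to textFile. The root directory and pattern are placeholders, and the final call assumes a live SparkContext named `sc`:

```python
import os
import re

def matching_paths(root, pattern):
    """Walk a directory tree and return sorted paths whose
    file names match the given regex pattern."""
    rx = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if rx.search(name):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)

# rdd = sc.textFile(",".join(matching_paths("/data/logs", r"\.log$")))
```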
________________________________
Kelkoo SAS
Société par Actions Simplifiée (simplified joint-stock company)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris
This message and its attachments are confidential and intended solely for their addressees. If you are not the intended recipient of this message, please delete it and notify the sender.