Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks, this really helps. As long as I stick to HDFS paths and files, I’m good. I do know that code a bit but have never used it to, say, take input from one cluster via “hdfs://server:port/path” and output to another via “hdfs://another-server:another-port/path”. This seems to be supported by Spark…
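
A minimal sketch of that cross-cluster read/write, assuming spark-shell (so sc already exists) and the placeholder hostnames, ports, and paths from the message:

    // Hypothetical hosts, ports, and paths; sc is the spark-shell SparkContext.
    val input = sc.textFile("hdfs://server:8020/path/input")
    // Fully qualified URIs let the output land on a different cluster.
    input.saveAsTextFile("hdfs://another-server:8020/path/output")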

Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths()
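
That delegation is why a glob pattern and a comma-separated list both end up as a single RDD; a short illustration, assuming spark-shell and made-up paths:

    // The path string is split on commas, and glob patterns are expanded
    // when Hadoop lists the input, so both calls read several files into one RDD.
    val byGlob = sc.textFile("/data/logs/2014-04-*.log")
    val byList = sc.textFile("/data/logs/day1.log,/data/logs/day2.log")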

Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I’m unclear about how much to assume about using the Hadoop file system APIs and conventions. Concretely, if I pass a pattern in with an HTTPS file system, will the pattern work? How…

Re: File list read into single RDD

2014-04-29 Thread Christophe Préaud
Hi, You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 e.g.: sc.textFile('{/path/to/file1,/path/to/file2}') Christophe.
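
The glob syntax on that globStatus page covers more than brace alternation; a couple of hedged examples with made-up paths, again assuming spark-shell:

    // ? and * wildcards, [a-b] character ranges, and {alt1,alt2} alternation are all accepted.
    val months = sc.textFile("hdfs:///data/{2014/04,2014/05}/part-*")
    val oneChar = sc.textFile("/logs/app-?.log")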

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs. http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html But that's not obvious…

Re: File list read into single RDD

2014-04-28 Thread Pat Ferrel
Perfect. BTW, just so I know where to look next time, was that in some docs?

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example: sc.textFile('/path/to/file1,/path/to/file2'). So once you have your list of files, concatenate their paths like that and pass the single string to textFile().
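
In code, that concatenation is just a join over whatever list you built; placeholder paths, spark-shell assumed:

    // Join the paths into one comma-separated string and read them as a single RDD.
    val files = Seq("/path/to/file1", "/path/to/file2", "/path/to/file3")
    val combined = sc.textFile(files.mkString(","))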

File list read into single RDD

2014-04-28 Thread Pat Ferrel
sc.textFile(URI) supports reading multiple files in parallel, but only with a wildcard. I need to walk a dir tree and match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs and then a union RDD could be created. Is…
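
One way to do that walk, sketched with a hypothetical namenode, root path, and regex, and assuming spark-shell so sc already exists:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ArrayBuffer

    // Walk the tree on the driver with the Hadoop FileSystem API, keep the
    // paths whose file names match a regex, then hand the whole list to
    // textFile() as one comma-separated string.
    val fs = FileSystem.get(new java.net.URI("hdfs://server:8020"), new Configuration())
    val pattern = ".*\\.log$".r  // hypothetical match rule

    val matched = ArrayBuffer[String]()
    val it = fs.listFiles(new Path("/path/to/root"), true)  // true = recursive
    while (it.hasNext) {
      val status = it.next()
      if (pattern.findFirstIn(status.getPath.getName).isDefined)
        matched += status.getPath.toString
    }

    // One RDD over all the matched files, read in parallel; no per-file RDDs or union needed.
    val rdd = sc.textFile(matched.mkString(","))

The comma-joined form sidesteps building many small RDDs and unioning them afterwards.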