Re: processing s3n:// files in parallel

2014-04-30 Thread foundart
Thanks, Andrew. As it turns out, the tasks were getting processed in parallel in separate threads on the same node. Using a parallel collection of Hadoop files was sufficient to trigger that, but my expectation that the tasks would be spread across nodes rather than across cores on a single node led me
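
A minimal Scala sketch of the driver-side pattern described above, with hypothetical bucket and file names: a parallel collection of paths submits one job per file from separate driver threads, while where the resulting tasks actually run is still up to the scheduler and the available executors.

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelS3Counts {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-s3-counts"))

        // Hypothetical S3 paths; substitute however you enumerate your bucket.
        val paths = Seq(
          "s3n://example-bucket/input/file1",
          "s3n://example-bucket/input/file2",
          "s3n://example-bucket/input/file3")

        // .par makes this a Scala parallel collection, so each count() below is
        // submitted as its own Spark job from a separate driver thread. The
        // scheduler, not this collection, decides which nodes run the tasks.
        paths.par.foreach { path =>
          println(s"$path: ${sc.textFile(path).count()} lines")
        }

        sc.stop()
      }
    }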

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
Oh snap! I didn’t know that! Confirmed that both the wildcard syntax and the comma-separated syntax work in PySpark. For example: sc.textFile('s3n://file1,s3n://file2').count() Art, would this approach work for you? It would let you load your 3 files into a single RDD, which your workers could

Re: processing s3n:// files in parallel

2014-04-28 Thread Matei Zaharia
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited from FileInputFormat in Hadoop. Matei
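
A quick Scala illustration of the two forms Matei mentions, assuming an existing SparkContext named sc and hypothetical bucket/file names; both strings are passed straight to textFile, which hands path expansion to Hadoop's FileInputFormat.

    // Wildcard (glob) form: every object matching the pattern ends up in one RDD.
    val globbed = sc.textFile("s3n://example-bucket/logs/file1*")

    // Comma-separated form: an explicit list of paths, still a single RDD.
    val listed = sc.textFile(
      "s3n://example-bucket/logs/file1,s3n://example-bucket/logs/file2")

    println(s"glob: ${globbed.count()} lines, list: ${listed.count()} lines")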

Re: processing s3n:// files in parallel

2014-04-28 Thread Andrew Ash
This is already possible with the sc.textFile("/path/to/filedir/*") call. Does that work for you?

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
It would be useful to have some way to open multiple files at once into a single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be equivalent to opening a single file which is made by concatenating the various files together. This would only be useful, of course, if the source file
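
Given the comma-separated path support mentioned earlier in the thread, something close to sc.textFile(iterable_over_uris) can be had today by joining the URIs yourself. A minimal Scala sketch, assuming an existing SparkContext sc and hypothetical paths:

    // Hypothetical iterable of S3 URIs.
    val uris = Seq(
      "s3n://example-bucket/a/part-00000",
      "s3n://example-bucket/b/part-00000",
      "s3n://example-bucket/c/part-00000")

    // Joining with commas yields one RDD over all of the files, which is the
    // behaviour a hypothetical sc.textFile(iterable_over_uris) would give.
    val combined = sc.textFile(uris.mkString(","))
    println(combined.count())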

Re: processing s3n:// files in parallel

2014-04-28 Thread Andrew Ash
The way you've written it there, I would expect those to be serial runs. The reason is that you're looping over the matches with a driver-level map, which runs serially on the driver. Then calling foreachWith on the RDDs executes the action in a blocking way, so you don't get a result back until the cluster finishes.
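
To make the serial behaviour concrete, here is a stripped-down Scala sketch of the pattern being described (not the original code; count() stands in for foreachWith as the blocking action, and the paths are hypothetical), assuming an existing SparkContext sc.

    // Hypothetical matches: one S3 path per file to process.
    val matches = Seq(
      "s3n://example-bucket/input/file1",
      "s3n://example-bucket/input/file2")

    // A plain (non-parallel) driver-level loop: each count() blocks until the
    // whole job for that file finishes, so the files are processed one after
    // another even though each job's tasks run on the cluster.
    matches.foreach { path =>
      println(s"$path: ${sc.textFile(path).count()} lines")
    }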