Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Davies,
That’s pretty neat. I heard there was a pure Python clone of Spark out there, so you were one of the people behind it!
I’ve created a JIRA issue about this. SPARK-3533: Add saveAsTextFileByKey() method to RDDs
Sean, I think you might be

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Davies Liu
Maybe we should provide an API like saveTextFilesByKey(path). Could you create a JIRA for it? There is one in DPark [1] actually.
[1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309
On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas wrote:
> Any tips from anybody on how to do thi

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Sean Owen
AFAIK there is no direct equivalent in Spark. You can cache or persist an RDD, and then run N separate operations to output different things from it, which is pretty close. I think you might be able to get this working with a subclass of MultipleTextOutputFormat, which overrides generateFileNameF
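
The first suggestion above (cache or persist the RDD, then run N separate output operations) might look roughly like this in PySpark. This is only a sketch under stated assumptions: `save_by_key` is a hypothetical helper name, not a Spark API, and it assumes the RDD holds (key, value) pairs, that the distinct keys fit in driver memory, and that each key is safe to use as a directory name.

```python
def save_by_key(rdd, base_path):
    """Write each key's values to its own directory under base_path.

    Hypothetical helper, not part of Spark. Assumes `rdd` is a PySpark RDD
    of (key, value) pairs with a small number of distinct keys.
    """
    # Cache so the N passes below reuse the data instead of recomputing it.
    rdd = rdd.cache()

    # Collect the distinct keys to the driver; one output pass runs per key.
    keys = rdd.keys().distinct().collect()

    for key in keys:
        # Bind `key` as a default argument so the lambda captures this
        # iteration's value rather than the loop variable itself.
        values = rdd.filter(lambda kv, k=key: kv[0] == k).values()
        # saveAsTextFile writes a directory of part files at the given path.
        values.saveAsTextFile("{}/{}".format(base_path, key))

    return keys
```

With a live SparkContext `sc`, this could be called as `save_by_key(sc.parallelize([("a", 1), ("b", 2), ("a", 3)]), "/tmp/out")`. Note that it scans the cached RDD once per key, so it is "pretty close" to a single pass only when the number of keys is small.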

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Any tips from anybody on how to do this in PySpark? (Or regular Spark, for that matter.)
On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas wrote:
> Howdy doody Spark Users,
>
> I’d like to somehow write out a single RDD to multiple paths in one go.
> Here’s an example.
>
> I have an RDD of (key, val