Would it be right to assume that the silence on this topic implies others don't really have this issue/desire?
On Sat, 18 Jul 2015 at 17:24 Silas Davis <[email protected]> wrote:

> *tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple
> output files based on key, but the plain RDD interface doesn't seem to, and
> it should.*
>
> I have been looking into ways to write to multiple outputs in Spark. It
> seems like a feature that is somewhat missing from Spark.
>
> The idea is to partition output and write the elements of an RDD to
> different locations based on the key. For example, in a pair RDD your key
> may be (language, date, userId) and you would like to write separate files
> to $someBasePath/$language/$date. Then there would be a version of
> saveAsHadoopDataset able to write to multiple locations based on key using
> the underlying OutputFormat. Perhaps it would take a pair RDD with keys
> ($partitionKey, $realKey), so for example ((language, date), userId).
>
> The prior art I have found on this is the following.
>
> Using Spark SQL:
> The 'partitionBy' method of DataFrameWriter:
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
> This only works for Parquet at the moment.
>
> Using Spark/Hadoop:
> This pull request (with the hadoop1 API):
> https://github.com/apache/spark/pull/4895/files
> It uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat),
> part of the old hadoop1 API. It only works for text, but could be
> generalised to any underlying OutputFormat by using MultipleOutputFormat
> (though only for hadoop1, which doesn't support ParquetAvroOutputFormat,
> for example).
>
> This gist (with the hadoop2 API):
> https://gist.github.com/mlehman/df9546f6be2e362bbad2
> It uses MultipleOutputs (available for both the old and new Hadoop APIs)
> and extends saveAsNewAPIHadoopDataset to support multiple outputs. It
> should work for any underlying OutputFormat, and would probably be better
> implemented by extending saveAs[NewAPI]HadoopDataset.
>
> In Cascading:
> Cascading provides PartitionTap to do this:
> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
>
> So my questions are: is there a reason why Spark doesn't provide this? Does
> Spark provide similar functionality through some other mechanism? How would
> it best be implemented?
>
> Since I started composing this message I've had a go at writing a wrapper
> OutputFormat that writes multiple outputs using Hadoop's MultipleOutputs
> but doesn't require modification of PairRDDFunctions. The principle is
> similar, however. Again, it feels slightly hacky to use dummy fields for
> the ReduceContextImpl, but some of this may be part of the impedance
> mismatch between Spark and plain Hadoop. Here is my attempt:
> https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>
> I'd like to see this functionality in Spark somehow, but I invite
> suggestions on how best to achieve it.
>
> Thanks,
> Silas
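For the archives, here is roughly what the hadoop1 MultipleTextOutputFormat route mentioned above looks like in practice. This is a minimal sketch, not a tested implementation: the class name KeyBasedOutput, the example base path, and the (language, date) keys are all illustrative, and sc is assumed to be an existing SparkContext. The idea is to override generateFileNameForKeyValue so the key becomes a relative output path, and generateActualKey so the key is not written into the records:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Hypothetical output format: treats the key as a relative path, so a
// record keyed "en/2015-07-18" lands under $basePath/en/2015-07-18/.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {

  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name

  // Drop the key from the written records; only the value is emitted.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// Usage: flatten the (language, date) part of the key into a path prefix.
val pairs = sc.parallelize(Seq(
  (("en", "2015-07-18"), "user1"),
  (("fr", "2015-07-18"), "user2")))

pairs
  .map { case ((language, date), userId) => (s"$language/$date", userId) }
  .saveAsHadoopFile("/some/base/path",
    classOf[String], classOf[String], classOf[KeyBasedOutput])

As the quoted message notes, this only covers text output, since MultipleOutputFormat is hadoop1-only and so can't be reused for something like ParquetAvroOutputFormat.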
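For comparison, the Spark SQL route via DataFrameWriter.partitionBy, again a sketch assuming a DataFrame df with language and date columns (the column names and path are illustrative). As noted in the quoted message, in Spark 1.4 this only works for Parquet, but it does produce exactly the directory-per-key layout described:

// Writes Hive-style partition directories, e.g.
//   /some/base/path/language=en/date=2015-07-18/part-*.parquet
df.write
  .partitionBy("language", "date")
  .parquet("/some/base/path")

One difference from the desired $language/$date layout is that partitionBy encodes column names into the paths (language=en rather than en), which downstream Spark and Hive readers then recover as partition columns.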
