Would it be right to assume that the silence on this topic implies others don't really have this issue/desire?
On Sat, 18 Jul 2015 at 17:24 Silas Davis <[email protected]> wrote:

> *tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple
> output files based on key, but the plain RDD interface doesn't seem to, and
> it should.*
>
> I have been looking into ways to write to multiple outputs in Spark. It
> seems like a feature that is somewhat missing from Spark.
>
> The idea is to partition output and write the elements of an RDD to
> different locations based on the key. For example, in a pair RDD your key
> may be (language, date, userId) and you would like to write separate files
> to $someBasePath/$language/$date. Then there would be a version of
> saveAsHadoopDataset able to write to multiple locations based on key using
> the underlying OutputFormat. Perhaps it would take a pair RDD with keys
> ($partitionKey, $realKey), so for example ((language, date), userId).
>
> The prior art I have found on this is the following.
>
> Using Spark SQL:
> The 'partitionBy' method of DataFrameWriter:
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
> This only works for Parquet at the moment.
>
> Using Spark/Hadoop:
> This pull request (with the hadoop1 API):
> https://github.com/apache/spark/pull/4895/files
> It uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat),
> part of the old hadoop1 API. It only works for text, but could be
> generalised to any underlying OutputFormat by using MultipleOutputFormat
> (though only for hadoop1, which doesn't support ParquetAvroOutputFormat,
> for example).
>
> This gist (with the hadoop2 API):
> https://gist.github.com/mlehman/df9546f6be2e362bbad2
> It uses MultipleOutputs (available for both the old and new Hadoop APIs)
> and extends saveAsNewAPIHadoopDataset to support multiple outputs. It
> should work for any underlying OutputFormat, and would probably be better
> implemented by extending saveAs[NewAPI]HadoopDataset.
>
> In Cascading:
> Cascading provides PartitionTap to do this:
> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
>
> So my questions are: is there a reason why Spark doesn't provide this? Does
> Spark provide similar functionality through some other mechanism? How would
> it best be implemented?
>
> Since I started composing this message I've had a go at writing a wrapper
> OutputFormat that writes multiple outputs using Hadoop's MultipleOutputs
> but doesn't require modification of PairRDDFunctions. The principle is
> similar, however. Again, it feels slightly hacky to use dummy fields for
> the ReduceContextImpl, but some of this may be part of the impedance
> mismatch between Spark and plain Hadoop. Here is my attempt:
> https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>
> I'd like to see this functionality in Spark somehow, but I invite
> suggestions on how best to achieve it.
>
> Thanks,
> Silas
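For the archives, here is roughly what the hadoop1 MultipleTextOutputFormat route mentioned above looks like in practice. This is a minimal sketch, not a tested implementation: the class name KeyBasedOutput, the example base path, and the (language, date) keys are all illustrative, and sc is assumed to be an existing SparkContext. The idea is to override generateFileNameForKeyValue so the key becomes a relative output path, and generateActualKey so the key is not written into the records:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Hypothetical output format: treats the key as a relative path, so a
// record keyed "en/2015-07-18" lands under $basePath/en/2015-07-18/.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {

  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name

  // Drop the key from the written records; only the value is emitted.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// Usage: flatten the (language, date) part of the key into a path prefix.
val pairs = sc.parallelize(Seq(
  (("en", "2015-07-18"), "user1"),
  (("fr", "2015-07-18"), "user2")))

pairs
  .map { case ((language, date), userId) => (s"$language/$date", userId) }
  .saveAsHadoopFile("/some/base/path",
    classOf[String], classOf[String], classOf[KeyBasedOutput])

As the quoted message notes, this only covers text output, since MultipleOutputFormat is hadoop1-only and so can't be reused for something like ParquetAvroOutputFormat.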
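For comparison, the Spark SQL route via DataFrameWriter.partitionBy, again a sketch assuming a DataFrame df with language and date columns (the column names and path are illustrative). As noted in the quoted message, in Spark 1.4 this only works for Parquet, but it does produce exactly the directory-per-key layout described:

// Writes Hive-style partition directories, e.g.
//   /some/base/path/language=en/date=2015-07-18/part-*.parquet
df.write
  .partitionBy("language", "date")
  .parquet("/some/base/path")

One difference from the desired $language/$date layout is that partitionBy encodes column names into the paths (language=en rather than en), which downstream Spark and Hive readers then recover as partition columns.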
