Aaron, thanks for the class. Since I'm currently writing Java-based Spark 
applications, I tried converting your class to Java (it seemed pretty 
straightforward). 

I set up the use of the class as follows:

SparkConf conf = new SparkConf()
.set("spark.hadoop.mapred.output.committer.class", 
"com.elsevier.common.DirectOutputCommitter");

I then try to save a file to S3 (which I believe should use the old Hadoop 
APIs).

JavaPairRDD<Text, Text> newBaselineRDDWritable = 
reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile, Text.class, 
Text.class, SequenceFileOutputFormat.class, 
org.apache.hadoop.io.compress.GzipCodec.class);
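For context, ConvertToWritableTypes is just a small helper of my own; a 
minimal sketch of it, assuming the upstream RDD is a 
JavaPairRDD<String, String>, would look something like this:

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Sketch of ConvertToWritableTypes (hypothetical shape): wraps each
// String pair in Hadoop Text writables so the pairs can be written out
// as a SequenceFile via saveAsHadoopFile.
public class ConvertToWritableTypes
    implements PairFunction<Tuple2<String, String>, Text, Text> {
  @Override
  public Tuple2<Text, Text> call(Tuple2<String, String> kv) {
    return new Tuple2<>(new Text(kv._1()), new Text(kv._2()));
  }
}
```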

But, I get the following error message.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class 
org.apache.hadoop.mapred.JobContext, but interface was expected
at 
com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.java:68)
at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1075)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:902)
at org.apache.spark.api.java.JavaPairRDD.saveAsHadoopFile(JavaPairRDD.scala:771)
at com.elsevier.spark.SparkSyncDedup.main(SparkSyncDedup.java:156)

In my class, JobContext is the interface 
org.apache.hadoop.mapred.JobContext.

Is there something obvious that I might be doing wrong (or messed up in the 
translation from Scala to Java), or something I should look into?  I'm using 
Spark 1.2 with Hadoop 2.4.


Thanks.

Darin.


________________________________


From: Aaron Davidson <ilike...@gmail.com>
To: Andrew Ash <and...@andrewash.com> 
Cc: Josh Rosen <rosenvi...@gmail.com>; Mingyu Kim <m...@palantir.com>; 
"user@spark.apache.org" <user@spark.apache.org>; Aaron Davidson 
<aa...@databricks.com> 
Sent: Saturday, February 21, 2015 7:01 PM
Subject: Re: Which OutputCommitter to use for S3?



Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec

You can use it by setting "mapred.output.committer.class" in the Hadoop 
configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark 
configuration). Note that this only works for the old Hadoop APIs; I believe 
the new Hadoop APIs strongly tie the committer to the output format (so 
FileOutputFormat always uses FileOutputCommitter), which makes this fix more 
difficult to apply.
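For anyone who can't reach the gist, the shape of the class against the old 
mapred API is roughly as below (a sketch only; the _SUCCESS handling in the 
real class is a bit more careful):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobContext;
import org.apache.hadoop.mapred.OutputCommitter;
import org.apache.hadoop.mapred.TaskAttemptContext;

// Minimal sketch of a "direct" committer for the old mapred API: tasks
// write straight to the final output location, so per-task commit/abort
// are no-ops, and job commit only drops the _SUCCESS marker.
public class DirectOutputCommitter extends OutputCommitter {
  @Override public void setupJob(JobContext jobContext) { }
  @Override public void setupTask(TaskAttemptContext taskContext) { }

  // Nothing is staged, so no task ever needs a commit step.
  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) {
    return false;
  }
  @Override public void commitTask(TaskAttemptContext taskContext) { }
  @Override public void abortTask(TaskAttemptContext taskContext) { }

  // Write the _SUCCESS marker that downstream consumers may look for.
  @Override public void commitJob(JobContext jobContext) throws IOException {
    Path outputPath = FileOutputFormat.getOutputPath(jobContext.getJobConf());
    if (outputPath != null) {
      FileSystem fs = outputPath.getFileSystem(jobContext.getJobConf());
      fs.create(new Path(outputPath, "_SUCCESS")).close();
    }
  }
}
```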




On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <and...@andrewash.com> wrote:

Josh is that class something you guys would consider open sourcing, or would 
you rather the community step up and create an OutputCommitter implementation 
optimized for S3?
>
>
>On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>We (Databricks) use our own DirectOutputCommitter implementation, which is a 
>couple dozen lines of Scala code.  The class is almost entirely a no-op, 
>except that we took some care to properly handle the _SUCCESS file.
>>
>>
>>On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <m...@palantir.com> wrote:
>>
>>I didn’t get any response. It’d be really appreciated if anyone using a 
>>special OutputCommitter for S3 could comment on this!
>>>
>>>
>>>Thanks,
>>>Mingyu
>>>
>>>
>>>From: Mingyu Kim <m...@palantir.com>
>>>Date: Monday, February 16, 2015 at 1:15 AM
>>>To: "user@spark.apache.org" <user@spark.apache.org>
>>>Subject: Which OutputCommitter to use for S3?
>>>
>>>
>>>
>>>Hi all,
>>>
>>>
>>>The default OutputCommitter used by RDDs, which is FileOutputCommitter, 
>>>seems to require moving files at the commit step, which is not a 
>>>constant-time operation on S3 (a rename amounts to a copy followed by a 
>>>delete), as discussed in 
>>>http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E.
>>>People seem to develop their own NullOutputCommitter implementation or use 
>>>DirectFileOutputCommitter (as mentioned in SPARK-3595), but I wanted to 
>>>check whether there is a de facto standard, publicly available 
>>>OutputCommitter to use for S3 in conjunction with Spark.
>>>
>>>
>>>Thanks,
>>>Mingyu
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
