Aaron, thanks for the class. Since I'm currently writing Java-based Spark
applications, I tried converting your class to Java (it seemed pretty
straightforward).
I set up the use of the class as follows:
SparkConf conf = new SparkConf()
    .set("spark.hadoop.mapred.output.committer.class",
         "com.elsevier.common.DirectOutputCommitter");
I then try to save a file to S3 (which I believe should use the old Hadoop
APIs):
JavaPairRDD<Text, Text> newBaselineRDDWritable =
    reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile, Text.class,
    Text.class, SequenceFileOutputFormat.class,
    org.apache.hadoop.io.compress.GzipCodec.class);
But I get the following error message:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapred.JobContext, but interface was expected
    at com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.java:68)
    at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1075)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:902)
    at org.apache.spark.api.java.JavaPairRDD.saveAsHadoopFile(JavaPairRDD.scala:771)
    at com.elsevier.spark.SparkSyncDedup.main(SparkSyncDedup.java:156)
In my class, JobContext is an interface of type
org.apache.hadoop.mapred.JobContext.
Is there something obvious that I might be doing wrong (or messed up in the
translation from Scala to Java), or something I should look into? I'm using
Spark 1.2 with Hadoop 2.4.
Thanks.
Darin.
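
A note on the error above: an IncompatibleClassChangeError of the form
"Found class X, but interface was expected" generally means the calling code
was compiled against a version of X that was an interface, while the X loaded
at runtime is a class. One plausible cause here, offered as an assumption
rather than a confirmed diagnosis, is that the Hadoop jars on Spark's runtime
classpath differ from the ones DirectOutputCommitter was compiled against.
The sketch below is one way to check what the runtime actually sees.
ClassKindCheck, describe, and origin are hypothetical names used only for
illustration; they are not part of Spark or Hadoop.

```java
// Diagnostic sketch: report whether a type name resolves to an interface or a
// class on the current classpath, and where it was loaded from.
// ClassKindCheck, describe() and origin() are illustrative names only.
final class ClassKindCheck {

    // Returns "interface", "class", or "not found" for the given type name.
    static String describe(String className) {
        try {
            // initialize=false: inspect the type without running static initializers
            Class<?> c = Class.forName(className, false,
                    ClassKindCheck.class.getClassLoader());
            return c.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found";
        }
    }

    // Returns the jar or directory the type was loaded from, if known.
    static String origin(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                    ClassKindCheck.class.getClassLoader());
            java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source
            return src == null ? "unknown" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found";
        }
    }

    public static void main(String[] args) {
        // The type named in the stack trace above
        String name = "org.apache.hadoop.mapred.JobContext";
        System.out.println(name + " -> " + describe(name)
                + " (loaded from " + origin(name) + ")");
    }
}
```

Running this with the same classpath as the Spark driver (for example, by
calling describe from main just before the save) shows which kind of type is
actually loaded. If it reports "class", the Hadoop classes at runtime
probably do not match the Hadoop 2.4 jars used at compile time, and the
reported location may indicate which jar is supplying them.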
________________________________
From: Aaron Davidson <[email protected]>
To: Andrew Ash <[email protected]>
Cc: Josh Rosen <[email protected]>; Mingyu Kim <[email protected]>;
"[email protected]" <[email protected]>; Aaron Davidson
<[email protected]>
Sent: Saturday, February 21, 2015 7:01 PM
Subject: Re: Which OutputCommitter to use for S3?
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec
You can use it by setting "mapred.output.committer.class" in the Hadoop
configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark
configuration). Note that this only works for the old Hadoop APIs. I believe
the new Hadoop APIs strongly tie the committer to the output format (so
FileOutputFormat always uses FileOutputCommitter), which makes this fix more
difficult to apply.
On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <[email protected]> wrote:
Josh, is that class something you guys would consider open sourcing, or would
you rather the community step up and create an OutputCommitter implementation
optimized for S3?
>
>
>On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <[email protected]> wrote:
>
>We (Databricks) use our own DirectOutputCommitter implementation, which is a
>couple tens of lines of Scala code. The class would almost entirely be a
>no-op except we took some care to properly handle the _SUCCESS file.
>>
>>
>>On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <[email protected]> wrote:
>>
>>I didn’t get any response. It’d be really appreciated if anyone using a
>>special OutputCommitter for S3 can comment on this!
>>>
>>>
>>>Thanks,
>>>Mingyu
>>>
>>>
>>>From: Mingyu Kim <[email protected]>
>>>Date: Monday, February 16, 2015 at 1:15 AM
>>>To: "[email protected]" <[email protected]>
>>>Subject: Which OutputCommitter to use for S3?
>>>
>>>
>>>
>>>Hi all,
>>>
>>>
>>>The default OutputCommitter used by RDD, which is FileOutputCommitter, seems
>>>to require moving files at the commit step, which is not a constant-time
>>>operation in S3, as discussed in
>>>http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%[email protected]%3E.
>>>People seem to develop their own NullOutputCommitter implementation or use
>>>DirectFileOutputCommitter (as mentioned in SPARK-3595), but I wanted to
>>>check if there is a de facto standard, publicly available OutputCommitter to
>>>use for S3 in conjunction with Spark.
>>>
>>>
>>>Thanks,
>>>Mingyu
>>
>