Cool, we will start from there. Thanks, Aaron and Josh!

Darin, it's likely because the DirectOutputCommitter is compiled with Hadoop 1 classes and you're running it with Hadoop 2: org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it became an interface in Hadoop 2, so bytecode compiled against one version is not binary-compatible with the other.
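Rebuilding the committer against the Hadoop 2 artifacts (the source itself usually needs no changes) should fix it. A minimal probe of the incompatibility, for illustration only (class name hypothetical):

    import org.apache.hadoop.mapred.JobContext;  // class in Hadoop 1, interface in Hadoop 2

    public class JobContextProbe {
        // This compiles under both Hadoop versions, but the emitted bytecode
        // differs: a method call on JobContext becomes invokevirtual when
        // JobContext is a class and invokeinterface when it is an interface.
        // A jar built against one major version therefore fails with
        // IncompatibleClassChangeError when run against the other.
        public static void touch(JobContext context) {
            context.getJobConf();  // getJobConf() exists in both versions
        }
    }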
Mingyu

On 2/23/15, 11:52 AM, "Darin McBeath" <ddmcbe...@yahoo.com.INVALID> wrote:

>Aaron, thanks for the class. Since I'm currently writing Java-based
>Spark applications, I tried converting your class to Java (it seemed
>pretty straightforward).
>
>I set up the use of the class as follows:
>
>SparkConf conf = new SparkConf()
>    .set("spark.hadoop.mapred.output.committer.class",
>         "com.elsevier.common.DirectOutputCommitter");
>
>And I then try to save a file to S3 (which I believe should use the old
>Hadoop APIs):
>
>JavaPairRDD<Text, Text> newBaselineRDDWritable =
>    reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
>newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile,
>    Text.class, Text.class, SequenceFileOutputFormat.class,
>    org.apache.hadoop.io.compress.GzipCodec.class);
>
>But I get the following error message:
>
>Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
>class org.apache.hadoop.mapred.JobContext, but interface was expected
>  at com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.java:68)
>  at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
>  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1075)
>  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:902)
>  at org.apache.spark.api.java.JavaPairRDD.saveAsHadoopFile(JavaPairRDD.scala:771)
>  at com.elsevier.spark.SparkSyncDedup.main(SparkSyncDedup.java:156)
>
>In my class, JobContext is the interface org.apache.hadoop.mapred.JobContext.
>
>Is there something obvious that I might be doing wrong (or messed up in
>the translation from Scala to Java), or something I should look into? I'm
>using Spark 1.2 with Hadoop 2.4.
>
>Thanks.
>
>Darin.
>
>________________________________
>
>From: Aaron Davidson <ilike...@gmail.com>
>To: Andrew Ash <and...@andrewash.com>
>Cc: Josh Rosen <rosenvi...@gmail.com>; Mingyu Kim <m...@palantir.com>;
>    "user@spark.apache.org" <user@spark.apache.org>;
>    Aaron Davidson <aa...@databricks.com>
>Sent: Saturday, February 21, 2015 7:01 PM
>Subject: Re: Which OutputCommitter to use for S3?
>
>Here is the class:
>https://gist.github.com/aarondav/c513916e72101bbe14ec
>
>You can use it by setting "mapred.output.committer.class" in the Hadoop
>configuration (or "spark.hadoop.mapred.output.committer.class" in the
>Spark configuration). Note that this only works for the old Hadoop APIs;
>I believe the new Hadoop APIs strongly tie the committer to the output
>format (so FileOutputFormat always uses FileOutputCommitter), which makes
>this fix more difficult to apply.
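To make the two wiring options above concrete, here is a sketch in Java (the committer class name com.example.hadoop.DirectOutputCommitter is hypothetical; substitute your own):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CommitterWiring {
        public static void main(String[] args) {
            // Option 1: via the Spark configuration. Spark strips the
            // "spark.hadoop." prefix and copies the remainder into the
            // Hadoop configuration used by jobs it launches.
            SparkConf conf = new SparkConf()
                .setAppName("committer-wiring")
                .set("spark.hadoop.mapred.output.committer.class",
                     "com.example.hadoop.DirectOutputCommitter");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Option 2: set the key directly on the underlying Hadoop
            // configuration before calling saveAsHadoopFile.
            Configuration hadoopConf = sc.hadoopConfiguration();
            hadoopConf.set("mapred.output.committer.class",
                "com.example.hadoop.DirectOutputCommitter");

            sc.stop();
        }
    }

Either way, the setting only takes effect for the old (mapred) API save paths, per Aaron's note above.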
>On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>Josh, is that class something you guys would consider open sourcing, or
>would you rather the community step up and create an OutputCommitter
>implementation optimized for S3?
>>
>>On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>
>>We (Databricks) use our own DirectOutputCommitter implementation, which
>>is a couple tens of lines of Scala code. The class would almost entirely
>>be a no-op, except we took some care to properly handle the _SUCCESS
>>file.
>>>
>>>On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <m...@palantir.com> wrote:
>>>
>>>I didn't get any response. It'd be really appreciated if anyone using a
>>>special OutputCommitter for S3 could comment on this!
>>>>
>>>>Thanks,
>>>>Mingyu
>>>>
>>>>From: Mingyu Kim <m...@palantir.com>
>>>>Date: Monday, February 16, 2015 at 1:15 AM
>>>>To: "user@spark.apache.org" <user@spark.apache.org>
>>>>Subject: Which OutputCommitter to use for S3?
>>>>
>>>>Hi all,
>>>>
>>>>The default OutputCommitter used by RDDs, FileOutputCommitter, seems to
>>>>require moving files at the commit step, which is not a constant-time
>>>>operation in S3, as discussed in
>>>>http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E
>>>>People seem to develop their own NullOutputCommitter implementation or
>>>>use DirectFileOutputCommitter (as mentioned in SPARK-3595), but I wanted
>>>>to check if there is a de facto standard, publicly available
>>>>OutputCommitter to use for S3 in conjunction with Spark.
>>>>
>>>>Thanks,
>>>>Mingyu
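For readers of the archive: a sketch of a no-op committer along the lines Josh describes, in Java. This is an illustration of his description, not Databricks' actual class (see Aaron's gist above for a real implementation); the package name is hypothetical:

    package com.example.hadoop;

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobContext;
    import org.apache.hadoop.mapred.OutputCommitter;
    import org.apache.hadoop.mapred.TaskAttemptContext;

    // Tasks write directly to the final output location, so there is nothing
    // to move or clean up; every step is a no-op except job commit, which
    // creates the _SUCCESS marker that downstream consumers poll for.
    public class DirectOutputCommitter extends OutputCommitter {
        @Override public void setupJob(JobContext context) {}
        @Override public void setupTask(TaskAttemptContext context) {}
        @Override public boolean needsTaskCommit(TaskAttemptContext context) {
            return false;  // no per-task commit (and thus no rename) needed
        }
        @Override public void commitTask(TaskAttemptContext context) {}
        @Override public void abortTask(TaskAttemptContext context) {}

        @Override
        public void commitJob(JobContext context) throws IOException {
            // The one piece of real work: write an empty _SUCCESS file into
            // the job's output directory, as FileOutputCommitter would.
            Path outputPath = FileOutputFormat.getOutputPath(context.getJobConf());
            if (outputPath != null) {
                FileSystem fs = outputPath.getFileSystem(context.getJobConf());
                fs.create(new Path(outputPath, "_SUCCESS")).close();
            }
        }
    }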