Yes, unfortunately that direct dependency makes this injection much more difficult for saveAsParquetFile.
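For readers landing here later: the subclassing itself is straightforward; the hard part, as this reply notes, is getting Spark SQL to instantiate the subclass, because ParquetOutputCommitter is wired in directly rather than read from configuration. For anyone patching this in themselves, a rough sketch of such a subclass — untested, the class name is illustrative, and it assumes the pre-Apache parquet-mr package (parquet.hadoop) bundled with Spark 1.2/1.3:

    // Sketch only: a ParquetOutputCommitter whose commit phase is a no-op.
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
    import parquet.hadoop.ParquetOutputCommitter

    class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
        extends ParquetOutputCommitter(outputPath, context) {

      // Tasks write straight to the final output path instead of a
      // _temporary directory, so there is nothing to rename at commit time.
      override def getWorkPath: Path = outputPath

      override def setupJob(jobContext: JobContext): Unit = ()
      override def setupTask(taskContext: TaskAttemptContext): Unit = ()
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = ()
      override def abortTask(taskContext: TaskAttemptContext): Unit = ()

      // Skipping commitJob also skips the _SUCCESS marker and Parquet's
      // _metadata summary file; re-add them here if anything downstream
      // depends on them.
      override def commitJob(jobContext: JobContext): Unit = ()
    }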
On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee <pl...@appier.com> wrote:

> Thanks for the DirectOutputCommitter example.
> However I found it only works for saveAsHadoopFile. What about
> saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter,
> which is a subclass of FileOutputCommitter.
>
> On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor <thomas.dem...@amplidata.com> wrote:
>
>> FYI. We're currently addressing this at the Hadoop level in
>> https://issues.apache.org/jira/browse/HADOOP-9565
>>
>> Thomas Demoor
>>
>> On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
>>
>>> Just to close the loop in case anyone runs into the same problem I had.
>>>
>>> By setting --hadoop-major-version=2 when using the ec2 scripts,
>>> everything worked fine.
>>>
>>> Darin.
>>>
>>> ----- Original Message -----
>>> From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
>>> To: Mingyu Kim <m...@palantir.com>; Aaron Davidson <ilike...@gmail.com>
>>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>>> Sent: Monday, February 23, 2015 3:16 PM
>>> Subject: Re: Which OutputCommitter to use for S3?
>>>
>>> Thanks. I think my problem might actually be the other way around.
>>>
>>> I'm compiling with Hadoop 2, but when I start up Spark using the ec2
>>> scripts, I don't specify a --hadoop-major-version, and the default is 1.
>>> I'm guessing that if I make that a 2, it might work correctly. I'll try
>>> it and post a response.
>>>
>>> ----- Original Message -----
>>> From: Mingyu Kim <m...@palantir.com>
>>> To: Darin McBeath <ddmcbe...@yahoo.com>; Aaron Davidson <ilike...@gmail.com>
>>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>>> Sent: Monday, February 23, 2015 3:06 PM
>>> Subject: Re: Which OutputCommitter to use for S3?
>>>
>>> Cool, we will start from there. Thanks Aaron and Josh!
>>>
>>> Darin, it's likely because the DirectOutputCommitter is compiled with
>>> Hadoop 1 classes and you're running it with Hadoop 2.
>>> org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and
>>> it became an interface in Hadoop 2.
>>>
>>> Mingyu
>>>
>>> On 2/23/15, 11:52 AM, "Darin McBeath" <ddmcbe...@yahoo.com.INVALID> wrote:
>>>
>>>> Aaron. Thanks for the class. Since I'm currently writing Java-based
>>>> Spark applications, I tried converting your class to Java (it seemed
>>>> pretty straightforward).
>>>>
>>>> I set up the use of the class as follows:
>>>>
>>>>     SparkConf conf = new SparkConf()
>>>>       .set("spark.hadoop.mapred.output.committer.class",
>>>>            "com.elsevier.common.DirectOutputCommitter");
>>>>
>>>> And I then try to save a file to S3 (which I believe should use the
>>>> old Hadoop APIs).
>>>>
>>>>     JavaPairRDD<Text, Text> newBaselineRDDWritable =
>>>>       reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
>>>>     newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile,
>>>>       Text.class, Text.class, SequenceFileOutputFormat.class,
>>>>       org.apache.hadoop.io.compress.GzipCodec.class);
>>>>
>>>> But I get the following error message:
>>>>
>>>>     Exception in thread "main" java.lang.IncompatibleClassChangeError:
>>>>     Found class org.apache.hadoop.mapred.JobContext, but interface was expected
>>>>       at com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.java:68)
>>>>       at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
>>>>       at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1075)
>>>>       at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>>>>       at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:902)
>>>>       at org.apache.spark.api.java.JavaPairRDD.saveAsHadoopFile(JavaPairRDD.scala:771)
>>>>       at com.elsevier.spark.SparkSyncDedup.main(SparkSyncDedup.java:156)
>>>>
>>>> In my class, JobContext is an interface of type
>>>> org.apache.hadoop.mapred.JobContext.
>>>>
>>>> Is there something obvious that I might be doing wrong (or messed up in
>>>> the translation from Scala to Java) or something I should look into? I'm
>>>> using Spark 1.2 with Hadoop 2.4.
>>>>
>>>> Thanks.
>>>>
>>>> Darin.
>>>>
>>>> ________________________________
>>>>
>>>> From: Aaron Davidson <ilike...@gmail.com>
>>>> To: Andrew Ash <and...@andrewash.com>
>>>> Cc: Josh Rosen <rosenvi...@gmail.com>; Mingyu Kim <m...@palantir.com>;
>>>> "user@spark.apache.org" <user@spark.apache.org>; Aaron Davidson <aa...@databricks.com>
>>>> Sent: Saturday, February 21, 2015 7:01 PM
>>>> Subject: Re: Which OutputCommitter to use for S3?
>>>>
>>>> Here is the class:
>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>
>>>> You can use it by setting "mapred.output.committer.class" in the Hadoop
>>>> configuration (or "spark.hadoop.mapred.output.committer.class" in the
>>>> Spark configuration). Note that this only works for the old Hadoop APIs;
>>>> I believe the new Hadoop APIs strongly tie the committer to the output
>>>> format (so FileOutputFormat always uses FileOutputCommitter), which makes
>>>> this fix more difficult to apply.
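Gist links tend to rot, so here is a minimal sketch of what such a direct committer looks like against the old (mapred) API — illustrative and untested; the class name and _SUCCESS handling are assumptions modeled on the descriptions in this thread (everything a no-op except the _SUCCESS marker), and it must be compiled against the same Hadoop major version you run, for the JobContext reason discussed above:

    // Sketch of a no-op committer for the old mapred API: tasks write
    // directly to the final location, so nothing is staged or renamed.
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileOutputFormat, JobContext, OutputCommitter, TaskAttemptContext}

    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = ()
      override def setupTask(taskContext: TaskAttemptContext): Unit = ()
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = ()
      override def abortTask(taskContext: TaskAttemptContext): Unit = ()

      // The one piece of real work: create the _SUCCESS marker that
      // downstream consumers often poll for, as FileOutputCommitter would.
      override def commitJob(context: JobContext): Unit = {
        val conf = context.getJobConf
        val outputPath = FileOutputFormat.getOutputPath(conf)
        if (outputPath != null &&
            conf.getBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", true)) {
          val fs = outputPath.getFileSystem(conf)
          fs.create(new Path(outputPath, "_SUCCESS")).close()
        }
      }
    }

It is enabled via spark.hadoop.mapred.output.committer.class, as in Darin's snippet above. The trade-off to keep in mind: with no temporary directory there is no atomic commit, so a failed or speculatively re-run task can leave partial files in the final location.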
>>>>
>>>> On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>>> Josh, is that class something you guys would consider open sourcing, or
>>>>> would you rather the community step up and create an OutputCommitter
>>>>> implementation optimized for S3?
>>>>>
>>>>> On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>>>>
>>>>>> We (Databricks) use our own DirectOutputCommitter implementation, which
>>>>>> is a few dozen lines of Scala code. The class is almost entirely a
>>>>>> no-op, except that we took some care to properly handle the _SUCCESS
>>>>>> file.
>>>>>>
>>>>>> On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <m...@palantir.com> wrote:
>>>>>>
>>>>>>> I didn't get any response. It'd be really appreciated if anyone using
>>>>>>> a special OutputCommitter for S3 could comment on this!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mingyu
>>>>>>>
>>>>>>> From: Mingyu Kim <m...@palantir.com>
>>>>>>> Date: Monday, February 16, 2015 at 1:15 AM
>>>>>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>>>>>> Subject: Which OutputCommitter to use for S3?
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> The default OutputCommitter used by RDDs, which is FileOutputCommitter,
>>>>>>> seems to require moving files at the commit step, which is not a
>>>>>>> constant-time operation on S3, as discussed in
>>>>>>> http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E.
>>>>>>> People seem to develop their own NullOutputCommitter implementation or
>>>>>>> use DirectFileOutputCommitter (as mentioned in SPARK-3595), but I
>>>>>>> wanted to check if there is a de facto standard, publicly available
>>>>>>> OutputCommitter to use for S3 in conjunction with Spark.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mingyu
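To make the cost Mingyu describes concrete: FileOutputCommitter commits by renaming files out of a _temporary directory, and S3 has no native rename, so the Hadoop S3 filesystems emulate it with a full copy followed by a delete. A schematic sketch (bucket name and paths are placeholders, and the exact _temporary layout varies across Hadoop versions):

    // Two-phase commit performed by FileOutputCommitter, schematically:
    //   task writes : out/_temporary/0/_temporary/attempt_.../part-00000
    //   commitTask  : rename(...)  -> out/_temporary/0/task_.../part-00000
    //   commitJob   : rename(...)  -> out/part-00000
    // On HDFS each rename is a cheap metadata operation; on S3 it is a
    // copy of every byte plus a delete, i.e. O(output size).
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object RenameCostDemo {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new java.net.URI("s3n://my-bucket/"), new Configuration())
        // On an S3 filesystem this line copies the object, then deletes
        // the source -- time proportional to the file's size.
        val ok = fs.rename(
          new Path("/out/_temporary/0/task_000000/part-00000"),
          new Path("/out/part-00000"))
        println(s"rename returned $ok")
      }
    }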