Use RDD.mapPartitions to go over all the items in a partition with one Mapper object. It will look something like this:
rdd.mapPartitions { iterator =>
  val mapper = new myown.Mapper()
  mapper.configure(conf)
  val output = // create an OutputCollector that stores the results in an ArrayBuffer
  for ((key, value) <- iterator) {
    mapper.map(key, value, output, Reporter.NULL)
  }
  output
}

(A fuller sketch of an ArrayBuffer-backed OutputCollector is at the end of this message.)

On Jun 5, 2014, at 8:12 AM, Wei Tan <w...@us.ibm.com> wrote:

> Thanks Matei.
>
> Using your pointers I can import data from HDFS. What I want to do now is
> something like this in Spark:
>
> -----------------------
> import myown.mapper
>
> rdd.map(mapper.map)
> -----------------------
>
> The reason I want this: myown.mapper is a Java class I already developed.
> I used to run it in Hadoop. It is fairly complex and relies on a lot of
> utility Java classes I wrote. Can I reuse the map function in Java and port
> it into Spark?
>
> Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
> From: Matei Zaharia <matei.zaha...@gmail.com>
> To: user@spark.apache.org
> Date: 06/04/2014 04:28 PM
> Subject: Re: reuse hadoop code in Spark
>
> Yes, you can write some glue in Spark to call these. Some functions to look
> at:
>
> - SparkContext.hadoopRDD lets you create an input RDD from an existing
>   JobConf configured by Hadoop (including InputFormat, paths, etc.)
> - RDD.mapPartitions lets you operate on all the values in one partition
>   (block) at a time, similar to how Mappers in MapReduce work
> - PairRDDFunctions.reduceByKey and groupByKey can be used for aggregation
> - RDD.pipe() can be used to call out to a script or binary, like Hadoop
>   Streaming
>
> A fair number of people have been running both Java and Hadoop Streaming apps
> like this.
>
> Matei
>
> On Jun 4, 2014, at 1:08 PM, Wei Tan <w...@us.ibm.com> wrote:
>
> Hello,
>
> I am trying to use Spark in the following scenario:
>
> I have code written in Hadoop and am now trying to migrate to Spark. The
> mappers and reducers are fairly complex, so I wonder if I can reuse the map()
> functions I already wrote in Hadoop (Java) and use Spark to chain them,
> mixing the Java map() functions with Spark operators?
>
> A related question: can I use a binary as an operator, like Hadoop
> Streaming?
>
> Thanks!
> Wei
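
For what it's worth, here is a fuller, self-contained sketch of the pattern above. It assumes myown.Mapper implements the old org.apache.hadoop.mapred.Mapper[LongWritable, Text, Text, Text] interface; the input/output types, paths, and app name are illustrative, not prescribed by the thread:

import scala.collection.mutable.ArrayBuffer

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, OutputCollector, Reporter, TextInputFormat}
import org.apache.spark.SparkContext

object ReuseHadoopMapper {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "reuse-hadoop-mapper")

    // Build the JobConf the same way the old Hadoop job did.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, args(0))

    // Read the input through the existing InputFormat, as with SparkContext.hadoopRDD.
    val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    val mapped = rdd.mapPartitions { iterator =>
      // One Mapper instance per partition, as in the snippet above.
      // Note: JobConf is not serializable, so a fresh one is created here;
      // if the mapper needs driver-side settings, ship them in a serializable form.
      val mapper = new myown.Mapper()
      mapper.configure(new JobConf())

      // OutputCollector that buffers (key, value) pairs in an ArrayBuffer.
      val buffer = new ArrayBuffer[(Text, Text)]()
      val output = new OutputCollector[Text, Text] {
        override def collect(key: Text, value: Text): Unit =
          // Copy the Writables, since Hadoop code commonly reuses these objects.
          buffer += ((new Text(key), new Text(value)))
      }

      for ((key, value) <- iterator) {
        mapper.map(key, value, output, Reporter.NULL)
      }
      mapper.close()
      buffer.iterator
    }

    mapped.saveAsTextFile(args(1))
    sc.stop()
  }
}

From there the pairs can be fed into reduceByKey/groupByKey to mimic the reduce side, or written out directly as above.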