Use RDD.mapPartitions to go over all the items in a partition with one Mapper 
object. It will look something like this:

rdd.mapPartitions { iterator =>
  val mapper = new myown.Mapper()
  mapper.configure(conf)  // conf: the JobConf for your job
  val output = // {{create an OutputCollector that appends each (key, value) pair to an ArrayBuffer}}
  for ((key, value) <- iterator) {
    mapper.map(key, value, output, Reporter.NULL)
  }
  // return the buffered pairs as an Iterator; mapPartitions expects an Iterator back
  output
}
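
Concretely, the glue could look like the sketch below. This is only illustrative: it assumes the input is plain text on HDFS read with the old-API TextInputFormat, that myown.Mapper implements the org.apache.hadoop.mapred.Mapper interface with (LongWritable, Text) input and (Text, IntWritable) output, and that sc is your SparkContext; the path and the types are placeholders for whatever your job actually uses.

import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, OutputCollector, Reporter, TextInputFormat}
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

// 1. Build the input RDD from an existing Hadoop JobConf / InputFormat.
val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")   // placeholder path
val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                         classOf[LongWritable], classOf[Text])

// 2. Run the existing Java Mapper once per partition.
val mapped = input.mapPartitions { iter =>
  val mapper = new myown.Mapper()
  mapper.configure(new JobConf())   // JobConf isn't Java-serializable, so build it on the worker
  val buffer = new ArrayBuffer[(Text, IntWritable)]()
  val output = new OutputCollector[Text, IntWritable] {
    // Hadoop reuses Writable objects, so copy them before buffering
    def collect(key: Text, value: IntWritable): Unit =
      buffer += ((new Text(key), new IntWritable(value.get)))
  }
  for ((key, value) <- iter) {
    mapper.map(key, value, output, Reporter.NULL)
  }
  buffer.iterator
}

// 3. Replace the Reducer with a Spark aggregation; convert the Writables to
//    serializable types before shuffling.
val counts = mapped
  .map { case (k, v) => (k.toString, v.get) }
  .reduceByKey(_ + _)

If you also want to reuse a Reducer class, the same trick applies: groupByKey (or reduceByKey) in Spark and feed each key's values to the Java reduce() method through a similar OutputCollector shim.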

On Jun 5, 2014, at 8:12 AM, Wei Tan <w...@us.ibm.com> wrote:

> Thanks Matei. 
> 
> Using your pointers I can import data from HDFS. What I want to do now is 
> something like this in Spark: 
> 
> ----------------------- 
> import myown.mapper 
> 
> rdd.map (mapper.map) 
> ----------------------- 
> 
> The reason I want this: myown.mapper is a Java class I already developed and 
> used to run in Hadoop. It is fairly complex and relies on a lot of 
> utility Java classes I wrote. Can I reuse the map() function written in Java 
> and port it to Spark? 
> 
> Best regards, 
> Wei 
> 
> 
> --------------------------------- 
> Wei Tan, PhD 
> Research Staff Member 
> IBM T. J. Watson Research Center 
> http://researcher.ibm.com/person/us-wtan 
> 
> 
> 
> From:        Matei Zaharia <matei.zaha...@gmail.com> 
> To:        user@spark.apache.org, 
> Date:        06/04/2014 04:28 PM 
> Subject:        Re: reuse hadoop code in Spark 
> 
> 
> 
> Yes, you can write some glue in Spark to call these. Some functions to look 
> at: 
> 
> - SparkContext.hadoopRDD lets you create an input RDD from an existing 
> JobConf configured by Hadoop (including InputFormat, paths, etc) 
> - RDD.mapPartitions lets you operate on all the values in one partition 
> (block) at a time, similar to how Mappers in MapReduce work 
> - PairRDDFunctions.reduceByKey and groupByKey can be used for aggregation. 
> - RDD.pipe() can be used to call out to a script or binary, like Hadoop 
> Streaming. 
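> 
> A minimal RDD.pipe sketch, purely illustrative (the binary name, input path, 
> and tab-separated record format are assumptions; pipe just writes each element 
> to the child process's stdin as a line and returns its stdout as an RDD of 
> lines, much like Hadoop Streaming): 
> 
> val lines  = sc.textFile("hdfs:///path/to/input")   // placeholder path 
> val piped  = lines.pipe("./my-mapper-binary")       // hypothetical executable; it must be available on the workers 
> val counts = piped 
>   .map { line => val Array(k, v) = line.split("\t"); (k, v.toInt) } 
>   .reduceByKey(_ + _) 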
> 
> A fair number of people have been running both Java and Hadoop Streaming apps 
> like this. 
> 
> Matei 
> 
> On Jun 4, 2014, at 1:08 PM, Wei Tan <w...@us.ibm.com> wrote: 
> 
> Hello, 
> 
>  I am trying to use Spark in the following scenario: 
> 
>  I have code written for Hadoop and am now trying to migrate it to Spark. The 
> mappers and reducers are fairly complex, so I wonder if I can reuse the map() 
> functions I already wrote in Hadoop (Java) and use Spark to chain them, 
> mixing the Java map() functions with Spark operators? 
> 
>  Another related question: can I use binaries as operators, like Hadoop 
> Streaming? 
> 
>  Thanks! 
> Wei 
> 
> 
> 
