Patrick, I was wondering why one would choose rdd.map vs rdd.foreach to execute a side-effecting function on an RDD.
-kr, Gerard.

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell <[email protected]> wrote:
>
> The second choice is better. Once you call collect() you are pulling
> all of the data onto a single node; you want to do most of the
> processing in parallel on the cluster, which is what map() will do.
> Ideally you'd try to summarize the data or reduce it before calling
> collect().
>
> On Fri, Dec 5, 2014 at 5:26 AM, david <[email protected]> wrote:
> > Hi,
> >
> > What is the best way to process a batch window in Spark Streaming:
> >
> > kafkaStream.foreachRDD(rdd => {
> >   rdd.collect().foreach(event => {
> >     // process the event
> >     process(event)
> >   })
> > })
> >
> > Or
> >
> > kafkaStream.foreachRDD(rdd => {
> >   rdd.map(event => {
> >     // process the event
> >     process(event)
> >   }).collect()
> > })
> >
> > Thanks
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
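To make the map-vs-foreach distinction concrete: map() is a lazy transformation, so calling it purely for side effects only "works" because the trailing collect() forces evaluation, and collect() then ships the (useless) results to the driver. foreach() is an action that runs the side effect directly on the executors. A minimal sketch, assuming the kafkaStream DStream and process() function from the thread above; createClient() and the two-argument process(client, event) are hypothetical helpers for illustration:

```scala
// Run the side effect on the workers, in parallel -- no collect() needed.
kafkaStream.foreachRDD { rdd =>
  rdd.foreach { event =>
    process(event) // executes on the executors, not the driver
  }
}

// If process() needs expensive setup (e.g. a DB or HTTP client),
// foreachPartition amortizes it once per partition instead of once
// per record. createClient() is a hypothetical setup helper.
kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    val client = createClient()
    events.foreach(event => process(client, event))
    client.close()
  }
}
```

The foreachPartition variant is the common pattern when the side effect talks to an external system, since creating one connection per record is usually prohibitive.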
