The second choice is better. Once you call collect(), you pull all of the data onto a single node (the driver); you want to do most of the processing in parallel on the cluster, which is what map() does. Ideally you'd summarize or reduce the data before calling collect(), so only a small result ever reaches the driver.
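For example, a rough sketch of that pattern (parse() here is just a stand-in for whatever per-event work you need, and kafkaStream is the DStream from your snippet):

    import org.apache.spark.SparkContext._  // pair-RDD implicits for reduceByKey (pre-1.3)

    kafkaStream.foreachRDD { rdd =>
      val summary = rdd
        .map(event => (parse(event), 1L))  // per-event work runs in parallel on the executors
        .reduceByKey(_ + _)                // aggregation also happens on the cluster
      // Only the small (key, count) summary crosses the network to the driver
      summary.collect().foreach { case (key, n) => println(s"$key -> $n") }
    }

And if process(event) is purely a side effect (say, writing to an external store), you can skip collect() entirely and run rdd.foreach(process), which executes on the executors as well.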
On Fri, Dec 5, 2014 at 5:26 AM, david <david...@free.fr> wrote:
> hi,
>
> What is the best way to process a batch window in Spark Streaming:
>
> kafkaStream.foreachRDD(rdd => {
>   rdd.collect().foreach(event => {
>     // process the event
>     process(event)
>   })
> })
>
> Or
>
> kafkaStream.foreachRDD(rdd => {
>   rdd.map(event => {
>     // process the event
>     process(event)
>   }).collect()
> })
>
> thanks