Patrick,

I was wondering why one would choose rdd.map vs rdd.foreach to execute
a side-effecting function on an RDD.

-kr, Gerard.
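
To make the question concrete, a minimal sketch of the two calls (the `process` function and the RDD contents are illustrative, not from the thread):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical side-effecting function, e.g. writing to an external store.
def process(event: String): Unit = println(event)

def run(rdd: RDD[String]): Unit = {
  // foreach is an action: it runs process() on the executors right away,
  // purely for its side effects, and returns Unit to the driver.
  rdd.foreach(event => process(event))

  // map is a lazy transformation: on its own it does nothing until an
  // action such as collect() forces evaluation, and it also materializes
  // a (here useless) result for the driver.
  rdd.map(event => process(event)).collect()
}
```

The practical difference is that the side effects in the map version only happen because collect() forces the computation.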

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell <[email protected]> wrote:
>
> The second choice is better. Once you call collect() you are pulling
> all of the data onto a single node; you want to do most of the
> processing in parallel on the cluster, which is what map() will do.
> Ideally you'd try to summarize or reduce the data before calling
> collect().
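
A minimal sketch of that last point (the names and the (key, count) pair shape are illustrative assumptions):

```scala
import org.apache.spark.rdd.RDD

// Aggregate per key on the executors, then pull only the small summary
// to the driver, instead of collect()-ing every raw event.
def summarize(events: RDD[(String, Long)]): Map[String, Long] =
  events
    .reduceByKey(_ + _)   // runs in parallel on the cluster
    .collect()            // only the reduced pairs reach the driver
    .toMap
```

The collect() here is cheap because reduceByKey has already shrunk the data to one pair per key.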
>
> On Fri, Dec 5, 2014 at 5:26 AM, david <[email protected]> wrote:
> > hi,
> >
> >   What is the best way to process a batch window in Spark Streaming:
> >
> >     kafkaStream.foreachRDD(rdd => {
> >       rdd.collect().foreach(event => {
> >         // process the event
> >         process(event)
> >       })
> >     })
> >
> >
> > Or
> >
> >     kafkaStream.foreachRDD(rdd => {
> >       rdd.map(event => {
> >         // process the event
> >         process(event)
> >       }).collect()
> >     })
> >
> >
> > thanks
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>
