Thanks, Sean.

"are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing"

This is basically what I was looking for.
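
To make the point about expensive resources concrete, here is a minimal sketch in plain Python. It does not use Spark: the partitions are ordinary lists, and `ExpensiveClient`, `for_each` and `for_each_partition` are hypothetical stand-ins I made up for illustration, not Spark APIs. It only counts how often the costly resource gets allocated under each style.

```python
from typing import Callable, Iterator, List


class ExpensiveClient:
    """Hypothetical stand-in for a costly resource (e.g. a Solr or socket connection)."""

    opened = 0  # how many times the resource has been allocated

    def __init__(self) -> None:
        ExpensiveClient.opened += 1
        self.sent: List[int] = []

    def send(self, record: int) -> None:
        self.sent.append(record)

    def close(self) -> None:
        pass  # a real client would release its connection here


def for_each(partitions: List[List[int]], fn: Callable[[int], None]) -> None:
    """Simulated foreach: fn is called once per record."""
    for partition in partitions:
        for record in partition:
            fn(record)


def for_each_partition(
    partitions: List[List[int]], fn: Callable[[Iterator[int]], None]
) -> None:
    """Simulated foreachPartition: fn is called once per partition with an iterator."""
    for partition in partitions:
        fn(iter(partition))


def write_record(record: int) -> None:
    # Per-record style: a new client is allocated for every single record.
    client = ExpensiveClient()
    try:
        client.send(record)
    finally:
        client.close()


def write_partition(records: Iterator[int]) -> None:
    # Per-partition style: one client is shared by the whole batch.
    client = ExpensiveClient()
    try:
        for record in records:
            client.send(record)
    finally:
        client.close()


# Assumed toy data: 9 records spread over 3 partitions.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

ExpensiveClient.opened = 0
for_each(partitions, write_record)
per_record_opens = ExpensiveClient.opened

ExpensiveClient.opened = 0
for_each_partition(partitions, write_partition)
per_partition_opens = ExpensiveClient.opened
```

With this toy data the per-record version allocates the client once per record and the per-partition version once per partition. The amount of work done in parallel is the same in both cases; only the allocation cost changes.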


On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen <so...@cloudera.com> wrote:

> @Evo There is no foreachRDD operation on RDDs; it is a method of
> DStream. It gives each RDD in the stream. RDD has a foreach, and
> foreachPartition. These give elements of an RDD. What do you mean it
> 'works' to call foreachRDD on an RDD?
>
> @Dmitry are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing.
>
> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
> <dgoldenberg...@gmail.com> wrote:
> > "These are quite different operations. One operates on RDDs in DStream
> > and one operates on partitions of an RDD. They are not alternatives."
> >
> > Sean, different operations as they are, they can certainly be used on the
> > same data set.  In that sense, they are alternatives. Code can be written
> > using one or the other which reaches the same effect - likely at a
> > different efficiency cost.
> >
> > The question is, what are the effects of applying one vs. the other?
> >
> > My specific scenario is, I'm streaming data out of Kafka.  I want to
> > perform a few transformations then apply an action which results in e.g.
> > writing this data to Solr.  According to Evo, my best bet is
> > foreachPartition because of increased parallelism (which I'd need to grok
> > to understand the details of what that means).
> >
> > Another scenario is, I've done a few transformations and then send a
> > result somewhere, e.g. I write a message into a socket.  Let's say I have
> > one socket per client of my streaming app and I get a host:port of that
> > socket as part of the message and want to send the response via that
> > socket.  Is foreachPartition still a better choice?
> >
> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> These are quite different operations. One operates on RDDs in DStream
> >> and one operates on partitions of an RDD. They are not alternatives.
> >>
> >>
> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
> >>>
> >>> Is there a set of best practices for when to use foreachPartition vs.
> >>> foreachRDD?
> >>>
> >>> Is it generally true that using foreachPartition avoids some of the
> >>> over-network data shuffling overhead?
> >>>
> >>> When would I definitely want to use one method vs. the other?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>>
> >>>
> >
>
