Thanks, Sean.

"are you asking about foreach vs foreachPartition? that's quite different. foreachPartition does not give more parallelism but lets you operate on a whole batch of data at once, which is nice if you need to allocate some expensive resource to do the processing"
This is basically what I was looking for.

On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen <so...@cloudera.com> wrote:

> @Evo There is no foreachRDD operation on RDDs; it is a method of
> DStream. It gives each RDD in the stream. RDD has a foreach, and
> foreachPartition. These give elements of an RDD. What do you mean it
> 'works' to call foreachRDD on an RDD?
>
> @Dmitry are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing.
>
> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
> <dgoldenberg...@gmail.com> wrote:
> > "These are quite different operations. One operates on RDDs in DStream and
> > one operates on partitions of an RDD. They are not alternatives."
> >
> > Sean, different operations as they are, they can certainly be used on the
> > same data set. In that sense, they are alternatives. Code can be written
> > using one or the other which reaches the same effect - likely at a different
> > efficiency cost.
> >
> > The question is, what are the effects of applying one vs. the other?
> >
> > My specific scenario is, I'm streaming data out of Kafka. I want to perform
> > a few transformations then apply an action which results in e.g. writing
> > this data to Solr. According to Evo, my best bet is foreachPartition
> > because of increased parallelism (which I'd need to grok to understand the
> > details of what that means).
> >
> > Another scenario is, I've done a few transformations and send a result
> > somewhere, e.g. I write a message into a socket. Let's say I have one
> > socket per a client of my streaming app and I get a host:port of that socket
> > as part of the message and want to send the response via that socket. Is
> > foreachPartition still a better choice?
> >
> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> These are quite different operations. One operates on RDDs in DStream and
> >> one operates on partitions of an RDD. They are not alternatives.
> >>
> >>
> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
> >>>
> >>> Is there a set of best practices for when to use foreachPartition vs.
> >>> foreachRDD?
> >>>
> >>> Is it generally true that using foreachPartition avoids some of the
> >>> over-network data shuffling overhead?
> >>>
> >>> When would I definitely want to use one method vs. the other?
> >>>
> >>> Thanks.
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
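
For reference, the point about allocating an expensive resource once per batch can be sketched in plain Python, with no Spark involved. Everything named here (FakeConnection, write_each, write_partition, count_connections) is hypothetical and only imitates the calling convention: a foreach-style function receives one element, while a foreachPartition-style function receives an iterator over a whole partition. The loop runs partitions one after another, so it says nothing about parallelism, only about how many times the resource gets created.

```python
from typing import Iterator, List


class FakeConnection:
    """Hypothetical stand-in for an expensive resource (e.g. a Solr or socket client)."""

    opened = 0  # class-level counter of how many connections were created

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.sent: List[str] = []

    def send(self, record: str) -> None:
        self.sent.append(record)

    def close(self) -> None:
        pass  # a real client would release its socket or pool here


def write_each(record: str) -> None:
    # foreach-style: the function only ever sees one element,
    # so the resource is created once per element.
    conn = FakeConnection()
    try:
        conn.send(record)
    finally:
        conn.close()


def write_partition(records: Iterator[str]) -> None:
    # foreachPartition-style: the function sees the whole partition,
    # so the resource is created once and reused for every element.
    conn = FakeConnection()
    try:
        for record in records:
            conn.send(record)
    finally:
        conn.close()


def count_connections(partitions: List[List[str]], per_element: bool) -> int:
    # Sequential simulation of running the function over every partition.
    FakeConnection.opened = 0
    for part in partitions:
        if per_element:
            for record in part:
                write_each(record)
        else:
            write_partition(iter(part))
    return FakeConnection.opened


# Assumed sample data: two partitions holding five records in total.
partitions = [["a", "b", "c"], ["d", "e"]]
per_element_count = count_connections(partitions, per_element=True)
per_partition_count = count_connections(partitions, per_element=False)
print(per_element_count, per_partition_count)
```

In PySpark the same two functions would be passed as rdd.foreach(write_each) and rdd.foreachPartition(write_partition); inside a DStream's foreachRDD, the per-partition form is the one that keeps the connection count down to one per partition per batch.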