This is also discussed in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
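
To make Sean's point below concrete, here is a rough sketch in plain Python. It is not Spark code and not taken from the guide; it only models the cost difference between creating an expensive resource inside a per-element loop (as one would with rdd.foreach) and creating it once per partition (as with rdd.foreachPartition). FakeConnection, per_element, per_partition and count_connections are hypothetical names invented for this illustration, and the nested lists stand in for an RDD's partitions.

```python
# Plain-Python model (NOT Spark code) of per-element vs. per-partition setup cost.
# FakeConnection is a hypothetical stand-in for an expensive resource
# such as a Solr client or a socket.

class FakeConnection:
    opened = 0  # class-level counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)

    def close(self):
        pass


def per_element(partitions):
    """Models creating the connection inside a per-record function."""
    for partition in partitions:
        for record in partition:
            conn = FakeConnection()
            conn.send(record)
            conn.close()


def per_partition(partitions):
    """Models creating the connection once per partition and reusing it."""
    for partition in partitions:
        conn = FakeConnection()
        for record in partition:
            conn.send(record)
        conn.close()


def count_connections(fn, partitions):
    """Runs fn over the partitions and returns how many connections it opened."""
    FakeConnection.opened = 0
    fn(partitions)
    return FakeConnection.opened


# Three "partitions" holding six records in total.
partitions = [[1, 2, 3], [4, 5], [6]]
```

Calling count_connections(per_element, partitions) and count_connections(per_partition, partitions) on this data gives one connection per record in the first case and one per partition in the second. Both versions walk the same partitions, so the saving is in setup cost, not in parallelism.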

On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> Thanks, Sean.
>
> "are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing"
>
> This is basically what I was looking for.
>
>
> On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> @Evo There is no foreachRDD operation on RDDs; it is a method of
>> DStream. It gives each RDD in the stream. RDD has a foreach, and
>> foreachPartition. These give elements of an RDD. What do you mean it
>> 'works' to call foreachRDD on an RDD?
>>
>> @Dmitry are you asking about foreach vs foreachPartition? that's quite
>> different. foreachPartition does not give more parallelism but lets
>> you operate on a whole batch of data at once, which is nice if you
>> need to allocate some expensive resource to do the processing.
>>
>> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
>> <dgoldenberg...@gmail.com> wrote:
>> > "These are quite different operations. One operates on RDDs in DStream and
>> > one operates on partitions of an RDD. They are not alternatives."
>> >
>> > Sean, different operations as they are, they can certainly be used on the
>> > same data set. In that sense, they are alternatives. Code can be written
>> > using one or the other which reaches the same effect - likely at a different
>> > efficiency cost.
>> >
>> > The question is, what are the effects of applying one vs. the other?
>> >
>> > My specific scenario is, I'm streaming data out of Kafka. I want to perform
>> > a few transformations then apply an action which results in e.g. writing
>> > this data to Solr. According to Evo, my best bet is foreachPartition
>> > because of increased parallelism (which I'd need to grok to understand the
>> > details of what that means).
>> >
>> > Another scenario is, I've done a few transformations and send a result
>> > somewhere, e.g. I write a message into a socket. Let's say I have one
>> > socket per a client of my streaming app and I get a host:port of that socket
>> > as part of the message and want to send the response via that socket. Is
>> > foreachPartition still a better choice?
>> >
>> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> These are quite different operations. One operates on RDDs in DStream and
>> >> one operates on partitions of an RDD. They are not alternatives.
>> >>
>> >>
>> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
>> >>>
>> >>> Is there a set of best practices for when to use foreachPartition vs.
>> >>> foreachRDD?
>> >>>
>> >>> Is it generally true that using foreachPartition avoids some of the
>> >>> over-network data shuffling overhead?
>> >>>
>> >>> When would I definitely want to use one method vs. the other?
>> >>>
>> >>> Thanks.
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.