Oh sorry, just to be clearer - writing from the driver program is only safe if the amount of data you are trying to write is small enough to fit in memory on the driver. I looked at your code, and since you are just writing a few things each time interval, this seems safe.
-Vida

On Mon, Aug 18, 2014 at 4:25 PM, Vida Ha <v...@databricks.com> wrote:

> Hi John,
>
> It seems like the original problem you had was that you were initializing
> the RabbitMQ connection on the driver, but then calling the code to write
> to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't
> see your code). That's definitely a problem, because the connection can't
> be serialized and passed to the workers.
>
> dstream.foreachRDD(rdd => {
>   // Create a connection / channel to the source on the driver.
>   rdd.foreach(
>     // This tries to write to RabbitMQ from the worker - a problem,
>     // since the connection cannot be passed to the workers.
>     element => // write using the channel
>   )
> })
>
> You then changed the code to open the connection on the workers, and write
> out the data once per worker. This worked, but as you saw, you are writing
> out the data multiple times.
>
> dstream.foreachRDD(rdd => {
>   rdd.foreachPartition(iterator => {
>     // Create or get a singleton instance of a connection / channel.
>     iterator.foreach(element => // write using the connection / channel
>     )
>     // This works, but writes once per partition.
>   })
> })
>
> If I understand correctly, I believe what you want is to open the
> connection on the driver, and write to RabbitMQ from the driver as well.
>
> dstream.foreachRDD(rdd => {
>   // Any code you put here is executed on the driver.
>   rdd.foreach(
>     // Any code inside rdd.foreach is executed on the workers.
>   )
>
>   // This code will be executed back on the driver again.
>   // You can open the connection and write the stats here.
> })
>
> Does this clear things up? I hope it's now clear when code in the
> streaming example executes on the driver vs. on the workers.
>
> -Vida
>
> On Mon, Aug 18, 2014 at 2:20 PM, jschindler <john.schind...@utexas.edu> wrote:
>
>> Well, it looks like I can use the .repartition(1) method to stuff
>> everything into one partition, which gets rid of the duplicate messages I
>> send to RabbitMQ, but that seems like a bad idea. Wouldn't that hurt
>> scalability?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-RabbitMQ-tp11283p12324.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
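
[Editor's note: a minimal sketch of the driver-side write pattern described above, assuming a Spark Streaming app in Scala with the RabbitMQ Java client on the classpath. The host name, queue name, and the `dstream` variable are placeholders, not from the original thread.]

```scala
import com.rabbitmq.client.ConnectionFactory

// Executed once, on the driver. The connection and channel never leave
// the driver JVM, so there is no serialization problem.
val factory = new ConnectionFactory()
factory.setHost("localhost")                              // placeholder host
val connection = factory.newConnection()
val channel = connection.createChannel()
channel.queueDeclare("stats", false, false, false, null)  // placeholder queue

dstream.foreachRDD { rdd =>
  // This body runs on the driver once per batch interval.
  // rdd.count() is a distributed action: the work happens on the workers,
  // but only the small result comes back to the driver.
  val count = rdd.count()

  // Safe to write from here because the payload is tiny and fits in
  // driver memory, as discussed above.
  channel.basicPublish("", "stats", null, s"batch count: $count".getBytes("UTF-8"))
}
```

This avoids both the serialization error (no connection is shipped in a closure) and the duplicate writes (nothing is published per partition), at the cost of only being suitable for small, aggregated results.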