Having the driver write the data instead of a worker probably won't speed it up: you still need to copy all of the data to a single node. Is there something that forces you to write from only a single node?
On Friday, September 11, 2015, Luca <luke1...@gmail.com> wrote:

> Hi,
> thanks for answering.
>
> With the coalesce() transformation a single worker is in charge of
> writing to HDFS, but I noticed that the single write operation usually
> takes too much time, slowing down the whole computation (this is
> particularly true when 'unified' is made of several partitions). Besides,
> coalesce() forces me to perform a further repartitioning (the 'true'
> shuffle flag) in order not to lose upstream parallelism (by the way,
> did I get this part right?).
> Am I wrong in thinking that having the driver do the writing will speed
> things up, without the need to repartition the data?
>
> Hope I have been clear, I am pretty new to Spark. :)
>
> 2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:
>
>> A common practice to do this is to use foreachRDD with a local var to
>> accumulate the data (you can see it in the Spark Streaming test code).
>>
>> That being said, I am a little curious why you want the driver to
>> create the file specifically.
>>
>> On Friday, September 11, 2015, allonsy <luke1...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a JavaPairDStream<Integer, String> object and I'd like the
>>> Driver to create a txt file (on HDFS) containing all of its elements.
>>>
>>> At the moment, I use the coalesce(1, true) method:
>>>
>>> JavaPairDStream<Integer, String> unified = [partitioned stuff]
>>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>>>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>>>         return null;
>>>     }
>>> });
>>>
>>> but this implies that a single worker is taking all the data and
>>> writing to HDFS, and that could be a major bottleneck.
>>>
>>> How could I replace the worker with the Driver?
>>> I read that collect() might do this, but I haven't the slightest idea
>>> how to implement it.
>>>
>>> Can anybody help me?
>>>
>>> Thanks in advance.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
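For the archives, here is one way the collect()-based approach the thread asks about could look. This is a sketch, not code tested against a cluster: inside foreachRDD you would call rdd.collect() to pull every pair to the driver, then write them out with ordinary Java I/O. To keep the example dependency-free and runnable, Map.Entry stands in for Spark's Tuple2, the helper name writePairs and the tab-separated format are illustrative choices, and java.nio is used in place of the Hadoop FileSystem API you would need for an actual HDFS path. Note it has exactly the bottleneck described above: every record still travels to a single JVM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DriverSideWrite {

    // Write collected (key, value) pairs as tab-separated lines.
    // In the Spark job this list would come from rdd.collect() inside
    // foreachRDD (which returns List<Tuple2<Integer, String>> in the
    // Java API); Map.Entry is a stand-in so this compiles with the JDK
    // alone. Writing to HDFS rather than a local path would go through
    // the Hadoop FileSystem API instead of java.nio.
    static void writePairs(List<Map.Entry<Integer, String>> pairs, Path out)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, String> e : pairs) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Simulate the pairs that collect() would have brought to the driver.
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>(1, "alpha"));
        pairs.add(new SimpleEntry<>(2, "beta"));

        Path out = Files.createTempFile("unified", ".txt");
        writePairs(pairs, out);
        System.out.println(String.join("\n", Files.readAllLines(out)));
    }
}
```

If the single-file requirement is only about the final layout on HDFS, an alternative that avoids funneling data through one JVM is to let every partition write in parallel with plain saveAsTextFile and merge the part files afterwards.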